FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
379 stars 101 forks source link

No output produced from RRBS analysis #333

Closed talonsensei closed 4 years ago

talonsensei commented 4 years ago

Hello, new user here. Installed latest version of Bismark(0.22.3), HISAT2 (2.2.0), samtools (1.10). Running analysis on data downloaded from BaseSpace that was previously analysed with another pipeline, so I know what the results should be. Result from Bismark is empty bam file and no Cys methylation called. FastQC results attached.

It appears that the converted input files mate1_C_to_T.fastq and mate2_G_to_A.fastq are empty. So everything downstream fails.

Any recommendations are appreciated. -Thanks

fastqc_files.zip

Run command: bismark --hisat2 --genome_folder /data/grch37 -1 mate1 -2 mate2

mate1 contents: 9-14414-YES-STUDY_S9_L001_R1_001.fastq.gz mate2 contents: 9-14414-YES-STUDY_S9_L001_R2_001.fastq.gz

Path to HISAT2 specified as: hisat2 HISAT2 seems to be working fine (tested command 'hisat2 --version' [2.2.0]) Output format is BAM (default) Alignments will be written out in BAM format. Samtools found here: '/usr/local/bin/samtools' Reference genome folder provided is /data/grch37/ (absolute path is '/data/grch37/)' FastQ format assumed (by default)

Input files to be analysed (in current folder '/data/test2/9_L001_ds.8c5c9cd15cc44b9783f7208281a30203'): mate1 mate2 Library is assumed to be strand-specific (directional), alignments to strands complementary to the original top or bottom strands will be ignored (i.e. not performed!)Setting parallelization to single-threaded (default)

Summary of all aligner options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --no-softclip --omit-sec-seq Current working directory is: /data/test2/9_L001_ds.8c5c9cd15cc44b9783f7208281a30203

Now reading in and storing sequence information of the genome specified in: /data/grch37/

chr 1 (249250621 bp) chr 10 (135534747 bp) chr 11 (135006516 bp) chr 12 (133851895 bp) chr 13 (115169878 bp) chr 14 (107349540 bp) chr 15 (102531392 bp) chr 16 (90354753 bp) chr 17 (81195210 bp) chr 18 (78077248 bp) chr 19 (59128983 bp) chr 2 (243199373 bp) chr 20 (63025520 bp) chr 21 (48129895 bp) chr 22 (51304566 bp) chr 3 (198022430 bp) chr 4 (191154276 bp) chr 5 (180915260 bp) chr 6 (171115067 bp) chr 7 (159138663 bp) chr 8 (146364022 bp) chr 9 (141213431 bp) chr MT (16569 bp) chr X (155270560 bp) chr Y (59373566 bp) chr GL000192.1 (547496 bp) chr GL000225.1 (211173 bp) chr GL000194.1 (191469 bp) chr GL000193.1 (189789 bp) chr GL000200.1 (187035 bp) chr GL000222.1 (186861 bp) chr GL000212.1 (186858 bp) chr GL000195.1 (182896 bp) chr GL000223.1 (180455 bp) chr GL000224.1 (179693 bp) chr GL000219.1 (179198 bp) chr GL000205.1 (174588 bp) chr GL000215.1 (172545 bp) chr GL000216.1 (172294 bp) chr GL000217.1 (172149 bp) chr GL000199.1 (169874 bp) chr GL000211.1 (166566 bp) chr GL000213.1 (164239 bp) chr GL000220.1 (161802 bp) chr GL000218.1 (161147 bp) chr GL000209.1 (159169 bp) chr GL000221.1 (155397 bp) chr GL000214.1 (137718 bp) chr GL000228.1 (129120 bp) chr GL000227.1 (128374 bp) chr GL000191.1 (106433 bp) chr GL000208.1 (92689 bp) chr GL000198.1 (90085 bp) chr GL000204.1 (81310 bp) chr GL000233.1 (45941 bp) chr GL000237.1 (45867 bp) chr GL000230.1 (43691 bp) chr GL000242.1 (43523 bp) chr GL000243.1 (43341 bp) chr GL000241.1 (42152 bp) chr GL000236.1 (41934 bp) chr GL000240.1 (41933 bp) chr GL000206.1 (41001 bp) chr GL000232.1 (40652 bp) chr GL000234.1 (40531 bp) chr GL000202.1 (40103 bp) chr GL000238.1 (39939 bp) chr GL000244.1 (39929 bp) chr GL000248.1 (39786 bp) chr GL000196.1 (38914 bp) chr GL000249.1 (38502 bp) chr GL000246.1 (38154 bp) chr GL000203.1 (37498 bp) chr GL000197.1 (37175 bp) chr GL000245.1 (36651 bp) chr GL000247.1 (36422 bp) chr GL000201.1 (36148 bp) chr GL000235.1 (34474 bp) chr GL000239.1 (33824 bp) chr GL000210.1 (27682 bp) chr GL000231.1 (27386 bp) chr GL000229.1 (19913 bp) chr GL000226.1 (15008 bp) chr GL000207.1 (4262 bp)

Single-core mode: setting pid to 1

Paired-end alignments will be performed

The provided filenames for paired-end alignments are mate1 and mate2 Input files are in FastQ format Writing a C -> T converted version of the input file mate1 to mate1_C_to_T.fastq

Created C -> T converted version of the FastQ file mate1 (0 sequences in total)

Writing a G -> A converted version of the input file mate2 to mate2_G_to_A.fastq

Created G -> A converted version of the FastQ file mate2 (0 sequences in total)

Input files are mate1_C_to_T.fastq and mate2_G_to_A.fastq (FastQ) Now running 2 instances of HISAT2 against the bisulfite genome of /data/grch37/ with the specified options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --no-softclip --omit-sec-seq

Now starting a HISAT2 paired-end alignment for CTread1GAread2CTgenome (reading in sequences from mate1_C_to_T.fastq and mate2_G_to_A.fastq, with the options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --no-softclip --omit-sec-seq --norc)) 0 reads 0.00% overall alignment rate Found no alignment, assigning undef to last_seq_id and last_lines Now starting a HISAT2 paired-end alignment for CTread1GAread2GAgenome (reading in sequences from mate1_C_to_T.fastq and mate2_G_to_A.fastq, with the options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --maxins 500 --no-softclip --omit-sec-seq --nofw)) 0 reads 0.00% overall alignment rate Found no alignment, assigning undef to last_seq_id and last_lines

Writing bisulfite mapping results to mate1_bismark_hisat2_pe.bam <<<

Reading in the sequence files mate1 and mate2 Processed 0 sequences in total

Successfully deleted the temporary files mate1_C_to_T.fastq and mate2_G_to_A.fastq

Final Alignment report

Sequence pairs analysed in total: 0 Number of paired-end alignments with a unique best hit: 0 Mapping efficiency: 0%

Sequence pairs with no alignments under any condition: 0 Sequence pairs did not map uniquely: 0 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 0 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 0 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 0

Total methylated C's in CpG context: 0 Total methylated C's in CHG context: 0 Total methylated C's in CHH context: 0 Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 0 Total unmethylated C's in CHG context: 0 Total unmethylated C's in CHH context: 0 Total unmethylated C's in Unknown context: 0

Can't determine percentage of methylated Cs in CpG context if value was 0 Can't determine percentage of methylated Cs in CHG context if value was 0 Can't determine percentage of methylated Cs in CHH context if value was 0 Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0

Bismark completed in 0d 0h 0m 35s

==================== Bismark run complete

talonsensei commented 4 years ago

It does work if I put the actual sequence file names for the -1 and -2 option instead of trying to use a file of names. I guess the file of names options fails silently when there is only one. I also tried having 1 name followed by a comma (like in a python list ['foo', ] but that didn't work either.

Command run: bismark --hisat2 --genome_folder /data/grch37 -1 9-14414-YES-STUDY_S9_L001_R1_001.fastq.gz -2 9-14414- YES-STUDY_S9_L001_R2_001.fastq.gz

Result: Final Alignment report

Sequence pairs analysed in total: 223144 Number of paired-end alignments with a unique best hit: 10774 Mapping efficiency: 4.8%

Sequence pairs with no alignments under any condition: 212276 Sequence pairs did not map uniquely: 94 Sequence pairs which were discarded because genomic sequence could not be extracted: 0

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 6645 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 4129 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 545083

Total methylated C's in CpG context: 142195 Total methylated C's in CHG context: 377 Total methylated C's in CHH context: 472 Total methylated C's in Unknown context: 4

Total unmethylated C's in CpG context: 23053 Total unmethylated C's in CHG context: 144560 Total unmethylated C's in CHH context: 234426 Total unmethylated C's in Unknown context: 54

C methylated in CpG context: 86.0% C methylated in CHG context: 0.3% C methylated in CHH context: 0.2% C methylated in unknown context (CN or CHN): 6.9%

Bismark completed in 0d 0h 0m 54s

==================== Bismark run complete

FelixKrueger commented 4 years ago

Hi @talonsensei

Yes indeed, the input files need to be supplied like this

-1 file1_R1.fastq.gz -2 file1_R2.fastq.gz

If you want to align more than one sequence pair, you can use a comma-separated list like so:

-1 file1_R1.fastq.gz,file2_R1.fastq.gz,file3_R1.fastq.gz -2 file1_R2.fastq.gz,file2_R2.fastq.gz,file3_R2.fastq.gz

Or use a script or loop to run one sequence pair at a time.

What happened in your case was indeed that the files did not undergo C->T conversion at all as they only contained a single line (and the process stops if a sequence does not have all four lines required for a FastQ file):

Writing a C -> T converted version of the input file mate1 to mate1_C_to_T.fastq
Created C -> T converted version of the FastQ file mate1 (0 sequences in total)

Writing a G -> A converted version of the input file mate2 to mate2_G_to_A.fastq
Created G -> A converted version of the FastQ file mate2 (0 sequences in total)

By the way, do the results look sensible? I looked at the FastQC reports and must say that the files look very funky indeed! Which kind of sequencing is this?

talonsensei commented 4 years ago

Hi Felix,

Thanks for the clarification. As for the FastQC results, I'm not entirely sure to be honest, I just started doing this kind of analysis. This is targeted amplicon Bisulfite sequencing. What aspect of the result is funky? I think I sent you the original FastQC before it was trimmed using BBduk. That improved the Bismark results significantly, but I'm still not sure if they are "right"

Thanks,

Mark

On Tue, Apr 28, 2020 at 12:38 AM Felix Krueger notifications@github.com wrote:

Hi @talonsensei https://github.com/talonsensei

Yes indeed, the input files need to be supplied like this

-1 file1_R1.fastq.gz -2 file1_R2.fastq.gz

If you want to align more than one sequence pair, you can use a comma-separated list like so:

-1 file1_R1.fastq.gz,file2_R1.fastq.gz,file3_R1.fastq.gz -2 file1_R2.fastq.gz,file2_R2.fastq.gz,file3_R2.fastq.gz

Or use a script or loop to run one sequence pair at a time.

What happened in your case was indeed that the files did not undergo C->T conversion at all as they only contained a single line (and the process stops if a sequence does not have all four lines required for a FastQ file):

Writing a C -> T converted version of the input file mate1 to mate1_C_to_T.fastq Created C -> T converted version of the FastQ file mate1 (0 sequences in total)

Writing a G -> A converted version of the input file mate2 to mate2_G_to_A.fastq Created G -> A converted version of the FastQ file mate2 (0 sequences in total)

By the way, do the results look sensible? I looked at the FastQC reports and must say that the files look very funky indeed! Which kind of sequencing is this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FelixKrueger/Bismark/issues/333#issuecomment-620436001, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAYWWEF4P75PME737GHWPMDRO2BWTANCNFSM4MSJIOPA .

FelixKrueger commented 4 years ago

I suppose because since they came from BaseSpace the data had been (quite heavily) trimmed by the Illumina automated system before. Quite a lot of the sequences were short, the base composition was also 'interesting'. We tend to take the original (untrimmed) FastQ files, and run them through a default run of Trim Galore, which normally does exactly what you need.

Depending on the nature of the amplicon sequencing you might want to do a quick test run in --non_directional mode to see if there are any alignments coming from the CTOT or CTOB strands (that depends a little on the primer design). If you find there that most alignments come from the OT and OB strands (like in the example report above), then you should continue using the data as it is, i.e. with the default (=directional) processing.

Good luck!

talonsensei commented 4 years ago

Great, thank you so much, I will try your suggestion.

-Mark

On Apr 28, 2020, at 7:25 AM, Felix Krueger notifications@github.com wrote:

I suppose because since they came from BaseSpace the data had been (quite heavily) trimmed by the Illumina automated system before. Quite a lot of the sequences were short, the base composition was also 'interesting'. We tend to take the original (untrimmed) FastQ files, and run them through a default run of Trim Galore, which normally does exactly what you need.

Depending on the nature of the amplicon sequencing you might want to do a quick test run in --non_directional mode to see if there are any alignments coming from the CTOT or CTOB strands (that depends a little on the primer design). If you find there that most alignments come from the OT and OB strands (like in the example report above), then you should continue using the data as it is, i.e. with the default (=directional) processing.

Good luck!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/FelixKrueger/Bismark/issues/333#issuecomment-620641033, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAYWWEAPFBAFMKCAIXLMAQLRO3RNLANCNFSM4MSJIOPA.