FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

Bismark Alignment Inquiry #654

Open KeemAntonio opened 4 months ago

KeemAntonio commented 4 months ago

Hi Felix,

I have a couple of questions regarding Bismark. I'm sorry if they are stupid questions; I am fairly new to WGBS data and Bismark analysis (or to bioinformatics in general).

We are researching non-model fish animals. Recently, their genomes have become available in the NCBI database, and their assembly level is Scaffold (Additionally, it is still unannotated). Concurrently, there is another project in our laboratory to sequence our target organisms. I, in particular, have been tasked with analyzing WGBS data from our target animal. I already have the raw WGBS reads (Library type: Accel Methyl-Seq DNA library), and I have already trimmed them using Trim Galore (--paired --fastqc). While we are waiting for the genome of our target samples from our own laboratory (we had a few hiccups with our genome sequencing, which is why our WGBS reads came first), can I use a different genome (that is already annotated) from another species to align my WGBS reads?

I tried aligning our methylome reads to the genome of D. rerio using the default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), which resulted in a mapping efficiency of 0.1%. I also tried aligning my WGBS reads to the unannotated genome of our target species (the one I mentioned earlier that has been deposited to the NCBI database) using also default parameters (-q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)), and I only achieved 31% mapping efficiency. I am reading from your GitHub that I should use --score-min L,0,-0.6, which I will be trying. My main question is, is it alright to use the unannotated genome as my reference for Bismark alignment (I will also try aligning our methylome reads to the genome that was produced by our laboratory once it is finished)? And does Bismark only align methylome reads from the same species, or will my methylome reads be able to align to the genomes of closely related species?

Thank you

FelixKrueger commented 4 months ago

Hi Antonio,

Welcome to the world of methylation alignments! As a first comment, Accel Swift typically requires some specific trimming, namely

--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10

You can of course try to align to a close species at first to get some experience, and then perform it once more you've got a good reference sequence available. The issue that you have experienced already is that the more distantly related the species are, the lower the mapping efficiency will be. The default mapping parameters for Bismark are deliberately very strict, so increasing it to --score-min L,0,-0.6 will allow 3 times as many errors as the default mode.

As another issue, you should really re-do the trimming first, this should also greatly enhance the mapping efficiency (while reducing errors).

And finally, as soon as you have a reference sequence to align to, you can use Bismark, as it doesn't do anything with the annotations themselves. Once the annotation is at hand, you can rather use it as an additional layer for downstream analysis.

KeemAntonio commented 4 months ago

Hi Felix!

Thank you for your prompt response! we have re-done trimming using <--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10> in TrimGalore! we have seen significant improvement as you can see in per base sequence new trim we have also seen a significant increase on the mapping efficiency of one of our samples on one of our target organism (I am analyzing two different species of non-model fish) with a jump from ~30% to 51% using the newly trimmed reads (bismark Parameter: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)). we have also tried aligning using --score-min L,0,-0.4 and got an increase from our old trimmed sequences (57.7%) comparing it to new trimmed sequences using the suggested parameter (66.9%). Lastly, we also aligned using --score-min L,0,-0.6 and got an increase from 72.9% to 77.0%. Thank you for this!

I just have a few more questions. since I am working on another species of fish we tried doing the same steps/parameters for trimming (--clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10) but when we aligned it to its assembled genome that we got from NCBI database (note that we are still in the midst of our own genome assembly) we only got a 1.2% mapping efficiency using this parameter (-q --score-min L,0,-0.6 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)) here is the final report:

Final Alignment report

Sequence pairs analysed in total: 94422002 Number of paired-end alignments with a unique best hit: 1156101 Mapping efficiency: 1.2% Sequence pairs with no alignments under any condition: 93197502 Sequence pairs did not map uniquely: 68399 Sequence pairs which were discarded because genomic sequence could not be extracted: 4698

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 573176 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 578227 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 56000878

Total methylated C's in CpG context: 3779225 Total methylated C's in CHG context: 236041 Total methylated C's in CHH context: 537007 Total methylated C's in Unknown context: 15090

Total unmethylated C's in CpG context: 3543149 Total unmethylated C's in CHG context: 11928857 Total unmethylated C's in CHH context: 35976599 Total unmethylated C's in Unknown context: 292947

C methylated in CpG context: 51.6% C methylated in CHG context: 1.9% C methylated in CHH context: 1.5% C methylated in Unknown context (CN or CHN): 4.9%

I have a feeling that the available genome that I am using in this species is highly fragmented

the assembly statistics is:

Genome size | 709.4 Mb Total ungapped length | 577.7 Mb Number of scaffolds | 191,173 Scaffold N50 | 11.4 Mb Scaffold L50 | 22 Number of contigs | 281,720 Contig N50 | 2.6 kb Contig L50 | 67,375 GC percent | 44.5 Genome coverage | 65.0x Assembly level | Scaffold

do you have any recommendations on how to proceed on this problem? Thank you so much!

FelixKrueger commented 4 months ago

Hmm, a 1% mapping efficiency is very low indeed. This typically happens either when you have the wrong library type (can you attache the R1/R2 FastQC base composition plots?), or when you align to a completely wrong genome (e.g. human/mouse), or possibly if there were some grave technical errors, e.g. in R2?

KeemAntonio commented 4 months ago

Hi Felix! Thank you for responding. I'm attaching the fastqc report here for one pair of my samples:

R1: per_sequence_gc_content per_sequence_quality per_tile_quality sequence_length_distribution adapter_content duplication_levels per_base_n_content per_base_quality per_base_sequence_content

R2: adapter_content duplication_levels per_base_n_content per_base_quality per_base_sequence_content per_sequence_gc_content per_sequence_quality per_tile_quality sequence_length_distribution

I am also trying to see if my sequences are contaminated just to make sure that my samples are really the species that I collected. Do you have any recommendations on what software that I can use for this task? I'm reading that Fastq screen can screen contaminants and can be used for bisulfite sequences.

Regarding your last suggestion/comment, how do we assess if there is a grave technical error in our sequences?

Thank you!

FelixKrueger commented 4 months ago

Thanks for all these plots (they are from the 5' clipped samples, correct?). Overall they all look fine, there are no signs of technical or unexpected issues in the reads (e.g. in the quality, or tile graphs, or the N-count).

My next best guess would be that the sample are simply not from the organism you think they are from. If you wanted you could send me small subset of reads (e.g. 200,000, untrimmed, fastq.gz compressed) via email, and I could take a quick look?

KeemAntonio commented 3 months ago

Hi Felix! sorry for the late reply. Our lab had a field work for 3 weeks field work. May I ask what software would you suggest for me to be able to get a small subset of my sequences for the second fish species (fish2)? may I also ask for your email? is this email correct: fkrueger@altoslabs.com?

Currently we were done with the alignment of the first fish species ( Fish1; the one that I described that the mapping efficiency jumped from 30% to 50% using the default parameters of Bismark (bismark Parameter: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e., not performed)). Here is the result for report generation:

newplot (6) newplot (7) Bismark M-bias Read 1 Bismark M-bias Read 2 newplot (3) newplot (4) newplot (5)

I just have another question for Fish1:

during alignment, bismark is saying that "Chromosomal sequence could not be extracted for..." as shown in the figure below. is this to be expected since we are using genome at the scaffold level? image

FelixKrueger commented 3 months ago

Yes, these warnings show up more frequently if you have short chromosomes or scaffolds. Typically the number is negligibly small, in your case ~0.05%. Just ignore.

To interact with reads you can just use gzip/gunzip, e.g. like so:

gunzip -c inputfile.fastq.gz | head -800000 | gzip - > output_200K.fastq.gz
FelixKrueger commented 3 months ago

Thanks for the sample files. The sample seems to have been processed using the Accel Swift kit, which introduces some hefty biases, especially at the start of Read 2:

Screenshot 2024-03-25 at 13 53 49

As a solution to this you just need to clip your reads from all ends, like so:

trim_galore --clip_R1 10 --clip_R2 15 --three_prime_clip_R1 10 --three_prime_clip_R2 10

I am convinced after that the reads will align just fine. Let me know how you get on!

KeemAntonio commented 4 weeks ago

Hi Felix! thanks again for accommodating my questions regarding the alignment of our samples. we had a major hiccup for our project since there was an issue about the identity of one of our target fish species that is why our wgbs data cannot align properly with the available genome in NCBI (with mapping efficiency of ~0.1%). regardless, our group is currently working on the other fish species that have a much better alignment (the one that you checked; representative alignment report below:)

[specified options: -q --score-min L,0,-0.2 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)]

Final Alignment report

Sequence pairs analysed in total: 109077988 Number of paired-end alignments with a unique best hit: 55975773 Mapping efficiency: 51.3% Sequence pairs with no alignments under any condition: 52764057 Sequence pairs did not map uniquely: 338158 Sequence pairs which were discarded because genomic sequence could not be extracted: 56335

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 27777757 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 28141681 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 2555498641

Total methylated C's in CpG context: 185233029 Total methylated C's in CHG context: 2538723 Total methylated C's in CHH context: 8000082 Total methylated C's in Unknown context: 58299

Total unmethylated C's in CpG context: 55597280 Total unmethylated C's in CHG context: 561300346 Total unmethylated C's in CHH context: 1742829181 Total unmethylated C's in Unknown context: 1314730

C methylated in CpG context: 76.9% C methylated in CHG context: 0.5% C methylated in CHH context: 0.5% C methylated in unknown context (CN or CHN): 4.2%

image

I also tried using less stringent alignment parameters (see representative report below:)

[specified options: -q --score-min L,0,-0.6 --ignore-quals --no-mixed --no-discordant --dovetail --maxins 500 Option '--directional' specified (default mode): alignments to complementary strands (CTOT, CTOB) were ignored (i.e. not performed)]

Final Alignment report

Sequence pairs analysed in total: 109378293 Number of paired-end alignments with a unique best hit: 83632232 Mapping efficiency: 76.5% Sequence pairs with no alignments under any condition: 24581397 Sequence pairs did not map uniquely: 1164664 Sequence pairs which were discarded because genomic sequence could not be extracted: 545836

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 40287619 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 42798777 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 3712028338

Total methylated C's in CpG context: 272265530 Total methylated C's in CHG context: 4863260 Total methylated C's in CHH context: 15651691 Total methylated C's in Unknown context: 359661

Total unmethylated C's in CpG context: 94436141 Total unmethylated C's in CHG context: 804207651 Total unmethylated C's in CHH context: 2520604065 Total unmethylated C's in Unknown context: 8702049

C methylated in CpG context: 74.2% C methylated in CHG context: 0.6% C methylated in CHH context: 0.6% C methylated in unknown context (CN or CHN): 4.0%

we also had done deduplication using --score-min L,0,-0.2 results/alignment (see representative report below:)

_bismark_bt2_pe.bam: 55919438 Total number duplicated alignments removed: 7134894 (12.76%) Duplicated alignments were found at: 6149664 different position(s) Total count of deduplicated leftover sequences: 48784544 (87.24% of total) as of right now, I want to do the next step of the pipeline which is methylation calling. however, my concern is that most of the papers that I've read says that I should use annotated genome during methylation calling however, our current available genome is still being annotated and what we have is only the assembled genome. is it ok to use the only assembled genome for methylation calling and for subsequent downstream analyses (DMRs/DMGs)? so far I tried to run methylation calling for one sample (see report below:) [script/commands that I used: bismark_methylation_extractor --comprehensive --bedGraph --gzip --scaffolds --cytosine_report --genome_folder] after I ran methylation calling, I got these output files: ![image](https://github.com/FelixKrueger/Bismark/assets/145742422/4b237cad-d73e-4530-b04b-9f8770559ac5) ![image](https://github.com/FelixKrueger/Bismark/assets/145742422/dd9873e2-c7ff-4a31-9f7a-3e3392d65aaa) When I opened the [Sample]_bismark_bt2_pe.deduplicated.cytosine_context_summary.txt file, there was no reports that are in the txt files. however [Sample]_pe.deduplicated.M-bias.txt and [Sample]_bismark_bt2_pe.deduplicated_splitting_report.txt had reports (see representative report of [Sample]_bismark_bt2_pe.deduplicated_splitting_report.txt below:) [Sample]bismark_bt2_pe.deduplicated.bam Parameters used to extract methylation information: Bismark Extractor Version: v0.23.1 Bismark result file: paired-end (SAM format) Output specified: comprehensive No overlapping methylation calls specified Processed 48784544 lines in total Total number of methylation call strings processed: 97569088 Final Cytosine Methylation Report ================================= Total number of C's analysed: 1891803435 Total methylated C's in CpG context: 136624254 Total methylated C's in CHG context: 1889811 Total methylated C's in CHH context: 5987407 Total C to T conversions in CpG context: 40920797 Total C to T conversions in CHG context: 415950132 Total C to T conversions in CHH context: 1290431034 C methylated in CpG context: 77.0% C methylated in CHG context: 0.5% C methylated in CHH context: 0.5% is my approach correct?? I hope that you can again help me with this and thank you in advance for answering!
FelixKrueger commented 3 weeks ago

Technically, the methylation already takes place during the alignment step. As the name suggests, the methylation extractor merely extracts the calls and puts them into different formats, which is described in more detail here: https://felixkrueger.github.io/Bismark/bismark/methylation_extraction/

One thing to mention here would be that everything up to and including the coverage comes from the original (or deduplicated) BAM files. Only the last step, generating a cytosine report, requires the genome to be specified again. This genome hast to be the same genome as the one used for the alignment step. Again, more details can be found at the link above. Does that make sense?