FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
366 stars 101 forks source link

alignment problem with reference genome #661

Closed Silvia-Giorgia-Signorini closed 3 months ago

Silvia-Giorgia-Signorini commented 3 months ago

Hello Felix, I’m a PhD student and I’m currently working on epigenetics data (DNA methylation) with limpets. I work in parallel with two different species, Patella caerulea and Patella ulyssiponensis, but I have some problems with the alignment of P. ulyssiponensis sequences to its reference genome. I know that it is a problem of the reference genome because I’ve aligned the sequence of P. ulyssiponensis to the other reference genome that I have (P. caerulea) and it worked well, even if the map efficiency was low (10.8%). The genome preparation was the same for the two species. And I also know that the script for the alignment is ok because it worked well with P. caerulea.

Final Alignment report

Sequence pairs analysed in total: 13359954 Number of paired-end alignments with a unique best hit: 5061646 Mapping efficiency: 37.9% Sequence pairs with no alignments under any condition: 6812837 Sequence pairs did not map uniquely: 1485471 Sequence pairs which were discarded because genomic sequence could not be extracted: 5061646

Number of sequence pairs with unique best (first) alignment came from the bowtie output: CT/GA/CT: 0 ((converted) top strand) GA/CT/CT: 0 (complementary to (converted) top strand) GA/CT/GA: 0 (complementary to (converted) bottom strand) CT/GA/GA: 2531542 ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total: 0

Final Cytosine Methylation Report

Total number of C's analysed: 0

Total methylated C's in CpG context: 0 Total methylated C's in CHG context: 0 Total methylated C's in CHH context: 0 Total methylated C's in Unknown context: 0

Total unmethylated C's in CpG context: 0 Total unmethylated C's in CHG context: 0 Total unmethylated C's in CHH context: 0 Total unmethylated C's in Unknown context: 0

Can't determine percentage of methylated Cs in CpG context if value was 0 Can't determine percentage of methylated Cs in CHG context if value was 0 Can't determine percentage of methylated Cs in CHH context if value was 0 Can't determine percentage of methylated Cs in unknown context (CN or CHN) if value was 0

Bismark completed in 0d 3h 25m 51s

Here you can find other details:

Do you have any suggestion? I really thank you in advance for your availability. Silvia Signorini

FelixKrueger commented 3 months ago

Hi Silvia, something here definitely looks fishy (no pun intended). Could you provide me a with a few reads of your read library (ideally raw, untrimmed reads, and let me know which kind of library prep had been performed). Maybe 200K sequences (gzipped), as an email attachement?

I will in the meantime try to get the reference sequence and have it prepared. Cheers, Felix

Silvia-Giorgia-Signorini commented 3 months ago

Thanks a lot for your fast reply! I have prepared individual libraries through the NEBNext Enzymatic Methyl-seq Kit (EM-seq™) and sequenced on an Illumina Novaseq 6000 sequencer and attached you can find a subsample of an individual (R1 and R2, raw data). Do you need more information? Thanks a lot, Silvia subsampleR1_100K.fq.gz subsampleR2_100K.fq.gz

FelixKrueger commented 3 months ago

I took your files, trimmed 10bp from both 5' ends, and ran alignments: (directional):

Sequence pairs analysed in total:       99872
Number of paired-end alignments with a unique best hit: 36615
Mapping efficiency:     36.7%

Sequence pairs with no alignments under any condition:  47409
Sequence pairs did not map uniquely:    15848
Sequence pairs which were discarded because genomic sequence could not be extracted:    0

Number of sequence pairs with unique best (first) alignment came from the bowtie output:
CT/GA/CT:       18493   ((converted) top strand)
GA/CT/CT:       0       (complementary to (converted) top strand)
GA/CT/GA:       0       (complementary to (converted) bottom strand)
CT/GA/GA:       18122   ((converted) bottom strand)

Number of alignments to (merely theoretical) complementary strands being rejected in total:     0

Final Cytosine Methylation Report
=================================
Total number of C's analysed:   1769028

Total methylated C's in CpG context:    74903
Total methylated C's in CHG context:    1838
Total methylated C's in CHH context:    7128
Total methylated C's in Unknown context:        17

Total unmethylated C's in CpG context:  192829
Total unmethylated C's in CHG context:  296885
Total unmethylated C's in CHH context:  1195445
Total unmethylated C's in Unknown context:      430

C methylated in CpG context:    28.0%
C methylated in CHG context:    0.6%
C methylated in CHH context:    0.6%
C methylated in unknown context (CN or CHN):    3.8%

Bismark completed in 0d 0h 1m 47s

So all works well, there is a 50/50 split between top and bottom strand. I am not exactly sure what went wrong on your side, but something did... Do you have enough resources available at all?

Silvia-Giorgia-Signorini commented 3 months ago

I apologize for my late reply but I was trying to figure out what was my problem. I think I had an issue with the original fasta file of the reference genome, now I fixed it thanks to your indications. I really thank you for your time! Best wishes Silvia

FelixKrueger commented 3 months ago

That's good to hear! All the best