FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
394 stars 103 forks source link

Issue with data coming from SRA NCBI and bismark methylation extractor #302

Closed Camethyleabergen closed 4 years ago

Camethyleabergen commented 4 years ago

Hi, I'd like to rule out a scientific hypothese and I needed to have methylation data on mitochondrial DNA. I searched on SRA NCBi and found out this : https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3184550

I only downloaded the already aligned BS sequences to mitochondrial DNA, and I have a SAM file which shows like that :

@HD VN:1.0 SO:queryname @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @RG ID:default @PG ID:Bismark VN:v0.14.1 CL:"bismark --path_to_bowtie /home/ambima/bowtie2-2.2.3 -bowtie2 /home/ambima/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta --bam -1 combined_CD1A_1_CTTGTA_L001_R1_001_val_1.fq.gz -2 combined_CD1A_1_CTTGTA_L001_R2_001_val_2.fq.gz" SRR3184550.95891407 147 chrM 93 40 100M = 1 -192 AGATGTTGGAGTTGGAGTATTTTATGTTGTAGTATTCGTTTTTGATTTTTGTTTTATTTTATTATTTATTGTATTTGTGTTTAATATTATAGGTGAATAT @C?BCCCCCCCCC@CCCEDECCB?HAEECIIGE;HBFEEIIIHDCHCIIHHEEGHFG?<HCIIIIGHAEGHIHGHCF>HHFCFAIIIFHHFFDDDDD@@@ NH:i:1 NM:i:29 SRR3184550.95891408 99 chrM 2 24 100M = 92 189 GTTATAGGTTTATTATTTTATTAATTATTTATGGGAGTTTTTTATGTATTTGGTATTTTTGTTTGGGGGGTATGTATGTGATAGTATTGTGAGATGTTGG ?@@;DDBD=DFDHHGGHIIGEH<FBHHFC<HHFGH>FC9@FHIGIDHBHCE<D89BFHBGFC3=CABHFB;5:AD3@DBCC>C@>CDDCEC:C>@CCC@B NH:i:1 NM:i:28 As the command of methylation extractor will tell me : bismark_methylation_extractor /data/cache/SRR3184550_chrM.order.sam The IDs of Read 1 (SRR3184550.95891407) and Read 2 (SRR3184550.95891408) are not the same. This might be the result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file using 'samtools sort -n' (by read name). This may also occur using samtools merge as it does not guarantee the read order. To properly merge files please use 'samtools merge -n' or 'samtools cat'.

I however tried to sort via samtools, and it seems to work, but I have the same error on bismark. I understand that the reads are not the same ID, and I think I understand the sam has already been processed by bismark before being put online. I'd like actually just to have methylation scores for each position. Is it the right way to do so?

Best,

FelixKrueger commented 4 years ago

Hi there,

This is a little weird. The error message by the methylation extractor is that the read IDs in your SAM/BAM file do not come from the same read:

Read 1 ID: SRR3184550.95891407
Read 2 ID: SRR3184550.95891408

Which is fair enough, as these IDs are not the same.

I just downloaded the FastQ files for this accession, and the reads look like this:

File 1:

@SRR3184550.1 1/1
TTATTGATTTATTGTGATATGTACGTGGGTTTTTAGTAGTAAGAAATTAAAGGGTTGTAGGTCGGTTTTTGTTAATTTTTTTTATTTTAAGATAGTTTT
+
@CCFFFDFHH?HHGCFGHIJGJJGHGIJGHGHJJGIDFGIIIGGGEEHHGIIBHDDH@BCA8=A<-:=BFDFEECAEA@AB?&2:A>CECCDACCDDED
@SRR3184550.2 2/1
TATGTTTGGTAATTATGGTATTATTATTTGTAAGTATTTATATAAATGGGAAGATTTTATTAAGGATGATTTAGGTTTAAGGTTTTTAAAATAGTTGAT
+
@CCFFFEDHF+CFHIIIEFHGIGIEBHIGEEA<@AAEHIJCGGGIHHIJJC?F>FHHGIIIFBFA3DFGHIECFI8;@GIGI@@DGFH=E@D@CECB>C
@SRR3184550.3 3/1
TTTTTTGATTTGGGAGTTTAATTCTTTTAATTTAGTATGTAAGAAAATTAATTTGGAAATATTAATTGTTTTGGAAATTTTAGATTTAGGTTATTTTT
+
CCCFFFFFDFHFFBGBEGIHJIJJIJJJJJJIJJCFHGIGFIIJJJIIJIIIJFGGIGIJIHIIGJJIIJIGCBEEEGFEFD?@CEDDCCCACCDDDD

File 2:

@SRR3184550.1 1/2
ACATACTAAATTAAAAAAATTAAACTCCCAAATCAAAAAAAATAAATAATAACTATATTTTAACTAATACTTAAAATCATCCAAAAAAAAAAATAATAA
+
?@@DDDDDHBHHBIGIFIIII@FGIGHIIIHIIIIHDGGIGIIGGEGEIIHHFHHEEECDDDDCCCCCCCCCECCCECCCCCCCCBBBBBBBB8>@@@C
@SRR3184550.2 2/2
ACCAACTAAAATACAACTATCATTCATTTTCCTTAATTTAAAAACCTAACTTCTTTAATTTAATATACAAAAAAACC
+
1+==:+2=ABDD4<+<:2ABEFDEEII9FC@F4<:CEEFA@<4C??D*:DEIEDEIIABDEDC))8)>B8.=@DADD
@SRR3184550.3 3/2
AAATAACCCCCTTTAAACATCTCAATTAACTTTATATTTAATAATTATAACACTACTACTTACAAATATTTATATAAATAAAAAAATCTTACTAAAAAT
+
@CCFDFFFHHHAHGHFIIJIIGHIJIJJIJIAFHGIJFIIJIJJFCGFIIIIICHIHCH<DHHIJJJGIIGGIGIGGHIIIIHHFDAECCDDCEDDCDB

If you align these reads with Bismark you get the following output:

@PG     ID:Bismark      VN:v0.22.3      CL:"bismark --genome /bi/scratch/Genomes/Human/GRCh38/ -1 SRR3184550_10K_1.fastq -2 SRR3184550_10K_2.fastq"
SRR3184550.1_1/1        83      22      15927768        7       99M     =       15927636        -231    AAAACTATCTTAAAATAAAAAAAATTAACAAAAACCGACCTACAACCCTTTAATTTCTTACTACTAAAAACCCACGTACATATCACAATAAATCAATAA     DEDDCCADCCEC>A:2&?BA@AEACEEFDFB=:-<A=8ACB@HDDHBIIGHHEEGGGIIIGFDIGJJHGHGJIGHGJJGJIHGFCGHH?HHFDFFFCC@  NM:i:20 MD:Z:0G1G3G4G0G3T3G6G2G6G3G1G8G6G5G0G0G9G3G6A10 XM:Z:h.h...x....hh.......h......h..x.....Zx...x.z........h......h.....xhh.......Z.h...h.................    XR:Z:CT  XG:Z:GA
SRR3184550.1_1/1        163     22      15927636        7       99M     =       15927768        231     ACATACTAAATTAAAAAAATTAAACTCCCAAATCAAAAAAAATAAATAATAACTATATTTTAACTAATACTTAAAATCATCCAAAAAAAAAAATAATAA     ?@@DDDDDHBHHBIGIFIIII@FGIGHIIIHIIIIHDGGIGIIGGEGEIIHHFHHEEECDDDDCCCCCCCCCECCCECCCCCCCCBBBBBBBB8>@@@C  NM:i:18 MD:Z:0G14G6G0G12G0G5G3G3G2G7G2G0G6G12G1G1G3G4   XM:Z:h..............h......hh............hh.....h...h...h..x.......h..xh......h............h.h.h...h....    XR:Z:GA  XG:Z:GA
SRR3184550.2_2/1        99      22      15927517        42      99M     =       15927627        187     TATGTTTGGTAATTATGGTATTATTATTTGTAAGTATTTATATAAATGGGAAGATTTTATTAAGGATGATTTAGGTTTAAGGTTTTTAAAATAGTTGAT     @CCFFFEDHF+CFHIIIEFHGIGIEBHIGEEA<@AAEHIJCGGGIHHIJJC?F>FHHGIIIFBFA3DFGHIECFI8;@GIGI@@DGFH=E@D@CECB>C  NM:i:13 MD:Z:18C1C2C2C3C24C3C9C0C5C8C0C4C7      XM:Z:..................h.h..h..h...h........................h...h.........hh.....h........hh....x.......        XR:Z:CT      XG:Z:CT
SRR3184550.2_2/1        147     22      15927627        42      77M     =       15927517        -187    GGTTTTTTTGTATATTAAATTAAAGAAGTTAGGTTTTTAAATTAAGGAAAATGAATGATAGTTGTATTTTAGTTGGT   DDAD@=.8B>)8))CDEDBAIIEDEIED:*D??C4<@AFEEC:<4F@CF9IIEEDFEBA2:<+<4DDBA=2+:==+1        NM:i:11 MD:Z:6C3C3C12A5C1C0C0C4C18C10C4 XM:Z:......h...h...h..................h.hhh....h..................x..........x....      XR:Z:GA XG:Z:CT

As you can see Read 1 and Read 2 have the very same Read ID, as it is required by the methylation extractor.

I don't know where you got your SAM/BAM file from, but whoever modified or filtered the SAM file must have done something to the read IDs so that it doesn't fit any more. Have you tried name sorting the BAM files to see if Read 1 and Read 2 can be brought together again? The command for this would be:

samtools sort -n your_file.bam correct_format.bam

Camethyleabergen commented 4 years ago

Hi,

thanks for your short delay to answer. Yes actually the file I was using was already sorted by samtools given these parameters. I'll try from the fastq files then The bam/sam file came actually right from the same source, just on the "aligned" part. Thanks !

FelixKrueger commented 4 years ago

So it would appear that the file underwent some sort of re-formatting probably during the time of submission...