Closed Camethyleabergen closed 4 years ago
Hi there,
This is a little weird. The error message by the methylation extractor is that the read IDs in your SAM/BAM file do not come from the same read:
Read 1 ID: SRR3184550.95891407
Read 2 ID: SRR3184550.95891408
Which is fair enough, as these IDs are not the same.
I just downloaded the FastQ files for this accession, and the reads look like this:
File 1:
@SRR3184550.1 1/1
TTATTGATTTATTGTGATATGTACGTGGGTTTTTAGTAGTAAGAAATTAAAGGGTTGTAGGTCGGTTTTTGTTAATTTTTTTTATTTTAAGATAGTTTT
+
@CCFFFDFHH?HHGCFGHIJGJJGHGIJGHGHJJGIDFGIIIGGGEEHHGIIBHDDH@BCA8=A<-:=BFDFEECAEA@AB?&2:A>CECCDACCDDED
@SRR3184550.2 2/1
TATGTTTGGTAATTATGGTATTATTATTTGTAAGTATTTATATAAATGGGAAGATTTTATTAAGGATGATTTAGGTTTAAGGTTTTTAAAATAGTTGAT
+
@CCFFFEDHF+CFHIIIEFHGIGIEBHIGEEA<@AAEHIJCGGGIHHIJJC?F>FHHGIIIFBFA3DFGHIECFI8;@GIGI@@DGFH=E@D@CECB>C
@SRR3184550.3 3/1
TTTTTTGATTTGGGAGTTTAATTCTTTTAATTTAGTATGTAAGAAAATTAATTTGGAAATATTAATTGTTTTGGAAATTTTAGATTTAGGTTATTTTT
+
CCCFFFFFDFHFFBGBEGIHJIJJIJJJJJJIJJCFHGIGFIIJJJIIJIIIJFGGIGIJIHIIGJJIIJIGCBEEEGFEFD?@CEDDCCCACCDDDD
File 2:
@SRR3184550.1 1/2
ACATACTAAATTAAAAAAATTAAACTCCCAAATCAAAAAAAATAAATAATAACTATATTTTAACTAATACTTAAAATCATCCAAAAAAAAAAATAATAA
+
?@@DDDDDHBHHBIGIFIIII@FGIGHIIIHIIIIHDGGIGIIGGEGEIIHHFHHEEECDDDDCCCCCCCCCECCCECCCCCCCCBBBBBBBB8>@@@C
@SRR3184550.2 2/2
ACCAACTAAAATACAACTATCATTCATTTTCCTTAATTTAAAAACCTAACTTCTTTAATTTAATATACAAAAAAACC
+
1+==:+2=ABDD4<+<:2ABEFDEEII9FC@F4<:CEEFA@<4C??D*:DEIEDEIIABDEDC))8)>B8.=@DADD
@SRR3184550.3 3/2
AAATAACCCCCTTTAAACATCTCAATTAACTTTATATTTAATAATTATAACACTACTACTTACAAATATTTATATAAATAAAAAAATCTTACTAAAAAT
+
@CCFDFFFHHHAHGHFIIJIIGHIJIJJIJIAFHGIJFIIJIJJFCGFIIIIICHIHCH<DHHIJJJGIIGGIGIGGHIIIIHHFDAECCDDCEDDCDB
If you align these reads with Bismark you get the following output:
@PG ID:Bismark VN:v0.22.3 CL:"bismark --genome /bi/scratch/Genomes/Human/GRCh38/ -1 SRR3184550_10K_1.fastq -2 SRR3184550_10K_2.fastq"
SRR3184550.1_1/1 83 22 15927768 7 99M = 15927636 -231 AAAACTATCTTAAAATAAAAAAAATTAACAAAAACCGACCTACAACCCTTTAATTTCTTACTACTAAAAACCCACGTACATATCACAATAAATCAATAA DEDDCCADCCEC>A:2&?BA@AEACEEFDFB=:-<A=8ACB@HDDHBIIGHHEEGGGIIIGFDIGJJHGHGJIGHGJJGJIHGFCGHH?HHFDFFFCC@ NM:i:20 MD:Z:0G1G3G4G0G3T3G6G2G6G3G1G8G6G5G0G0G9G3G6A10 XM:Z:h.h...x....hh.......h......h..x.....Zx...x.z........h......h.....xhh.......Z.h...h................. XR:Z:CT XG:Z:GA
SRR3184550.1_1/1 163 22 15927636 7 99M = 15927768 231 ACATACTAAATTAAAAAAATTAAACTCCCAAATCAAAAAAAATAAATAATAACTATATTTTAACTAATACTTAAAATCATCCAAAAAAAAAAATAATAA ?@@DDDDDHBHHBIGIFIIII@FGIGHIIIHIIIIHDGGIGIIGGEGEIIHHFHHEEECDDDDCCCCCCCCCECCCECCCCCCCCBBBBBBBB8>@@@C NM:i:18 MD:Z:0G14G6G0G12G0G5G3G3G2G7G2G0G6G12G1G1G3G4 XM:Z:h..............h......hh............hh.....h...h...h..x.......h..xh......h............h.h.h...h.... XR:Z:GA XG:Z:GA
SRR3184550.2_2/1 99 22 15927517 42 99M = 15927627 187 TATGTTTGGTAATTATGGTATTATTATTTGTAAGTATTTATATAAATGGGAAGATTTTATTAAGGATGATTTAGGTTTAAGGTTTTTAAAATAGTTGAT @CCFFFEDHF+CFHIIIEFHGIGIEBHIGEEA<@AAEHIJCGGGIHHIJJC?F>FHHGIIIFBFA3DFGHIECFI8;@GIGI@@DGFH=E@D@CECB>C NM:i:13 MD:Z:18C1C2C2C3C24C3C9C0C5C8C0C4C7 XM:Z:..................h.h..h..h...h........................h...h.........hh.....h........hh....x....... XR:Z:CT XG:Z:CT
SRR3184550.2_2/1 147 22 15927627 42 77M = 15927517 -187 GGTTTTTTTGTATATTAAATTAAAGAAGTTAGGTTTTTAAATTAAGGAAAATGAATGATAGTTGTATTTTAGTTGGT DDAD@=.8B>)8))CDEDBAIIEDEIED:*D??C4<@AFEEC:<4F@CF9IIEEDFEBA2:<+<4DDBA=2+:==+1 NM:i:11 MD:Z:6C3C3C12A5C1C0C0C4C18C10C4 XM:Z:......h...h...h..................h.hhh....h..................x..........x.... XR:Z:GA XG:Z:CT
As you can see Read 1 and Read 2 have the very same Read ID, as it is required by the methylation extractor.
I don't know where you got your SAM/BAM file from, but whoever modified or filtered the SAM file must have done something to the read IDs so that it doesn't fit any more. Have you tried name sorting the BAM files to see if Read 1 and Read 2 can be brought together again? The command for this would be:
samtools sort -n your_file.bam correct_format.bam
Hi,
thanks for your short delay to answer. Yes actually the file I was using was already sorted by samtools given these parameters. I'll try from the fastq files then The bam/sam file came actually right from the same source, just on the "aligned" part. Thanks !
So it would appear that the file underwent some sort of re-formatting probably during the time of submission...
Hi, I'd like to rule out a scientific hypothese and I needed to have methylation data on mitochondrial DNA. I searched on SRA NCBi and found out this : https://trace.ncbi.nlm.nih.gov/Traces/sra/?run=SRR3184550
I only downloaded the already aligned BS sequences to mitochondrial DNA, and I have a SAM file which shows like that :
@HD VN:1.0 SO:queryname @SQ SN:chrM LN:16571 @SQ SN:chr1 LN:249250621 @SQ SN:chr2 LN:243199373 @SQ SN:chr3 LN:198022430 @SQ SN:chr4 LN:191154276 @SQ SN:chr5 LN:180915260 @SQ SN:chr6 LN:171115067 @SQ SN:chr7 LN:159138663 @SQ SN:chr8 LN:146364022 @SQ SN:chr9 LN:141213431 @SQ SN:chr10 LN:135534747 @SQ SN:chr11 LN:135006516 @SQ SN:chr12 LN:133851895 @SQ SN:chr13 LN:115169878 @SQ SN:chr14 LN:107349540 @SQ SN:chr15 LN:102531392 @SQ SN:chr16 LN:90354753 @SQ SN:chr17 LN:81195210 @SQ SN:chr18 LN:78077248 @SQ SN:chr19 LN:59128983 @SQ SN:chr20 LN:63025520 @SQ SN:chr21 LN:48129895 @SQ SN:chr22 LN:51304566 @SQ SN:chrX LN:155270560 @SQ SN:chrY LN:59373566 @RG ID:default @PG ID:Bismark VN:v0.14.1 CL:"bismark --path_to_bowtie /home/ambima/bowtie2-2.2.3 -bowtie2 /home/ambima/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta --bam -1 combined_CD1A_1_CTTGTA_L001_R1_001_val_1.fq.gz -2 combined_CD1A_1_CTTGTA_L001_R2_001_val_2.fq.gz" SRR3184550.95891407 147 chrM 93 40 100M = 1 -192 AGATGTTGGAGTTGGAGTATTTTATGTTGTAGTATTCGTTTTTGATTTTTGTTTTATTTTATTATTTATTGTATTTGTGTTTAATATTATAGGTGAATAT @C?BCCCCCCCCC@CCCEDECCB?HAEECIIGE;HBFEEIIIHDCHCIIHHEEGHFG?<HCIIIIGHAEGHIHGHCF>HHFCFAIIIFHHFFDDDDD@@@ NH:i:1 NM:i:29 SRR3184550.95891408 99 chrM 2 24 100M = 92 189 GTTATAGGTTTATTATTTTATTAATTATTTATGGGAGTTTTTTATGTATTTGGTATTTTTGTTTGGGGGGTATGTATGTGATAGTATTGTGAGATGTTGG ?@@;DDBD=DFDHHGGHIIGEH<FBHHFC<HHFGH>FC9@FHIGIDHBHCE<D89BFHBGFC3=CABHFB;5:AD3@DBCC>C@>CDDCEC:C>@CCC@B NH:i:1 NM:i:28
As the command of methylation extractor will tell me :bismark_methylation_extractor /data/cache/SRR3184550_chrM.order.sam
The IDs of Read 1 (SRR3184550.95891407) and Read 2 (SRR3184550.95891408) are not the same. This might be the result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file using 'samtools sort -n' (by read name). This may also occur using samtools merge as it does not guarantee the read order. To properly merge files please use 'samtools merge -n' or 'samtools cat'.
I however tried to sort via samtools, and it seems to work, but I have the same error on bismark. I understand that the reads are not the same ID, and I think I understand the sam has already been processed by bismark before being put online. I'd like actually just to have methylation scores for each position. Is it the right way to do so?
Best,