RWilton / Arioc

Arioc: GPU-accelerated DNA short-read alignment
BSD 3-Clause "New" or "Revised" License
55 stars 8 forks source link

Problem with calling methylation marks from AriocP Post aligned sam files #16

Closed RajneeshSrivastava closed 3 years ago

RajneeshSrivastava commented 3 years ago

Hi Team, I am trying to extract the methylation marks from the AriocP post aligned sam file by using the recommended bismark_methylation_extractor script. It is showing an error:

The IDs of Read 1 (A01204:9:HHFNMDSXY:1:1101:1018:2754 1:N:0:AGTCAACA) and Read 2 (A01204:9:HHFNMDSXY:1:1101:1018:2754 2:N:0:AGTCAACA) are not the same. 
This might be the result of sorting the paired-end SAM/BAM files by chromosomal position which is not compatible with correct methylation extraction. 
Please use an unsorted file instead or sort the file using 'samtools sort -n' (by read name). 
This may also occur using samtools merge as it does not guarantee the read order. 
To properly merge files please use 'samtools merge -n' or 'samtools cat'

However, as per recommendation by Bismark, I already sorted (sort -n) the files. Still it is giving the above error.

Below I am pasting the 1st two reads. I am not sure where the bismark_methylation_extractor script is finding it incorrect.

A01204:9:HHFNMDSXY:1:1118:28103:32534 1:N:0:AGTCAACA    83      chr3    67684399        44     151M     =       67684223        -327    TATCCCCCCTCTAAATTAAAATAAAATACACAAAATACTACCAACTATTTTTCCGACTCTTCCAAAATAAAATATATCCAAAATACTACTAATAAATTAAACCACATTAAAACATATAAACATTTAAAATCACTTACAACTTTTTTACTAT,FFFFFFFFFFFF:FF,FFFFFFFFFFFFFFF:FF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:F:FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFFF  AS:i:302        NM:i:0  MD:Z:151       YT:Z:CP  MQ:i:44 Na:i:3  Nb:i:1  c3:i:43 XM:Z:.u.........................h........h..x......x.......Z.............hh.h............h.....x..hh...h.h.........h...................................h..x.    XR:Z:CT XG:Z:GA YS:i:294
A01204:9:HHFNMDSXY:1:1118:28103:32534 2:N:0:AGTCAACA    163     chr3    67684223        44     4I147M   =       67684399        327     GGGGAATCCTATCAATAAACTATTTTTCTCCTCCCGAACCCTATTATTATTCATCCTATTTCCCTCCCTAACAAATATAACAATTATAACTATTATTATTTTTCTCAAATAAATTTAAATTATAAATTACATCAATTCGTCACAACCAAAAFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFFFFFF:FFFFF:FFF:::FFFF:FFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFF:FFFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFFFF  AS:i:294        NM:i:0  MD:Z:147       YT:Z:CP  MQ:i:44 Na:i:2  Nb:i:1  c3:i:45 XM:Z:.....u..........h.h..x.............Zxh....x..h..h....................xh.....h.h...x..h..h..x..h...............h.h........h.h..........x...Z.....x.....h    XR:Z:GA XG:Z:GA YS:i:302

Many thanks in advance! Rajneesh

RWilton commented 3 years ago

I think the problem is exactly what the bismark_methylation_extractor tool stated, namely, that the QNAME fields (the read IDs) for the mates are not identical. The difference is in the mate number, e.g. "1:N:0:AGTCAACA" for mate 1 and "2:N:0:AGTCAACA" for mate two.

Since the QNAMEs are merely copies of the original FASTQ deflines for each read, I can think of two ways to tackle this:

· Re-run AriocE using a QNAME pattern attribute in the element.

For example, <dataIn QNAME="*:*:*:(*:*:*:*) " ... would use only the 4th, 5th, 6th, and 7th fields of the QNAME (lane, tile, x, y) to identify reads.

If you need to preserve the information in the first three fields (instrument ID, run ID, flowcell ID) as well, you could do this: <dataIn QNAME="(*:*:*:*:*:*:*) " ...

However you choose to do it, the goal is to cause AriocP to emit only the encoded QNAME pattern instead of the entire defline.

Parsing the FASTQ defline for both QNAME and RG is described in the Arioc user guide. You can also check AriocE's output to be sure that you have encoded the desired pattern.

· Filter the QNAMEs in the SAM/BAM output.

This should be straightforward with a bit of Linux awk or sed code. There might also be a software tool somewhere that can accomplish this. (If there isn't, perhaps there ought to be!)

My preference is always to specify the QNAME encoding in AriocE.

· rw