Closed duncansproul closed 6 years ago
Thanks Duncan, I haven't forgotten your issue but it is probably a good idea to document it here anyways.
Hi Duncan,
Many thanks again for providing such an excellent issue report, including test data and a detailed step-by-step procedure. I have now looked at the issue in more detail, and I am coming to the conclusion that the problem here is probably more an unwanted behaviour of sambamba rather than something I would like to address explicitly within Bismark.
As you stated, re-sorting a positionally sorted BAM file by name with samtools brings R1 and R2 into the correct order again. For the sorting-by-name procedure, I believe that Samtools does not only use the readID on its own, but uses the FLAG value in addition to that:
SAM flag 83: read paired; read mapped in proper pair; read reverse strand; first in pair
SAM flag 163: read paired; read mapped in proper pair; mate reverse strand; second in pair
In contrast, sambamba doesn't seem to be doing that. So we have 'encoded' the read identity into the SAM FLAG already if you will, albeit not explicitly via appending R1 or R2 in the read ID field.
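As a small illustration of how the two FLAG values above break down, here is a minimal Python sketch that decodes a FLAG into the bit names defined in the SAM specification. The function name `decode_flag` is made up for this example and is not part of Bismark, samtools or sambamba.

```python
# Standard SAM FLAG bits as defined in the SAM specification.
SAM_FLAG_BITS = {
    0x1: "read paired",
    0x2: "read mapped in proper pair",
    0x4: "read unmapped",
    0x8: "mate unmapped",
    0x10: "read reverse strand",
    0x20: "mate reverse strand",
    0x40: "first in pair",
    0x80: "second in pair",
    0x100: "not primary alignment",
    0x200: "read fails quality checks",
    0x400: "PCR or optical duplicate",
    0x800: "supplementary alignment",
}


def decode_flag(flag):
    """Return the names of all FLAG bits that are set (hypothetical helper)."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


for flag in (83, 163):
    print(flag, decode_flag(flag))
```

The bits that distinguish R1 from R2 are 0x40 (first in pair) and 0x80 (second in pair), so a name sort that also looks at the FLAG can tell the two mates apart even when their read IDs are identical.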
In a nutshell I would like to settle for your first proposed solution:
i) Don’t use Sambamba for sorting if you are using Bismark.
or maybe: "If you absolutely have to sort Bismark BAM files by chromosomal position between the mapping and deduplication and/or methylation extraction procedure (which we definitely do not advise), you may use samtools or sambamba for the positional sorting; if you want to re-sort these files by read name, you have to use samtools sort -n, not sambamba."
And as a final note: the corner-case issue discussed here, i.e. positionally sorting a file and then re-sorting that positionally sorted file by name with sambamba, may result in incorrectly assigning the strand from OB to CTOB. This should however not make any difference for the actual methylation call itself, because both OB and CTOB report on the very same C positions on the bottom strand. So it really is rather a cosmetic issue affecting read orientations anyway.
Does this sound like an acceptable solution? Cheers, Felix
Hi Felix,
Yes, that was also my conclusion regarding the likely cause of the difference between samtools and sambamba. The solution works for me so I'm happy with it going forward. This thread will hopefully help anyone else who encounters this (maybe we can change the title to 'Problem with Bismark and Sambamba').
I think one situation where this could arise that might be more common is merging BAM files, e.g. if someone had multiple FASTQ files and wanted to merge them after aligning.
When I was doing this previously, I naively assumed sambamba merge would be better than samtools as it was supposed to be faster. However, sambamba merge requires sorted BAM files, so I thought re-sorting by name after the merge would restore the correct order for bismark. Obviously using samtools instead would be fine in this case, especially as samtools merge can merge read ID sorted files anyway and, from what I understand, samtools cat can merge unsorted BAMs.
Out of interest, was there a particular reason why bismark uses the same read ID for R1 and R2?
Cheers and thank you for all the help
Duncan
Hi Duncan,
I have changed the title of this issue so that it might help users in the future.
And yes indeed, samtools merge -n or samtools cat should both work well in this case.
Finally, regarding your last question: I have to admit that I can't remember whether there was any particular reason for using only a single read ID even for a read pair, but I just checked the code and it simply uses the readID of the first read as seqID. For Bismark purposes this is all you need since we always print out R1 and R2 on consecutive lines, but had I known some 6 years ago what people do to Bismark files as intermediate steps I might have done it differently :). It would be a trivial thing to add/change, but this is really the first time it came up....
Hi Felix,
Great, thank you again for all your help.
Cheers
Duncan ———— CRUK Career Development Fellow MRC Human Genetics Unit and Edinburgh Cancer Research Centre, MRC IGMM University of Edinburgh, Western General Hospital, Crewe Road Edinburgh, UK EH4 2XU T: +44 (0)131 651 8500 WWW: http://www.ed.ac.uk/mrc-human-genetics-unit/research/sproul-group
The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Thanks so much for all of these details!
I may be encountering this issue using MethPipe, with a Sambamba sorted file. There was another underlying issue, but this sorting difference is apparent and perhaps a problem.
I will try reporting this upstream to Sambamba, since this should at least be documented. It sounds like an option to exactly sort like Samtools would be particularly useful.
Great to hear that old issues can still be useful! All the best, Felix
No problem Coby, glad it has been useful.
Duncan
Hi Felix,
As I mentioned by e-mail a few weeks ago, I think I’ve come across an issue with sorting Bismark BAM files that can cause problems with downstream analysis. Essentially, methylation calls for paired-end data can get misallocated from OB to CTOB when you sort a Bismark BAM file by position before resorting by read name.
I’ve done a few more tests and read the methylation extractor code and confirmed that the problem does happen. Bizarrely, it only happens using Sambamba to sort and it is not seen using Samtools. I thought it was worth reporting it in case others come across it and also because it potentially has a fairly easy solution that could be incorporated into Bismark in future.
As I understand it, the methylation extractor module determines what strand data come from by comparing the XR and XG codes for the first read in each pair (read and genome conversion fields respectively). I think the following segment of code deals with this (this is code ~Line 2602):
After Bismark alignment, all read pairs are arranged in the BAM file with R1 followed by R2, and the methylation extractor assumes this. Both reads are also given exactly the same ID during alignment, i.e. the part indicating R1 or R2 is the same (they are both labelled R1). Here is some toy output from a short BAM (just showing the important fields with two pairs of reads):
After positional sorting, pairs aligned to OB are reversed because R2 has a position lower than R1 (OT aligned reads remain the same):
After sorting the position-sorted file by read name, this inverse ordering of OB aligned pairs persists. This is probably because both R1 and R2 are assigned exactly the same ID during bismark alignment.
As I mentioned, this appears to be specific to Sambamba. Repeating a similar double sort with Samtools reverts to the correct order. This implies that Samtools sort also looks at something other than the read name field when sorting by read ID.
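The effect described above can be modelled with a few lines of Python. This is only a sketch under stated assumptions: the read names, positions and the `STRAND` lookup table are made up for illustration (the table reflects an assumption about how the XR/XG pair of the first read maps to a strand label and should be checked against the extractor code), and the two sort keys are simplified stand-ins rather than the actual comparators of sambamba or samtools.

```python
from operator import itemgetter

# Toy records: (read_id, flag, position, XR, XG). All values are invented.
bismark_order = [
    ("read_A", 99, 100, "CT", "CT"),   # OT pair, R1
    ("read_A", 147, 180, "GA", "CT"),  # OT pair, R2
    ("read_B", 83, 480, "CT", "GA"),   # OB pair, R1 (reverse strand, higher position)
    ("read_B", 163, 400, "GA", "GA"),  # OB pair, R2 (lower position)
]

# Assumed mapping from (read conversion, genome conversion) to strand label.
STRAND = {
    ("CT", "CT"): "OT",
    ("CT", "GA"): "OB",
    ("GA", "CT"): "CTOT",
    ("GA", "GA"): "CTOB",
}


def call_strands(records):
    """Label each pair using only the first record of the pair, as described above."""
    return [STRAND[(records[i][3], records[i][4])] for i in range(0, len(records), 2)]


def name_then_mate(record):
    """Sort key: read ID first, then first-in-pair (0x40) before second-in-pair."""
    return (record[0], 0 if record[1] & 0x40 else 1)


by_position = sorted(bismark_order, key=itemgetter(2))
# Model of a name sort that compares the read ID only (Python's sort is stable,
# so mates with identical IDs keep their positional order).
by_name_only = sorted(by_position, key=itemgetter(0))
# Model of a name sort that also uses the FLAG to break ties between mates.
by_name_and_flag = sorted(by_position, key=name_then_mate)

print("original:       ", call_strands(bismark_order))
print("ID-only re-sort:", call_strands(by_name_only))
print("ID+FLAG re-sort:", call_strands(by_name_and_flag))
```

In this toy model the ID-only re-sort leaves R2 ahead of R1 for the OB pair, so the pair is labelled CTOB, while the ID+FLAG re-sort restores the R1-then-R2 order and the OB label.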
The issue is not picked up by the “Now testing Bismark result file $filename for positional sorting (which would be bad...)” part of the program because it tests: i) that the BAM file does not contain the positional sorting flag. ii) that the first 100,000 reads are paired by ID (which they are).
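A stricter check than pairing by ID alone could also look at the first/second-in-pair FLAG bits. The sketch below is a hypothetical helper, not something Bismark currently does; it assumes the records are given as (read ID, FLAG) tuples in file order.

```python
def find_misordered_pairs(records):
    """Return the indexes of consecutive pairs that are not R1 followed by R2.

    records: list of (read_id, flag) tuples in file order (hypothetical input format).
    """
    bad = []
    for i in range(0, len(records) - 1, 2):
        (id1, flag1), (id2, flag2) = records[i], records[i + 1]
        same_id = id1 == id2
        # 0x40 = first in pair, 0x80 = second in pair
        r1_then_r2 = bool(flag1 & 0x40) and bool(flag2 & 0x80)
        if not (same_id and r1_then_r2):
            bad.append(i // 2)
    return bad


# Second pair has R2 (flag 163) ahead of R1 (flag 83), although the IDs match.
toy = [("read_A", 99), ("read_A", 147), ("read_B", 163), ("read_B", 83)]
print(find_misordered_pairs(toy))
```

A check on IDs alone passes for this toy input, whereas the FLAG-aware check reports the second pair.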
The problem can be recapitulated using the attached short BAM file (TEST_bismark.bam.zip) and the following commands (assuming sambamba and samtools are installed):
The two solutions are: i) Don’t use Sambamba for sorting if you are using Bismark. ii) In the future, R1/R2 IDs could be preserved after Bismark alignment. I did a quick test by manually editing the small bismark BAM file and it removed the problem. Appending a new _1 or _2 worked also.
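To illustrate solution ii), here is a minimal sketch of appending _1 or _2 to the read ID based on the FLAG, so that a sort on the ID alone is enough to restore the R1-then-R2 order. The helper name `tag_read_id` and the toy records are made up, and the plain string sort stands in for a real name sort.

```python
def tag_read_id(read_id, flag):
    """Append _1 or _2 according to the first/second-in-pair FLAG bits."""
    if flag & 0x40:
        return read_id + "_1"
    if flag & 0x80:
        return read_id + "_2"
    return read_id  # reads without either bit set are left unchanged


# Toy (read_id, flag) records in the order a positional sort might leave them.
position_sorted = [("read_B", 163), ("read_B", 83), ("read_A", 99), ("read_A", 147)]

tagged = [(tag_read_id(read_id, flag), flag) for read_id, flag in position_sorted]
# Sorting on the tagged ID alone now places R1 before R2 within each pair.
resorted = sorted(tagged, key=lambda record: record[0])
print(resorted)
```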
For completeness, I should say I’ve seen this problem with Bismark v0.16.3 and v0.18.1 but have not yet tested on the latest version. I’m also using samtools v1.2 and sambamba v0.6.6.
Thanks
Duncan