Potential issues with sorting by read ID and Sambamba

duncansproul commented 6 years ago

Hi Felix,

As I mentioned by e-mail a few weeks ago, I think I’ve come across an issue with sorting Bismark BAM files that can cause problems with downstream analysis. Essentially, methylation calls for paired-end data can get misallocated from OB to CTOB when you sort a Bismark BAM file by position before resorting by read name.

I’ve done a few more tests and read the methylation extractor code and confirmed that the problem does happen. Bizarrely, it only happens using Sambamba to sort and it is not seen using Samtools. I thought it was worth reporting it in case others come across it and also because it potentially has a fairly easy solution that could be incorporated into Bismark in future.

As I understand it, the methylation extractor module determines what strand data come from by comparing the XR and XG codes for the first read in each pair (read and genome conversion fields respectively). I think the following segment of code deals with this (this is code ~Line 2602):

if ($first_read_conversion eq 'CT' and $genome_conversion eq 'CT') {
      $index = 0;       ## this is OT
      $strand = '+';
    } elsif ($first_read_conversion eq 'GA' and $genome_conversion eq 'CT') {
      $index = 1;       ## this is CTOT
      $strand = '-';
    } elsif ($first_read_conversion eq 'GA' and $genome_conversion eq 'GA') {
      $index = 2;       ## this is CTOB
      $strand = '+';
    } elsif ($first_read_conversion eq 'CT' and $genome_conversion eq 'GA') {
      $index = 3;       ## this is OB
      $strand = '-';
    } else {
      die "Unexpected combination of read and genome conversion: $first_read_conversion / $genome_conversion\n";
    }

After bismark alignment all read pairs are arranged in the BAM with R1 followed by R2 and the methylation extractor assumes this. Both reads are also given exactly the same ID during alignment, ie the part indicating R1 or R2 is the same (they are both labelled R1). Here is some toy output from a short BAM (just showing the important fields with two pairs of reads):

E00397:87:HHMVMCCXY:6:1101:7720:1221_1:N:0:27   99  chr1    106372402   XR:Z:CT XG:Z:CT
E00397:87:HHMVMCCXY:6:1101:7720:1221_1:N:0:27   147 chr1    106372475   XR:Z:GA XG:Z:CT
E00397:87:HHMVMCCXY:6:1101:13930:1221_1:N:0:27  83  chr12   45034123    XR:Z:CT XG:Z:GA
E00397:87:HHMVMCCXY:6:1101:13930:1221_1:N:0:27  163 chr12   45033911    XR:Z:GA XG:Z:GA

After positional sorting, pairs aligned to OB are reversed because R2 has a position lower than R1 (OT aligned reads remain the same):

E00397:87:HHMVMCCXY:6:1101:7720:1221_1:N:0:27   99  chr1    106372402   XR:Z:CT XG:Z:CT
E00397:87:HHMVMCCXY:6:1101:7720:1221_1:N:0:27   147 chr1    106372475   XR:Z:GA XG:Z:CT
E00397:87:HHMVMCCXY:6:1101:13930:1221_1:N:0:27  163 chr12   45033911    XR:Z:GA XG:Z:GA
E00397:87:HHMVMCCXY:6:1101:13930:1221_1:N:0:27  83  chr12   45034123    XR:Z:CT XG:Z:GA

After sorting the position-sorted file by read name, this inverse ordering of OB aligned pairs persists. This is probably because both R1 and R2 are assigned exactly the same ID during bismark alignment.

E00397:87:HHMVMCCXY:6:1101:7720:1221_1:N:0:27   99  chr1    106372402   XR:Z:CT XG:Z:CT
E00397:87:HHMVMCCXY:6:1101:7720:1221_1:N:0:27   147 chr1    106372475   XR:Z:GA XG:Z:CT
E00397:87:HHMVMCCXY:6:1101:13930:1221_1:N:0:27  163 chr12   45033911    XR:Z:GA XG:Z:GA
E00397:87:HHMVMCCXY:6:1101:13930:1221_1:N:0:27  83  chr12   45034123    XR:Z:CT XG:Z:GA

As I mentioned, this appears to be specific to Sambamba. Repeating a similar double sort with Samtools reverts to the correct order. This implies that Samtools sort also looks at something other than the read name field when sorting by read ID.

The issue is not picked up by the “Now testing Bismark result file $filename for positional sorting (which would be bad...)” part of the program because it tests: i) that the BAM file does not contain the positional sorting flag. ii) that the first 100,000 reads are paired by ID (which they are).

The problem can be recapitulated using the attached short BAM file and the following commands (assuming sambamba and samtools are installed): TEST_bismark.bam.zip

# Original file
samtools view TEST_bismark.bam | cut -f 1,2,3,4,15,16

# Sort by position then by Read ID with SAMBAMBA
sambamba sort -o TEST_bismark_sortPos_SAMBAMBA.bam TEST_bismark.bam
sambamba sort -n -o TEST_bismark_sortID_SAMBAMBA.bam TEST_bismark_sortPos_SAMBAMBA.bam

# View results:
# Sort by position - SAMBAMBA
samtools view TEST_bismark_sortPos_SAMBAMBA.bam | cut -f 1,2,3,4,15,16
# Resort by read ID - SAMBAMBA
samtools view TEST_bismark_sortID_SAMBAMBA.bam | cut -f 1,2,3,4,15,16

samtools view TESTshort_bismark_sortPos.bam | cut -f 1,2,3,4,15,16
samtools view TESTshort_bismark_sortID.bam | cut -f 1,2,3,4,15,16

# Now repeat with Samtools:
# Sort by position then name
samtools sort TEST_bismark.bam TESTshort_bismark_sortPos_SAMTOOLS
samtools sort -n TESTshort_bismark_sortPos.bam TEST_bismark_sortID_SAMTOOLS

# View results
# Sort by position - Samtools
samtools view TESTshort_bismark_sortPos_SAMTOOLS.bam | cut -f 1,2,3,4,15,16
# Resort by ID - Samtools
samtools view TESTshort_bismark_sortID_SAMTOOLs.bam | cut -f 1,2,3,4,15,16

The two solutions are: i) Don’t use Sambamba for sorting if you are using Bismark. ii) In the future, R1/R2 IDs could be preserved after Bismark alignment. I did a quick test by manually editing the small bismark BAM file and it removed the problem. Appending a new _1 or _2 worked also.

For completeness, I should say I’ve seen this problem with Bismark v0.16.3 and v0.18.1 but have not yet tested on the latest version. I’m also using samtools v1.2 and sambamba v0.6.6.

Thanks

Duncan

FelixKrueger commented 6 years ago

Thanks Duncan, I haven't forgotten your issue but it is probably a good idea to document it here anyways.

FelixKrueger commented 6 years ago

Hi Duncan,

Many thanks again for providing such an excellent issue report, including test data and a detailed step-by-step procedure. I have now looked at the issue in more detail, and I am coming to the conclusion that the problem here is probably more an unwanted behaviour of sambamba rather than something I would like to address explicitly within Bismark.

As you stated, re-sorting a positionally sorted BAM file by name with samtools brings R1 and R2 into the correct order again. For the sorting-by-name procedure, I believe that Samtools does not only use the readID on its own, but uses the FLAG value in addition to that:

SAM flag 83: read paired read mapped in proper pair read reverse strand first in pair

SAM flag 163: read paired read mapped in proper pair mate reverse strand second in pair

In contrast, sambamba doesn't seem to be doing that. So we have 'encoded' the read identity into the SAM FLAG already if you will, albeit not explicitly via appending R1 or R2 in the read ID field.

In a nutshell I would like to settle for your first proposed solution: i) Don’t use Sambamba for sorting if you are using Bismark.

or maybe: "If you absolutely have to sort Bismark BAM files by chromosomal position between the mapping and deduplication and/or methylation extraction procedure (which we definitely do not advise you should do) you may use samtools or sambamba for the positional sorting; if you want to re-sort these files by read name you have to be using samtools sort -n but not sambamba.

And as a final note: the corner-case issue of positionally sorting and then re-sorting by name that aforementioned positionally sorted file with sambamba discussed here may result in incorrectly assigning the strand from OB to CTOB. This should however not make any difference for the actual methylation call itself, because both OB and CTOB report on the very same C positions on the bottom strand. So it really is rather a cosmetic issue affecting read orientations anyway.

Does this sound like an acceptable solution? Cheers, Felix

duncansproul commented 6 years ago

Hi Felix,

Yes, that was my conclusion regarding likely cause of the difference between samtools and sambamba also. The solution works for me so I'm happy with it going forward. This thread will hopefully help anyone else who encounters this (maybe we can change the title to 'Problem with Bismark and Sambamba').

I think one situation where this could arise that might be more common is merging BAM files, e.g. if someone had multiple FASTQ files and wanted to merge them after aligning.

When I was doing this previously, I naively assumed sambamba merge would be better than samtools as it was supposed to be faster. However, sambamba merge requires sorted BAM files so I thought resorting by name after the merge would restore the correct order for bismark. Obviously using samtools instead would be fine in this case especially as samtools merge can merge read ID sorted files anyway and from what i understand samtools cat can merge unsorted BAMs.

Out of interest, was there a particular reason why bismark uses the same read ID for R1 and 2?

Cheers and thank you for all the help

Duncan

FelixKrueger commented 6 years ago

Hi Duncan,

I have changed the title of this issue so that it might help users in the future.

And yes indeed, samtools merge -n or samtools cat should both work well in this case..

Finally, regarding your last question: I have to admit that I can't remember whether there was any particular reason for using only a single read ID even for a read pair, but I just checked the code and it simply uses the readID of the first read as seqID. For Bismark purposes this is all you need since we always print out R1 and R2 on consecutive lines, but had I known some 6 years ago what people do to Bismark files as intermediate steps I might have done it differently :). It would be a trivial thing to add/change, but this is really the first time it came up....

duncansproul commented 6 years ago

Hi Felix,

Great, thank you again for all your help.

Cheers

Duncan ———— CRUK Career Development Fellow MRC Human Genetics Unit and Edinburgh Cancer Research Centre, MRC IGMM University of Edinburgh, Western General Hospital, Crewe Road Edinburgh, UK EH4 2XU T: +44 (0)131 651 8500 WWW: http://www.ed.ac.uk/mrc-human-genetics-unit/research/sproul-group

On 30 Apr 2018, at 12:00, FelixKrueger notifications@github.com<mailto:notifications@github.com> wrote:

Hi Duncan,

I have changed the title of this issue so that it might help users in the future.

And yes indeed, samtools merge -n or samtools cat should both work well in this case..

Finally, regarding your last question: I have to admit that I can't remember whether there was any particular reason for using only a single read ID even for a read pair, but I just checked the code and it simply uses the readID of the first read as seqID. For Bismark purposes this is all you need since we always print out R1 and R2 on consecutive lines, but had I known some 6 years ago what people do to Bismark files as intermediate steps I might have done it differently :). It would be a trivial thing to add/change, but this is really the first time it came up....

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/170#issuecomment-385367212, or mute the threadhttps://github.com/notifications/unsubscribe-auth/Ak9rmTZkpKHzSJ9eo6UUHh8dPS3dFHWoks5ttu7ZgaJpZM4TlWUy.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

cviner commented 5 years ago

Thanks so much for all of these details!

I may be encountering this issue using MethPipe, with a Sambamba sorted file. There was another underlying issue, but this sorting difference is apparent and perhaps a problem.

I will try reporting this upstream to Sambamba, since this should at least be documented. It sounds like an option to exactly sort like Samtools would be particularly useful.

FelixKrueger commented 5 years ago

Great to hear that old issues can still be useful! All the best, Felix

duncansproul commented 5 years ago

No problem Coby, glad it has been useful.

Duncan ———— CRUK Career Development Fellow MRC Human Genetics Unit and Edinburgh Cancer Research Centre, MRC IGMM University of Edinburgh, Western General Hospital, Crewe Road Edinburgh, UK EH4 2XU T: +44 (0)131 651 8500 WWW: http://www.ed.ac.uk/mrc-human-genetics-unit/research/sproul-group

On 24 Sep 2018, at 18:43, Coby Viner notifications@github.com<mailto:notifications@github.com> wrote:

Thanks so much for all of these details!

I just encountered this issue using MethPipehttps://github.com/smithlabcode/methpipe, with a Sambamba sorted file. This actually resulted in a quite confusing segmentation fault. Switching to Samtools for sorting did indeed fix the problem.

I will try reporting this upstream to Sambambahttps://github.com/biod/sambamba, since this should at least be documented. It sounds like an option to exactly sort like Samtools would be particularly useful.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHubhttps://github.com/FelixKrueger/Bismark/issues/170#issuecomment-424062557, or mute the threadhttps://github.com/notifications/unsubscribe-auth/Ak9rmc6VQiySteT5YHLIZ0c8OKsKNfYiks5ueRnegaJpZM4TlWUy.

The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

FelixKrueger / Bismark

Potential issues with sorting by read ID and Sambamba #170