broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
984 stars 369 forks source link

FixMateInformation ASSUME_SORTED option ineffective #1104

Open bredeson opened 6 years ago

bredeson commented 6 years ago

Hey Picard Devs,

I'm using Picard 2.17.6 and am trying to run FixMateInformation on a BAM file that I sorted by queryname with SAMtools v1.6; however, samtools interprets "queryname" as sorting queries naturally, whereas picard sorts queries lexigraphically:

$ java -jar ./picard.jar FixMateInformation VALIDATION_STRINGENCY=SILENT ASSUME_SORTED=false I=queryname_nat.bam O=fix.bam
Exception in thread "main" java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file://fix.bam. Sort order is queryname. Offending records are at [M02484:3:000000000-AAFE5:1:1101:2793:9990] and [M02484:3:000000000-AAFE5:1:1101:2793:10567]
    at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:213)
    at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:200)
    at picard.sam.FixMateInformation.doWork(FixMateInformation.java:229)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:269)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

In the above queryname_nat.bam, Picard is choking when it encounters read M02484:3:000000000-AAFE5:1:1101:2793:10567 sorting after M02484:3:000000000-AAFE5:1:1101:2793:9990. To demonstrate that it was a difference between sorting naturally and sorting lexigraphically, I sorted the BAM file using Unix sort:

$ samtools view -H queryname_nat.bam >queryname_nat.bam.hdr
$ samtools view queryname_nat.bam | sort -T. -S 5G -k1,1 -k2,2n | cat queryname_nat.bam.hdr - | samtools view -bS - >queryname_lex.bam

Then re-ran FixMateInformation to completion:

$ java -jar ./picard.jar FixMateInformation VALIDATION_STRINGENCY=SILENT ASSUME_SORTED=false I=queryname_lex.bam O=fix.bam
# [...]
INFO    2018-01-31 17:25:19 FixMateInformation  Closing output file.
[Wed Jan 31 17:25:19 GMT 2018] picard.sam.FixMateInformation done. Elapsed time: 5.83 minutes.

Also, note that the ASSUME_SORTED option is set to false and Picard is not ignoring the SO tag in the @HD header (and is therefore not attempting to re-sort the BAM). Is this intended?

bredeson commented 6 years ago

Ah, after another few sips of coffee it occurred to me that if ASSUME_SORTED=false, then Picard checks the @HD SO tag to determine the sort order for itself. However, If I set ASSUME_SORTED=true and attempt to run Picard on my SAMtools-sorted (naturally-sorted) BAM file, it still fails with the same error. So either way, the ASSUME_SORTED option seems to not be working at all...

denklewer commented 6 years ago

Hello, as you mentioned above, Picards assumes records sorted lexicographically so records will be sorted using String.compareTo comparator. However samtools sorts records in such way: Records that has any number, starting from current position are less than records, that has letter at this position. If there are records which both have numbers, starting from current positions, they will be compared by these number values. image If records both have letters (instead of numbers) at the current position, they will be compared lexicographically but only until next numbers. image

image

In order to support such sorting SAMRecordQueryNameComparator should be rewritten.

So does picard need this type of sorting?

bredeson commented 6 years ago

Hey @denklewer,

I think that there should be an option that allows the user to specify that the reads are grouped by name (R1 and all its secondary/supplementary alignments are grouped with its R2 and all its secondary/supplementary alignments), as is usually the case from unsorted aligner BAM outputs. If Picard does not need for the reads to be strictly queryname-sorted (only grouped) for FixMateInformation to function correctly, there should be that option.

Thanks!