broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License

MarkDuplicatesWithMateCigar errors on read ordering w/ sorted input. #327

Closed: MurphyMarkW closed this issue 8 years ago

MurphyMarkW commented 8 years ago

My group ran across this while running some TCGA datasets and I was hoping someone might have some deeper insight into what's going on. The files we're working with are (very) large and non-public. Because of that, I boiled down the inputs while maintaining the error so that I could share them.

It appears that Picard's MarkDuplicatesWithMateCigar will complain about reads being out of order when run with certain options and supplied with certain sorted inputs. The input SAM passes strict ValidateSamFile without error, and will actually pass through MarkDuplicatesWithMateCigar perfectly fine depending on the settings used in the command.

The input I'm currently using to produce the error is:

@HD VN:1.5  SO:coordinate
@SQ SN:chr1 LN:248956422
@RG ID:0    SM:0    PL:ILLUMINA
QNAME0  67  chr1    1   0   101M    =   2   102 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
QNAME0  131 chr1    2   0   101M    =   1   -102    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
QNAME1  115 chr1    115 2   1S100M  =   116 102 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
QNAME1  179 chr1    116 2   101M    =   115 -102    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:1S100M RG:Z:0  NM:i:0

Strict ValidateSamFile for good measure:

java -jar picard-tools-1.140/picard.jar ValidateSamFile INPUT=sorting_error.sam VALIDATION_STRINGENCY=STRICT VERBOSITY=DEBUG
[Thu Nov 05 19:42:30 UTC 2015] picard.sam.ValidateSamFile INPUT=sorting_error.sam VERBOSITY=DEBUG VALIDATION_STRINGENCY=STRICT    MODE=VERBOSE MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Nov 05 19:42:30 UTC 2015] Executing as ubuntu@mwm-thelab on Linux 3.13.0-62-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
DEBUG   2015-11-05 19:42:30 Option  Ignoring VALIDATE_CRC_CHECKSUMS option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
DEBUG   2015-11-05 19:42:30 Option  Ignoring VALIDATE_CRC_CHECKSUMS option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
No errors found
[Thu Nov 05 19:42:30 UTC 2015] picard.sam.ValidateSamFile done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=252182528
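
For completeness, the coordinate ordering can also be checked by hand with htsjdk's SamReader. A minimal sketch (the class name and the simple same-contig comparison are just illustrative, not part of our pipeline):

    import htsjdk.samtools.SAMRecord;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;

    import java.io.File;
    import java.io.IOException;

    /** Illustrative check: walk the SAM and flag any record that starts before its predecessor. */
    public class CheckCoordinateOrder {
        public static void main(final String[] args) throws IOException {
            try (final SamReader reader = SamReaderFactory.makeDefault().open(new File("sorting_error.sam"))) {
                SAMRecord prev = null;
                for (final SAMRecord rec : reader) {
                    // Only compare starts within the same contig; all records here map to chr1.
                    if (prev != null
                            && prev.getReferenceIndex().equals(rec.getReferenceIndex())
                            && prev.getAlignmentStart() > rec.getAlignmentStart()) {
                        System.err.println("Out of order: " + prev.getReadName() + " then " + rec.getReadName());
                    }
                    prev = rec;
                }
                System.out.println("Scan complete.");
            }
        }
    }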

And the command that will produce the error is:

java -jar picard-tools-1.140/picard.jar MarkDuplicatesWithMateCigar INPUT=sorting_error.sam OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=. MAX_RECORDS_IN_RAM=2 MINIMUM_DISTANCE=212 VALIDATION_STRINGENCY=STRICT VERBOSITY=DEBUG
[Thu Nov 05 19:39:19 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar MINIMUM_DISTANCE=212 INPUT=[sorting_error.sam] OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=[.] VERBOSITY=DEBUG VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=2    SKIP_PAIRS_WITH_NO_MATE_CIGAR=true BLOCK_SIZE=100000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicatesWithMateCigar REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Nov 05 19:39:19 UTC 2015] Executing as ubuntu@mwm-thelab on Linux 3.13.0-62-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
DEBUG   2015-11-05 19:39:19 Option  Ignoring EAGERLY_DECODE option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
WARNING 2015-11-05 19:39:19 AbstractOpticalDuplicateFinderCommandLineProgram    Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'QNAME0'.  You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates.  Note that this message will not be emitted again even if other read names do not match the regex.
WARNING 2015-11-05 19:39:19 MarkDuplicatesWithMateCigar Encountered a record with no program record, program group chaining will not occur for this read: QNAME0 1/2 101b aligned read.
Previous record: QNAME1 179 chr1    116 2   101M    =   115 -102    NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:1S100M RG:Z:0  NM:i:0
Current record:QNAME1   115 chr1    115 2   1S100M  =   116 102 NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN   JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ   MC:Z:101M   RG:Z:0  NM:i:0
[Thu Nov 05 19:39:19 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=252182528
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" picard.PicardException: Records were not found coordinate sort order
    at picard.sam.markduplicates.MarkDuplicatesWithMateCigarIterator.next(MarkDuplicatesWithMateCigarIterator.java:228)
    at picard.sam.markduplicates.MarkDuplicatesWithMateCigarIterator.next(MarkDuplicatesWithMateCigarIterator.java:47)
    at picard.sam.markduplicates.MarkDuplicatesWithMateCigar.doWork(MarkDuplicatesWithMateCigar.java:132)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:206)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)

Note the MAX_RECORDS_IN_RAM=2 and MINIMUM_DISTANCE=212. We came across this error by chance on a much larger file, using MAX_RECORDS_IN_RAM=50000 with the same MINIMUM_DISTANCE. It seems that both options are required to cause the error, but the exact values that trigger it depend on the content of the records. There may be other values where the error occurs, but small deviations around these numbers seem to allow the file to pass through without error. For example, with MAX_RECORDS_IN_RAM=3, or any higher value for that matter, we get the following:

java -jar picard-tools-1.140/picard.jar MarkDuplicatesWithMateCigar INPUT=sorting_error.sam OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=. MAX_RECORDS_IN_RAM=3 MINIMUM_DISTANCE=212 VALIDATION_STRINGENCY=STRICT VERBOSITY=DEBUG
[Thu Nov 05 19:47:03 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar MINIMUM_DISTANCE=212 INPUT=[sorting_error.sam] OUTPUT=dup METRICS_FILE=metrics.dup TMP_DIR=[.] VERBOSITY=DEBUG VALIDATION_STRINGENCY=STRICT MAX_RECORDS_IN_RAM=3    SKIP_PAIRS_WITH_NO_MATE_CIGAR=true BLOCK_SIZE=100000 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicatesWithMateCigar REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=TOTAL_MAPPED_REFERENCE_LENGTH READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 QUIET=false COMPRESSION_LEVEL=5 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Thu Nov 05 19:47:03 UTC 2015] Executing as ubuntu@mwm-thelab on Linux 3.13.0-62-generic amd64; OpenJDK 64-Bit Server VM 1.7.0_79-b14; Picard version: 1.140(a81bc82e781dae05c922d1dbcee737334612399f_1444244284) IntelDeflater
DEBUG   2015-11-05 19:47:03 Option  Ignoring EAGERLY_DECODE option; does not apply to PrimitiveSamReaderToSamReaderAdapter readers.
WARNING 2015-11-05 19:47:03 AbstractOpticalDuplicateFinderCommandLineProgram    Default READ_NAME_REGEX '[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).*' did not match read name 'QNAME0'.  You may need to specify a READ_NAME_REGEX in order to correctly identify optical duplicates.  Note that this message will not be emitted again even if other read names do not match the regex.
WARNING 2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Encountered a record with no program record, program group chaining will not occur for this read: QNAME0 1/2 101b aligned read.
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Processed 4 records
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Found 0 records with no mate cigar optional tag.
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Marking 0 records as duplicates.
INFO    2015-11-05 19:47:03 MarkDuplicatesWithMateCigar Found 0 optical duplicate clusters.
[Thu Nov 05 19:47:03 UTC 2015] picard.sam.markduplicates.MarkDuplicatesWithMateCigar done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=252182528
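
Since the exact MAX_RECORDS_IN_RAM value that triggers the failure depends on the records themselves, one way to probe the boundary is to sweep the parameter. A rough sketch using Picard's own CommandLineProgram.instanceMain (the harness class and output file names are hypothetical):

    import picard.sam.markduplicates.MarkDuplicatesWithMateCigar;

    /** Hypothetical harness: sweep MAX_RECORDS_IN_RAM and report which values throw. */
    public class SweepMaxRecordsInRam {
        public static void main(final String[] args) {
            for (int maxRecords = 2; maxRecords <= 10; maxRecords++) {
                try {
                    final int rc = new MarkDuplicatesWithMateCigar().instanceMain(new String[]{
                            "INPUT=sorting_error.sam",
                            "OUTPUT=dup." + maxRecords,
                            "METRICS_FILE=metrics." + maxRecords + ".dup",
                            "TMP_DIR=.",
                            "MAX_RECORDS_IN_RAM=" + maxRecords,
                            "MINIMUM_DISTANCE=212",
                            "VALIDATION_STRINGENCY=STRICT"
                    });
                    System.out.println("MAX_RECORDS_IN_RAM=" + maxRecords + " -> return code " + rc);
                } catch (final Exception e) {
                    // e.g. PicardException: Records were not found coordinate sort order
                    System.out.println("MAX_RECORDS_IN_RAM=" + maxRecords + " -> " + e);
                }
            }
        }
    }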

My initial thought was that I was missing some subtle detail of the SAM spec, but given that the input passes strict ValidateSamFile, and passes through without error when supplied with different parameters (albeit with warnings about the formatting of the QNAME and the lack of PG tags), that seems less likely...

Does this make sense to anyone? Any thoughts or suggestions would be greatly appreciated!

nh13 commented 8 years ago

@MurphyMarkW @bradtaylor and I wrote this version of MarkDuplicates, so loop us in as we go along. [EDIT: you posted the SAM]. By the way, those are some ugly alignments and basecalls.

nh13 commented 8 years ago

Developer note: here's the bug captured in a test case for DiskBackedQueueTest. Likely, once we spill to disk, we cannot add records to RAM.

    /** See: https://github.com/broadinstitute/picard/issues/327 */
    @Test
    public void testPathologyIssue327() {
        final File tmpDir = new File("testdata/htsjdk/samtools/util");
        if (!tmpDir.exists()) tmpDir.mkdir();
        tmpDir.deleteOnExit();

        // maxRecordsInRam is 2, so the third add spills to disk.
        final DiskBackedQueue<String> queue = DiskBackedQueue.newInstance(
                new StringCodec(),
                2, // maxRecordsInRam
                Arrays.asList(tmpDir)
        );

        // Testing a particular order of adding to the queue, setting the result state, and emitting.
        queue.add("0"); // -> RAM
        queue.add("1"); // -> RAM
        queue.add("2"); // RAM is full -> spills to disk
        Assert.assertEquals(queue.poll(), "0"); // frees a RAM slot
        queue.add("3"); // must queue up behind "2" on disk to preserve FIFO order
        Assert.assertEquals(queue.poll(), "1");
        Assert.assertEquals(queue.poll(), "2"); // before the fix, "3" could jump ahead here
        Assert.assertEquals(queue.poll(), "3");
    }

nh13 commented 8 years ago

@MurphyMarkW fixed in https://github.com/samtools/htsjdk/pull/376

nh13 commented 8 years ago

@bradtaylor, I said this over email, but it is here for posterity

Here's an idea that should let you get rid of "canAdd" in DiskBackedQueue entirely, so you never have to worry about it.

Have four internal buffers:

  1. primaryRamQueue
  2. primaryDiskQueue
  3. secondaryRamQueue
  4. secondaryDiskQueue

and the following rules:

  • You put records into #1 if there is space, or into #2 if #1 is full.
  • When you poll from #1 and there are records in #2, then adds go into either #3 or #4 (enforce the number of records in RAM across #1 and #3).
  • If #1 is empty but #2 is not, poll returns from #2, and add puts into #3 or #4 (RAM limited).
  • When #1 and #2 are both empty, you can swap #1 with #3 and #2 with #4.
  • Enforce that size(#1) + size(#3) is less than the number of records allowed in RAM.

In any case, it is probably a good idea to first think of 3-5 test cases that exercise the various switches between the above four queues. Have maxRecordsInRam be something like 2. Also, get rid of the "headRecord" thing, as it is confusing.
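
To make the switching logic concrete, here is a rough sketch of the four-buffer idea (a hypothetical FourBufferQueue, not the actual htsjdk fix; plain in-memory deques stand in for the disk-backed queues):

    import java.util.ArrayDeque;
    import java.util.Deque;

    /** Sketch of the four-buffer FIFO described above; the "disk" deques stand in for disk-backed storage. */
    public class FourBufferQueue<E> {
        private final int maxRecordsInRam;
        private boolean hasPolled = false;
        private Deque<E> primaryRam = new ArrayDeque<>();    // #1
        private Deque<E> primaryDisk = new ArrayDeque<>();   // #2 (would be disk-backed)
        private Deque<E> secondaryRam = new ArrayDeque<>();  // #3
        private Deque<E> secondaryDisk = new ArrayDeque<>(); // #4 (would be disk-backed)

        public FourBufferQueue(final int maxRecordsInRam) {
            this.maxRecordsInRam = maxRecordsInRam;
        }

        public void add(final E record) {
            final boolean secondaryInUse = !secondaryRam.isEmpty() || !secondaryDisk.isEmpty();
            if (secondaryInUse || (hasPolled && !primaryDisk.isEmpty())) {
                // Once we are draining the primary side, new records go to the
                // secondary side so they cannot jump ahead of records in #2.
                if (ramCount() < maxRecordsInRam && secondaryDisk.isEmpty()) {
                    secondaryRam.addLast(record);  // #3, subject to the shared RAM limit
                } else {
                    secondaryDisk.addLast(record); // #4, also behind anything already in #4
                }
            } else if (primaryRam.size() < maxRecordsInRam) {
                primaryRam.addLast(record);  // #1 while there is space
            } else {
                primaryDisk.addLast(record); // #2 once #1 is full
            }
        }

        public E poll() {
            hasPolled = true;
            if (primaryRam.isEmpty() && primaryDisk.isEmpty()) {
                swapBuffers(); // both primary queues drained: promote #3 and #4
            }
            if (!primaryRam.isEmpty()) return primaryRam.pollFirst();
            return primaryDisk.pollFirst(); // null when the whole queue is empty
        }

        private int ramCount() {
            return primaryRam.size() + secondaryRam.size(); // size(#1) + size(#3)
        }

        private void swapBuffers() {
            final Deque<E> ram = primaryRam;
            final Deque<E> disk = primaryDisk;
            primaryRam = secondaryRam;
            primaryDisk = secondaryDisk;
            secondaryRam = ram;
            secondaryDisk = disk;
        }
    }

With maxRecordsInRam of 2 and the add/poll sequence from the test above, "3" lands in #3 behind "2" rather than back in #1, so the queue keeps emitting 0, 1, 2, 3 in order.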

MurphyMarkW commented 8 years ago

@nh13 @bradtaylor \o/ Thank you for taking the time to look into this! I didn't think to look in htsjdk for the source of the bug.

I'd be happy to test out changes on our side and report back. And agreed re: ugly calls @nh13. Doctored records due to protected sequences and all that jazz. :P

yfarjoun commented 8 years ago

What's the status of this issue? Is this still a bug?

MurphyMarkW commented 8 years ago

@yfarjoun it's been a while, but I believe a fix was applied for this specific case and that it worked for our test cases (specifically the one given above). I believe my group ran into a similar-looking issue, but one that we couldn't reliably reproduce, so there may be something else lurking about? I'd have to dig through our logs for a dataset that reproduces it, but if you or someone else has seen something, I can see about setting aside some time to look into it.

From my side, I'd say this particular bug is resolved and that the issue could be closed. Apologies for not responding more promptly with these results. Kind of fell off the radar.

yfarjoun commented 8 years ago

No worries, @MurphyMarkW, just following up to see if I can close this issue. Please reopen or open a new issue if this rears its head again.