[Closed] @MurphyMarkW closed this issue 8 years ago
@MurphyMarkW @bradtaylor and I wrote this version of MarkDuplicates, so loop us in as we go along. [EDIT: you posted the SAM]. By the way, those are some ugly alignments and basecalls.
Developer note: here's the bug in a test case for DiskBackedQueueTest. Likely, once we spill to disk, we cannot add records to RAM.
/** See: https://github.com/broadinstitute/picard/issues/327 */
@Test
public void testPathologyIssue327() {
    final File tmpDir = new File("testdata/htsjdk/samtools/util");
    if (!tmpDir.exists()) tmpDir.mkdir();
    tmpDir.deleteOnExit();
    final DiskBackedQueue<String> queue = DiskBackedQueue.newInstance(
            new StringCodec(),
            2, // maxRecordsInRam
            Arrays.asList(tmpDir)
    );
    // Test a particular order of adding to the queue, setting the result state, and emitting.
    queue.add("0");
    queue.add("1");
    queue.add("2");
    Assert.assertEquals(queue.poll(), "0");
    queue.add("3");
    Assert.assertEquals(queue.poll(), "1");
    Assert.assertEquals(queue.poll(), "2");
    Assert.assertEquals(queue.poll(), "3");
}
@MurphyMarkW fixed in https://github.com/samtools/htsjdk/pull/376
@bradtaylor, I said this over email, but here it is for posterity.
Here's an idea that should allow you to never have to worry about "canAdd" (get rid of it) in DiskBackedQueue.

Have four internal buffers:

- Add to #1 if there is space, or to #2 if #1 is full.
- Once records are in #2, you can add into either #3 or #4 (enforce the # of records in RAM across #1 & #3).
- If #1 is empty but #2 is not, poll returns from #2, and add puts records in #3 or #4 (RAM limited).
- If #1 and #2 are both empty, you can swap #1 with #3, and #2 with #4.
- Add to RAM only while (#1 + #3) is less than the # of records allowed in RAM.

In any case, it is probably a good idea to first think of 3-5 test cases that exercise the various switching among the above four queues. Have maxRecordsInRam be something like 2. Also, get rid of the "headRecord" thing, as it is confusing.
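The four-buffer scheme above can be sketched roughly as follows. This is a minimal illustration, not the actual htsjdk implementation: the class and field names are invented here, and buffers #2 and #4 are modeled as in-memory deques standing in for on-disk spill files, so the example is self-contained.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the four-buffer idea (names invented here).
// #1 (ram1) and #3 (ram3) are the RAM buffers; #2 (spill2) and #4 (spill4)
// stand in for on-disk spill files, modeled as deques for illustration.
class FourBufferQueue<E> {
    private final int maxRecordsInRam;
    private Deque<E> ram1 = new ArrayDeque<>();   // #1: polled first
    private Deque<E> spill2 = new ArrayDeque<>(); // #2: overflow for #1
    private Deque<E> ram3 = new ArrayDeque<>();   // #3: fills once #2 is in use
    private Deque<E> spill4 = new ArrayDeque<>(); // #4: overflow for #3

    FourBufferQueue(final int maxRecordsInRam) {
        this.maxRecordsInRam = maxRecordsInRam;
    }

    // No "canAdd" needed: add always succeeds.
    public void add(final E record) {
        // RAM limit is enforced across #1 and #3.
        final boolean ramHasSpace = ram1.size() + ram3.size() < maxRecordsInRam;
        if (spill2.isEmpty() && ram3.isEmpty() && spill4.isEmpty()) {
            // Nothing has overflowed yet: fill #1, spilling into #2.
            if (ramHasSpace) ram1.addLast(record);
            else spill2.addLast(record);
        } else {
            // Once #2 is in use, new records must come after everything
            // already in #2, so they go to #3, spilling into #4.
            if (ramHasSpace) ram3.addLast(record);
            else spill4.addLast(record);
        }
    }

    public E poll() {
        if (!ram1.isEmpty()) return ram1.pollFirst();
        if (!spill2.isEmpty()) return spill2.pollFirst();
        if (!ram3.isEmpty() || !spill4.isEmpty()) {
            // #1 and #2 are both drained: swap #3 -> #1 and #4 -> #2, retry.
            Deque<E> tmp = ram1; ram1 = ram3; ram3 = tmp;
            tmp = spill2; spill2 = spill4; spill4 = tmp;
            return poll();
        }
        return null; // queue is empty
    }
}
```

With maxRecordsInRam = 2, the pathological add/poll sequence from the test above (add 0, 1, 2; poll; add 3; poll, poll, poll) comes out in FIFO order, 0 through 3, even though record 2 spilled while record 3 was added after a poll.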
@nh13 @bradtaylor \o/ Thank you for taking the time to look into this! I didn't think to look in htsjdk for the source of the bug.
I'd be happy to test out changes on our side and report back. And agreed re: ugly calls @nh13. Doctored records due to protected sequences and all that jazz. :P
What's the status of this issue? Is this still a bug?
@yfarjoun it's been a while but I believe there was a fix applied for this specific case and that it worked for our test cases (specifically the one given above). I believe my group ran into a similar-looking issue, but one that we couldn't reliably reproduce, so there may be something else lurking about? I'd have to dig through our logs for a dataset that reproduces it, but if you or someone else has seen something, I might see about setting aside some time to look into it.
From my side, I'd say this particular bug is resolved and that the issue could be closed. Apologies for not responding more promptly with these results. Kind of fell off the radar.
No worries. @MurphyMarkW, just following up to see if we can close this issue. Please reopen or open a new issue if this raises its head again.
My group ran across this while running some TCGA datasets and I was hoping someone might have some deeper insight into what's going on. The files we're working with are (very) large and non-public. Because of that, I boiled down the inputs while maintaining the error so that I could share them.
It appears that picard MarkDuplicatesWithMateCigar will complain about reads being out of order when run with certain options and supplied with certain sorted inputs. The input SAM passes strict ValidateSamFile without error, and will actually pass through MarkDuplicatesWithMateCigar perfectly fine depending on the settings used in the command.
The input I'm currently using to produce the error is:
Strict ValidateSamFile for good measure:
And the command that will produce the error is:
Note the MAX_RECORDS_IN_RAM=2 and MINIMUM_DISTANCE=212. We came across this error by chance on a much larger file, using the option MAX_RECORDS_IN_RAM=50000 w/ MINIMUM_DISTANCE the same. It seems like these two options are required to cause the error, but the exact value required to cause it depends upon the content of the records. There may be other values where the error occurs, but small deviations around these numbers seem to allow the file to pass through without error. For example, with MAX_RECORDS_IN_RAM=3, or any higher value for that matter, we get the following:

My initial thought was that I was missing some subtle detail of the SAM spec, but given that the SAM input passes ValidateSamFile, and will pass through without error when supplied with different parameters, albeit with warnings about the formatting of the QNAME and the lack of PG tags, it seems less likely...
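For reference, a command of roughly this shape reproduces the error; the jar location and file names below are placeholders for illustration, not the actual paths from the report, while the two option values are the ones described above.

```sh
# Hypothetical invocation; INPUT/OUTPUT/METRICS paths are made up.
java -jar picard.jar MarkDuplicatesWithMateCigar \
    INPUT=reads.sam \
    OUTPUT=marked.sam \
    METRICS_FILE=metrics.txt \
    MAX_RECORDS_IN_RAM=2 \
    MINIMUM_DISTANCE=212
```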
Does this make sense to anyone? Any thoughts or suggestions would be greatly appreciated!