Not a bug report, not a feature request, not a documentation requst.
So I apologize for the "free-format" report.
An improvement possible to MarkDuplicates is to validate headerlines before doing the hard work of actually marking (and optionally removing) duplicates, i.e. fail early on corrupt data.
The metrics file and the bam with duplicates removed as requested.
Received:
[Tue Sep 14 15:01:52 UTC 2021] Executing as shuang@shuang-crazy-monkey on Linux 5.4.0-1052-gcp amd64; OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:2.23.3
INFO 2021-09-14 15:01:52 MarkDuplicates Start of doWork freeMemory: 158341584; totalMemory: 167247872; maxMemory: 1737490432
INFO 2021-09-14 15:01:52 MarkDuplicates Reading input file and constructing read end information.
INFO 2021-09-14 15:01:52 MarkDuplicates Will retain up to 6295255 data points before spilling to disk.
[Tue Sep 14 15:01:52 UTC 2021] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=167247872
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMFormatException: Error parsing SAM header. DT tag value 'unknown' is not parseable as a date. Line:
@RG ID:6061-SL-0002a SM:HG00732 PL:ONT PU:unknown DT:unknown; File /home/shuang/dup_ont/HG00732.1-p.name-sorted.reheader.bam; Line number 197
at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:258)
at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:195)
at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:110)
at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:262)
at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:508)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:301)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: htsjdk.samtools.util.DateParser$InvalidDateException: Could not parse as date: unknown
at htsjdk.samtools.TextTagCodec.decodeDate(TextTagCodec.java:332)
at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:190)
... 13 more
Caused by: java.text.ParseException: Unparseable date: "unknown"
at java.text.DateFormat.parse(DateFormat.java:366)
at htsjdk.samtools.TextTagCodec.decodeDate(TextTagCodec.java:327)
... 14 more
Offending headerlines (I'm to blame for writing "unknown" for DT)
Summary:
Not a bug report, not a feature request, not a documentation requst. So I apologize for the "free-format" report.
An improvement possible to
MarkDuplicates
is to validate headerlines before doing the hard work of actually marking (and optionally removing) duplicates, i.e. fail early on corrupt data.Thanks! Steve
Report:
Command:
Expected:
The metrics file and the bam with duplicates removed as requested.
Received:
Offending headerlines (I'm to blame for writing "unknown" for
DT
)