broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
984 stars 369 forks source link

MarkDuplicates should validate header before doing actual work #1718

Open SHuang-Broad opened 3 years ago

SHuang-Broad commented 3 years ago

Summary:

Not a bug report, not a feature request, not a documentation requst. So I apologize for the "free-format" report.

An improvement possible to MarkDuplicates is to validate headerlines before doing the hard work of actually marking (and optionally removing) duplicates, i.e. fail early on corrupt data.

Thanks! Steve

Report:

Command:


gatk MarkDuplicates \
    -I HG00732.1-p.name-sorted.reheader.bam \
    -O HG00732.1-p.name-sorted.mark-duplicates.bam \
    -M HG00732.1-p.name-sorted.mark-duplicates-metrics.txt \
    --REMOVE_DUPLICATES

Expected:

The metrics file and the bam with duplicates removed as requested.

Received:

[Tue Sep 14 15:01:52 UTC 2021] Executing as shuang@shuang-crazy-monkey on Linux 5.4.0-1052-gcp amd64; OpenJDK 64-Bit Server VM 1.8.0_292-8u292-b10-0ubuntu1~18.04-b10; Deflater: Intel; Inflater: Intel; Provider GCS is available; Picard version: Version:2.23.3
INFO    2021-09-14 15:01:52 MarkDuplicates  Start of doWork freeMemory: 158341584; totalMemory: 167247872; maxMemory: 1737490432
INFO    2021-09-14 15:01:52 MarkDuplicates  Reading input file and constructing read end information.
INFO    2021-09-14 15:01:52 MarkDuplicates  Will retain up to 6295255 data points before spilling to disk.
[Tue Sep 14 15:01:52 UTC 2021] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.01 minutes.
Runtime.totalMemory()=167247872
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMFormatException: Error parsing SAM header. DT tag value 'unknown' is not parseable as a date. Line:
@RG ID:6061-SL-0002a    SM:HG00732  PL:ONT  PU:unknown  DT:unknown; File /home/shuang/dup_ont/HG00732.1-p.name-sorted.reheader.bam; Line number 197
    at htsjdk.samtools.SAMTextHeaderCodec.reportErrorParsingLine(SAMTextHeaderCodec.java:258)
    at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:195)
    at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:110)
    at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:704)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:298)
    at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:176)
    at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:396)
    at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:262)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:508)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:257)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:301)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)
Caused by: htsjdk.samtools.util.DateParser$InvalidDateException: Could not parse as date: unknown
    at htsjdk.samtools.TextTagCodec.decodeDate(TextTagCodec.java:332)
    at htsjdk.samtools.SAMTextHeaderCodec.parseRGLine(SAMTextHeaderCodec.java:190)
    ... 13 more
Caused by: java.text.ParseException: Unparseable date: "unknown"
    at java.text.DateFormat.parse(DateFormat.java:366)
    at htsjdk.samtools.TextTagCodec.decodeDate(TextTagCodec.java:327)
    ... 14 more

Offending headerlines (I'm to blame for writing "unknown" for DT)

$ samview -H HG00732.1-p.name-sorted.reheader.bam | grep -vF '@SQ'
@HD VN:1.6  SO:queryname
@RG ID:6061-SL-0002a    SM:HG00732  PL:ONT  PU:unknown  DT:unknown
@RG ID:6061-SL-0002b    SM:HG00732  PL:ONT  PU:unknown  DT:unknown
@RG ID:6061-SL-0002c    SM:HG00732  PL:ONT  PU:unknown  DT:unknown
@RG ID:6061-SL-0002d    SM:HG00732  PL:ONT  PU:unknown  DT:unknown
@PG ID:minimap2 PN:minimap2 VN:2.17-r941    CL:minimap2 -ayYL --MD -x map-ont -R @RG\tID:6061-SL-0002a\tSM:HG00732\tPL:ONT\tPU:\tDT: -t 4 /cromwell_root/broad-dsde-methods-long-reads/resources/references/grch38_noalt/GCA_000001405.15_GRCh38_no_alt_analysis_set.fa -
@PG ID:samtools PN:samtools PP:minimap2 VN:1.13 CL:samtools view -bhX -M -@ 1 --verbosity=8 --write-index -o HG00732.1-p.bam##idx##HG00732.1-p.bam.bai gs://fc-secure-b35c6ccb-068e-413f-af2c-7181a16fa5e1/4d749bf4-c32b-4477-84c1-10f1aab72dfe/ONTWholeGenome/b51cb971-643c-496e-a918-1b65db8b3168/call-MergeAllReads/cacheCopy/HG00732.bam /cromwell_root/fc-secure-b35c6ccb-068e-413f-af2c-7181a16fa5e1/4d749bf4-c32b-4477-84c1-10f1aab72dfe/ONTWholeGenome/b51cb971-643c-496e-a918-1b65db8b3168/call-MergeAllReads/cacheCopy/HG00732.bam.bai chr1:1-123400000
@PG ID:samtools.1   PN:samtools PP:samtools VN:1.13 CL:samtools sort -n -@ 1 -o HG00732.1-p.name-sorted.bam HG00732.1-p.bam
@PG ID:samtools.2   PN:samtools PP:samtools.1   VN:1.13 CL:samtools view -H HG00732.1-p.name-sorted.bam
@PG ID:samtools.3   PN:samtools PP:samtools.2   VN:1.13 CL:samtools reheader header.fixed.txt HG00732.1-p.name-sorted.bam
@PG ID:samtools.4   PN:samtools PP:samtools.3   VN:1.13 CL:samtools view -h -o HG00732.1-p.name-sorted.reheader.bam -
@PG ID:samtools.5   PN:samtools PP:samtools.4   VN:1.13 CL:samtools view -H HG00732.1-p.name-sorted.reheader.bam
gbggrant commented 3 years ago

Hi @SHuang-Broad thanks for the report. This seems like it could be a bug. We'll take a look.