broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
981 stars 369 forks source link

checking args in MarkDuplicates for user error #1261

Closed jayoung closed 5 years ago

jayoung commented 5 years ago

Hi there,

I just spent quite a while tracking down a problem I was having with MarkDuplicates (Picard version: 2.18.22-SNAPSHOT). Google searches seemed to suggest it might be bad read mappings in my bam file, but it turned out it was simply a stupid user error - I was using a bad parameter value : "ADD_PG_TAG_TO_READS=null" - should be "false" not "null" (I've been updating my pipelines from an older version of picard where I was using PG=null and didn't check carefully enough).

That caused a mystery crash after running MarkDuplicates on the entire bam file. It'd be great if picard could check the args before starting to make sure they're valid, and give an informative message rather than the mystery crash.

This is what I should have done: picard.jar MarkDuplicates INPUT=small.bam OUTPUT=small.marked.bam METRICS_FILE=small.marked.metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT ADD_PG_TAG_TO_READS=false

And this is what the crash looks like:

picard.jar MarkDuplicates INPUT=small.bam OUTPUT=small.marked2.bam METRICS_FILE=small.marked2.metrics CREATE_INDEX=true VALIDATION_STRINGENCY=LENIENT ADD_PG_TAG_TO_READS=null

INFO 2019-01-04 11:34:55 MarkDuplicates

** NOTE: Picard's command line syntax is changing.


** For more information, please see: ** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


** The command line looks like this in the new syntax:


** MarkDuplicates -INPUT small.bam -OUTPUT small.marked2.bam -METRICS_FILE small.marked2.metrics -CREATE_INDEX true -VALIDATION_STRINGENCY LENIENT -ADD_PG_TAG_TO_READS null


11:34:55.593 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/fh/fast/malik_h/grp/malik_lab_shared/linux_gizmo/bin/picard.jar!/com/intel/gkl/native/libgkl_compression.so [Fri Jan 04 11:34:55 PST 2019] MarkDuplicates ADD_PG_TAG_TO_READS=null INPUT=[small.bam] OUTPUT=small.marked2.bam METRICS_FILE=small.marked2.metrics VALIDATION_STRINGENCY=LENIENT CREATE_INDEX=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false [Fri Jan 04 11:34:55 PST 2019] Executing as jayoung@gizmof98.fhcrc.org on Linux 3.13.0-143-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.18.22-SNAPSHOT INFO 2019-01-04 11:34:55 MarkDuplicates Start of doWork freeMemory: 492610088; totalMemory: 504889344; maxMemory: 7460618240 INFO 2019-01-04 11:34:55 MarkDuplicates Reading input file and constructing read end information. INFO 2019-01-04 11:34:55 MarkDuplicates Will retain up to 27031225 data points before spilling to disk. INFO 2019-01-04 11:34:55 MarkDuplicates Read 144 records. 23 pairs never matched. INFO 2019-01-04 11:34:55 MarkDuplicates After buildSortedReadEndLists freeMemory: 493125664; totalMemory: 718274560; maxMemory: 7460618240 INFO 2019-01-04 11:34:55 MarkDuplicates Will retain up to 233144320 duplicate indices before spilling to disk. INFO 2019-01-04 11:34:56 MarkDuplicates Traversing read pair information and detecting duplicates. INFO 2019-01-04 11:34:56 MarkDuplicates Traversing fragment information and detecting duplicates. INFO 2019-01-04 11:34:56 MarkDuplicates Sorting list of duplicate records. INFO 2019-01-04 11:34:56 MarkDuplicates After generateDuplicateIndexes freeMemory: 783432104; totalMemory: 2658664448; maxMemory: 7460618240 INFO 2019-01-04 11:34:56 MarkDuplicates Marking 17 records as duplicates. INFO 2019-01-04 11:34:56 MarkDuplicates Found 0 optical duplicate clusters. INFO 2019-01-04 11:34:56 MarkDuplicates Reads are assumed to be ordered by: coordinate [Fri Jan 04 11:34:57 PST 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.03 minutes. Runtime.totalMemory()=2985295872 To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp Exception in thread "main" java.lang.NullPointerException at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:454) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Thanks for considering it,

Janet Young


Dr. Janet Young

Malik lab http://research.fhcrc.org/malik/en.html

Division of Basic Sciences Fred Hutchinson Cancer Research Center 1100 Fairview Avenue N., A2-025, P.O. Box 19024, Seattle, WA 98109-1024, USA.

tel: (206) 667 4512 email: jayoung ...at... fredhutch.org


cmnbroad commented 5 years ago

Yes, the value checking should be better. The argument value "null" has special meaning in that the parser interprets it to mean that the initial value(s) provided by the tool should be cleared.

There is still ambiguity around the meaning of "optional" though. If ADD_PG_TAG_TO_READS didn't have an initial value, the parser would have insisted that some value besides null be provided. But the initial value causes the parser to treat it as "optional" (meaning not require to have any value), and thus allowed to be null.

This particular issue (null for booleans) is fixed in the Barclay parser, but some of the ambiguity around the meaning of "optional" still exists, and should be addressed there. See https://github.com/broadinstitute/barclay/issues/23, https://github.com/broadinstitute/barclay/issues/115 and https://github.com/broadinstitute/barclay/issues/48.

jayoung commented 5 years ago

Thanks for looking into this - appreciate it. Maybe it's time I add "-Dpicard.useLegacyParser=false" and switch to new syntax.

Janet

lbergelson commented 5 years ago

@cmnbroad Why is PGTagArgumentCollection.ADD_PG_TAG_TO_READS a Boolean instead of a boolean. Changing it to the primitive would probably prevent this problem.

cmnbroad commented 5 years ago

No idea why it was made a Boolean in the first place, but yes, changing it would cause it to be rejected in the legacy parser as well.