broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License
967 stars 370 forks source link

[bug] MarkDuplicates throws unexpected IllegalArgumentException with queryname-sorted input #1194

Open RWilton opened 6 years ago

RWilton commented 6 years ago

Bug Report

Affected tool(s)

MarkDuplicates

Affected version(s)

Description

With SAM input having records clustered by QNAME (per the MarkDuplicates documentation) but not sorted as strings, duplicate analysis proceeds successfully but the program then halts with a java exception:

Exception in thread "main" java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for null. Sort order is queryname. Offending records are at [6:1101:9993:8042] and [6:1101:10003:7989]
        at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:213)
        at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:200)
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:406)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

Steps to reproduce

This can occur in two straightforward use cases:

Expected behavior

Program should run to completion

Actual behavior

Program throws a java exception

Additional comments

This problem could probably by hacked out by not doing any sort-order validation when SAM output is generated, but a more principled solution would be to perform only the validation required to support MarkDuplicates' unduplication functionality and to ignore the global sort order of the input (or, at most, issue a warning if an inconsistency were detected).

In the worst case, of course, MarkDuplicates can continue to accept only QNAME-ordered input as generated by the Picard SortSam tool, but a requirement that the input be re-sorted in order to satisfy MarkDuplicates becomes increasingly burdensome as data set sizes increase.

yfarjoun commented 6 years ago

fixed by #1195

RWilton commented 6 years ago

This still does not work properly.

Given unsorted SAM input grouped by QNAME (SO=unsorted GO=query), MarkDuplicates throws the following exception:

Exception in thread "main" picard.PicardException: This program requires input that are either coordinate or query sorted (according to the header, or at least ASSUME_SORT_ORDER and the content.) Found ASSUME_SORT_ORDER=null and header sortorder=unsorted
        at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:262)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:277)
        at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
        at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

This can be worked around by using ASSUME_SORT_ORDER=queryname, but then MarkDuplicates changes the SAM file header to SO=unknown GO=query.

Correct behavior would be:

yfarjoun commented 6 years ago

How is that sort-order wrong? (SO=unknown GO=query)

RWilton commented 6 years ago

I have struggled for years to comprehend the details of the SAM format specification document, and I admit that its opacity and lack of precision is still a perennial source of frustration and amusement. Nevertheless, my understanding of the difference between "unsorted" and "unknown" is simply that "unsorted" means "sorted neither by POS nor QNAME" whereas "unknown" means "unknown." In the case where the output contains alignment records grouped by QNAME, the sort order is obviously not "unknown".

In any event:

yfarjoun commented 6 years ago

@RWilton The truth is that htsjdk (on which Picard is based) doesn't have good support for group orders. This is the main reason for the awkward invocation of MarkDuplicates.

That said, I think it should be possible to accept GO=query in the header invoke the MarkDuplicates behavior as ASSUME_SORT_ORDER=queryname.

(I'm not volunteering to making that change....would you like to submit a PR?)

RWilton commented 6 years ago

Sorry but I'm unfamiliar with the process. What is a PR and what do I do to submit it?

moldach commented 4 years ago

I get the same error when using the snakemake wrapper for Picard MarkDuplicates

code

rule alignment:
    input:
        reads=["trimming/trimmomatic/{sample}_R1_trim_paired.fq.gz", "trimming/trimmomatic/{sample}_R2_trim_paired.fq.gz"]
    output:
        os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.bam")
    log: os.path.join(dirs_dict["LOG_DIR"],config["ALIGN_TOOL"],"{sample}.log")
    message: """--- Alignment with BWA."""
    threads: 8
    resources:
        mem = 2500,
        time = 100
    params:
        index=os.path.join(dirs_dict["REF_DIR"], config["REF_GENOME"]),
        extra=r"-R '@RG\tID:{sample}\tPL:ILLUMINA\tSM:{sample}'",
        sort="samtools",
        sort_order="queryname",
        sort_extra=""
    wrapper:
        "0.64.0/bio/bwa/mem"

rule samtools_index_bam:
    input: lambda wildcards: getSortedBams(wildcards.sample)
    output: os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.bam.bai")
    log: os.path.join(dirs_dict["LOG_DIR"],config["ALIGN_TOOL"],"{sample}_index.log")
    resources:
        mem = 2000,
        time = 60
    message: """--- Index BAM files."""
    params:
        ""
    wrapper:
        "0.64.0/bio/samtools/index"

rule mark_duplicates:
    input:
        lambda wildcards: getSortedBams(wildcards.sample),
    output:
        bam = os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.dedupped.bam"),
        metrics = os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.dedupped.txt")
    log: os.path.join(dirs_dict["LOG_DIR"],config["ALIGN_TOOL"],"{sample}_markDuplicates.log")
    resources:
        mem = 2000,
        time = 60
    message: """--- Mark duplicates."""
    threads: 1
    params:
        "REMOVE_DUPLICATES=true"
    wrapper:
        "0.64.0/bio/picard/markduplicates"

error

INFO    2020-09-02 20:57:41 MarkDuplicates  

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    MarkDuplicates -REMOVE_DUPLICATES true -INPUT alignment/bwa/MTG325.sorted.bam -OUTPUT alignment/bwa/MTG325.sorted.dedupped.bam -METRICS_FILE alignment/bwa/MTG325.sorted.dedupped.txt
**********

20:57:42.172 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/foo/wrappers/.snakemake/conda/684ad37b/share/picard-2.22.1-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Sep 02 20:57:42 MDT 2020] MarkDuplicates INPUT=[alignment/bwa/MTG325.sorted.bam] OUTPUT=alignment/bwa/MTG325.sorted.dedupped.bam METRICS_FILE=alignment/bwa/MTG325.sorted.dedupped.txt REMOVE_DUPLICATES=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Sep 02 20:57:42 MDT 2020] Executing as FOO on Linux 3.10.0-1127.18.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.8-internal+0-adhoc..src; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.22.1
INFO    2020-09-02 20:57:42 MarkDuplicates  Start of doWork freeMemory: 510708504; totalMemory: 519110656; maxMemory: 2075918336
INFO    2020-09-02 20:57:42 MarkDuplicates  Reading input file and constructing read end information.
INFO    2020-09-02 20:57:42 MarkDuplicates  Will retain up to 7521443 data points before spilling to disk.
INFO    2020-09-02 20:57:42 MarkDuplicates  Seen many non-increasing record positions. Printing Read-names as well.
INFO    2020-09-02 20:57:47 MarkDuplicates  Read     1,000,000 records.  Elapsed time: 00:00:04s.  Time for last 1,000,000:    4s.  Last read position: V:6,942,473.  Last read name: A00987:53:HLKMJDRXX:1:1114:23131:7232
INFO    2020-09-02 20:57:47 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:57:50 MarkDuplicates  Read     2,000,000 records.  Elapsed time: 00:00:08s.  Time for last 1,000,000:    3s.  Last read position: III:9,865,842.  Last read name: A00987:53:HLKMJDRXX:1:1127:31268:17973
INFO    2020-09-02 20:57:50 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:57:53 MarkDuplicates  Read     3,000,000 records.  Elapsed time: 00:00:11s.  Time for last 1,000,000:    2s.  Last read position: II:13,162,966.  Last read name: A00987:53:HLKMJDRXX:1:1140:29930:27837
INFO    2020-09-02 20:57:53 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:57:56 MarkDuplicates  Read     4,000,000 records.  Elapsed time: 00:00:14s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:1:1153:28456:24314
INFO    2020-09-02 20:57:56 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:57:59 MarkDuplicates  Read     5,000,000 records.  Elapsed time: 00:00:17s.  Time for last 1,000,000:    3s.  Last read position: V:13,503,507.  Last read name: A00987:53:HLKMJDRXX:1:1166:32271:14074
INFO    2020-09-02 20:57:59 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:02 MarkDuplicates  Read     6,000,000 records.  Elapsed time: 00:00:20s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:1:1202:7943:36385
INFO    2020-09-02 20:58:02 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:05 MarkDuplicates  Read     7,000,000 records.  Elapsed time: 00:00:23s.  Time for last 1,000,000:    2s.  Last read position: X:10,199,421.  Last read name: A00987:53:HLKMJDRXX:1:1215:15393:32346
INFO    2020-09-02 20:58:05 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:58:09 MarkDuplicates  Read     8,000,000 records.  Elapsed time: 00:00:26s.  Time for last 1,000,000:    3s.  Last read position: I:11,309,476.  Last read name: A00987:53:HLKMJDRXX:1:1228:10520:5400
INFO    2020-09-02 20:58:09 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:12 MarkDuplicates  Read     9,000,000 records.  Elapsed time: 00:00:29s.  Time for last 1,000,000:    3s.  Last read position: X:13,127,423.  Last read name: A00987:53:HLKMJDRXX:1:1240:30572:4460
INFO    2020-09-02 20:58:12 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:15 MarkDuplicates  Read    10,000,000 records.  Elapsed time: 00:00:32s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:1:1253:22408:16501
INFO    2020-09-02 20:58:15 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:18 MarkDuplicates  Read    11,000,000 records.  Elapsed time: 00:00:35s.  Time for last 1,000,000:    2s.  Last read position: III:10,380,120.  Last read name: A00987:53:HLKMJDRXX:1:1266:16568:6104
INFO    2020-09-02 20:58:18 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:29 MarkDuplicates  Read    12,000,000 records.  Elapsed time: 00:00:47s.  Time for last 1,000,000:   11s.  Last read position: III:9,820,195.  Last read name: A00987:53:HLKMJDRXX:1:2101:18078:31704
INFO    2020-09-02 20:58:29 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:32 MarkDuplicates  Read    13,000,000 records.  Elapsed time: 00:00:50s.  Time for last 1,000,000:    2s.  Last read position: II:10,121,616.  Last read name: A00987:53:HLKMJDRXX:1:2115:17092:19946
INFO    2020-09-02 20:58:32 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:58:35 MarkDuplicates  Read    14,000,000 records.  Elapsed time: 00:00:53s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:1:2129:30300:15170
INFO    2020-09-02 20:58:35 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:38 MarkDuplicates  Read    15,000,000 records.  Elapsed time: 00:00:56s.  Time for last 1,000,000:    3s.  Last read position: IV:1,814,853.  Last read name: A00987:53:HLKMJDRXX:1:2144:10565:33348
INFO    2020-09-02 20:58:38 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:58:41 MarkDuplicates  Read    16,000,000 records.  Elapsed time: 00:00:59s.  Time for last 1,000,000:    2s.  Last read position: IV:8,208,722.  Last read name: A00987:53:HLKMJDRXX:1:2158:27778:17096
INFO    2020-09-02 20:58:41 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:58:44 MarkDuplicates  Read    17,000,000 records.  Elapsed time: 00:01:02s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:1:2172:19768:9706
INFO    2020-09-02 20:58:44 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:47 MarkDuplicates  Read    18,000,000 records.  Elapsed time: 00:01:05s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:1:2207:29279:5572
INFO    2020-09-02 20:58:47 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:50 MarkDuplicates  Read    19,000,000 records.  Elapsed time: 00:01:08s.  Time for last 1,000,000:    2s.  Last read position: X:14,146,489.  Last read name: A00987:53:HLKMJDRXX:1:2221:27489:8202
INFO    2020-09-02 20:58:50 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:53 MarkDuplicates  Read    20,000,000 records.  Elapsed time: 00:01:11s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:1:2235:29387:3944
INFO    2020-09-02 20:58:53 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:58:56 MarkDuplicates  Read    21,000,000 records.  Elapsed time: 00:01:14s.  Time for last 1,000,000:    2s.  Last read position: X:8,058,975.  Last read name: A00987:53:HLKMJDRXX:1:2250:3568:24737
INFO    2020-09-02 20:58:56 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:59:00 MarkDuplicates  Read    22,000,000 records.  Elapsed time: 00:01:18s.  Time for last 1,000,000:    3s.  Last read position: II:60,274.  Last read name: A00987:53:HLKMJDRXX:1:2264:5547:14011
INFO    2020-09-02 20:59:00 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:59:18 MarkDuplicates  Read    23,000,000 records.  Elapsed time: 00:01:36s.  Time for last 1,000,000:   18s.  Last read position: V:4,110,005.  Last read name: A00987:53:HLKMJDRXX:1:2277:14714:25817
INFO    2020-09-02 20:59:18 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:21 MarkDuplicates  Read    24,000,000 records.  Elapsed time: 00:01:39s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:2:1113:10140:6089
INFO    2020-09-02 20:59:21 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:24 MarkDuplicates  Read    25,000,000 records.  Elapsed time: 00:01:42s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:2:1126:20256:2221
INFO    2020-09-02 20:59:24 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:27 MarkDuplicates  Read    26,000,000 records.  Elapsed time: 00:01:45s.  Time for last 1,000,000:    3s.  Last read position: X:3,480,962.  Last read name: A00987:53:HLKMJDRXX:2:1139:32054:29606
INFO    2020-09-02 20:59:27 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:30 MarkDuplicates  Read    27,000,000 records.  Elapsed time: 00:01:48s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:2:1153:3992:12132
INFO    2020-09-02 20:59:30 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:33 MarkDuplicates  Read    28,000,000 records.  Elapsed time: 00:01:51s.  Time for last 1,000,000:    2s.  Last read position: III:1,898,012.  Last read name: A00987:53:HLKMJDRXX:2:1166:3296:5885
INFO    2020-09-02 20:59:33 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:37 MarkDuplicates  Read    29,000,000 records.  Elapsed time: 00:01:54s.  Time for last 1,000,000:    3s.  Last read position: II:11,122,617.  Last read name: A00987:53:HLKMJDRXX:2:1201:8874:20431
INFO    2020-09-02 20:59:37 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:59:39 MarkDuplicates  Read    30,000,000 records.  Elapsed time: 00:01:57s.  Time for last 1,000,000:    2s.  Last read position: V:19,878,808.  Last read name: A00987:53:HLKMJDRXX:2:1214:21233:28494
INFO    2020-09-02 20:59:39 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 20:59:42 MarkDuplicates  Read    31,000,000 records.  Elapsed time: 00:02:00s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:2:1227:25012:4006
INFO    2020-09-02 20:59:42 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:45 MarkDuplicates  Read    32,000,000 records.  Elapsed time: 00:02:03s.  Time for last 1,000,000:    2s.  Last read position: X:16,334,136.  Last read name: A00987:53:HLKMJDRXX:2:1240:31991:9987
INFO    2020-09-02 20:59:45 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:48 MarkDuplicates  Read    33,000,000 records.  Elapsed time: 00:02:06s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:2:1254:6867:5682
INFO    2020-09-02 20:59:48 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 20:59:51 MarkDuplicates  Read    34,000,000 records.  Elapsed time: 00:02:09s.  Time for last 1,000,000:    2s.  Last read position: II:1,300,008.  Last read name: A00987:53:HLKMJDRXX:2:1267:7509:8860
INFO    2020-09-02 20:59:51 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:01 MarkDuplicates  Read    35,000,000 records.  Elapsed time: 00:02:19s.  Time for last 1,000,000:    9s.  Last read position: IV:7,401,409.  Last read name: A00987:53:HLKMJDRXX:2:2102:23972:4085
INFO    2020-09-02 21:00:01 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 21:00:04 MarkDuplicates  Read    36,000,000 records.  Elapsed time: 00:02:22s.  Time for last 1,000,000:    2s.  Last read position: III:10,225,368.  Last read name: A00987:53:HLKMJDRXX:2:2117:11686:35853
INFO    2020-09-02 21:00:04 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:07 MarkDuplicates  Read    37,000,000 records.  Elapsed time: 00:02:25s.  Time for last 1,000,000:    2s.  Last read position: X:12,551,293.  Last read name: A00987:53:HLKMJDRXX:2:2132:13132:14403
INFO    2020-09-02 21:00:07 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 21:00:10 MarkDuplicates  Read    38,000,000 records.  Elapsed time: 00:02:27s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:2:2147:21513:4648
INFO    2020-09-02 21:00:10 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:13 MarkDuplicates  Read    39,000,000 records.  Elapsed time: 00:02:30s.  Time for last 1,000,000:    2s.  Last read position: IV:601,530.  Last read name: A00987:53:HLKMJDRXX:2:2162:4878:4272
INFO    2020-09-02 21:00:13 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:16 MarkDuplicates  Read    40,000,000 records.  Elapsed time: 00:02:33s.  Time for last 1,000,000:    2s.  Last read position: IV:16,184,593.  Last read name: A00987:53:HLKMJDRXX:2:2176:3748:32597
INFO    2020-09-02 21:00:16 MarkDuplicates  Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO    2020-09-02 21:00:19 MarkDuplicates  Read    41,000,000 records.  Elapsed time: 00:02:36s.  Time for last 1,000,000:    2s.  Last read position: X:17,334,102.  Last read name: A00987:53:HLKMJDRXX:2:2212:5999:33646
INFO    2020-09-02 21:00:19 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:21 MarkDuplicates  Read    42,000,000 records.  Elapsed time: 00:02:39s.  Time for last 1,000,000:    2s.  Last read position: I:7,023,334.  Last read name: A00987:53:HLKMJDRXX:2:2226:32687:19617
INFO    2020-09-02 21:00:21 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:24 MarkDuplicates  Read    43,000,000 records.  Elapsed time: 00:02:42s.  Time for last 1,000,000:    2s.  Last read position: X:3,011,110.  Last read name: A00987:53:HLKMJDRXX:2:2242:20537:5995
INFO    2020-09-02 21:00:24 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:28 MarkDuplicates  Read    44,000,000 records.  Elapsed time: 00:02:46s.  Time for last 1,000,000:    4s.  Last read position: III:7,428,654.  Last read name: A00987:53:HLKMJDRXX:2:2257:18756:14967
INFO    2020-09-02 21:00:28 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:31 MarkDuplicates  Read    45,000,000 records.  Elapsed time: 00:02:49s.  Time for last 1,000,000:    2s.  Last read position: */*.  Last read name: A00987:53:HLKMJDRXX:2:2271:25672:20682
INFO    2020-09-02 21:00:31 MarkDuplicates  Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO    2020-09-02 21:00:33 MarkDuplicates  Read 45528707 records. 0 pairs never matched.
INFO    2020-09-02 21:00:49 MarkDuplicates  After buildSortedReadEndLists freeMemory: 2058495896; totalMemory: 2075918336; maxMemory: 2075918336
INFO    2020-09-02 21:00:49 MarkDuplicates  Will retain up to 64872448 duplicate indices before spilling to disk.
INFO    2020-09-02 21:00:49 MarkDuplicates  Traversing read pair information and detecting duplicates.
INFO    2020-09-02 21:00:55 MarkDuplicates  Traversing fragment information and detecting duplicates.
INFO    2020-09-02 21:01:03 MarkDuplicates  Sorting list of duplicate records.
INFO    2020-09-02 21:01:03 MarkDuplicates  After generateDuplicateIndexes freeMemory: 1539430992; totalMemory: 2075918336; maxMemory: 2075918336
INFO    2020-09-02 21:01:03 MarkDuplicates  Marking 914823 records as duplicates.
INFO    2020-09-02 21:01:03 MarkDuplicates  Found 89855 optical duplicate clusters.
INFO    2020-09-02 21:01:03 MarkDuplicates  Reads are assumed to be ordered by: queryname
[Wed Sep 02 21:01:03 MDT 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 3.36 minutes.
Runtime.totalMemory()=2075918336
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///home/foo/wrappers/alignment/bwa/MTG325.sorted.dedupped.bam. Sort order is queryname. Offending records are at [A00987:53:HLKMJDRXX:1:1101:1018:7326] and [A00987:53:HLKMJDRXX:1:1101:1018:14278]
    at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:197)
    at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:184)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:425)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
moldach commented 3 years ago

Also getting this error from trying to run picard AddOrReplaceReadGroups: https://github.com/zlskidmore/docker-picard/issues/2

yfarjoun commented 3 years ago

are you sure that the input files are, in fact, queryname sorted? do they pass Picard's ValidateSamFile?

esamorodnitsky commented 3 years ago

Looks like I am getting this error too, but with MarkIlluminaAdapters using unmapped BAMs.

esamorodnitsky commented 3 years ago

Looks like I am getting this error too, but with MarkIlluminaAdapters using unmapped BAMs.

I was able to get around this error by renaming my reads like "000000001", "000000002", "000000003", etc.

Neato-Nick commented 1 month ago

Old issue, but got here from a google search. I got this error as well, also in MarkDuplicates. I solved with a different workaround. For some reason, lexicographical query sort (samtools sort -N) passes ValidateSamFile and works in MarkDuplicates, but alpha-numeric query sort (samtools sort -n) does not. I liked this workaround a lot better than re-naming the reads. FWIW, in 2020, the lexicographical sort may not even have been implemented yet.