Open RWilton opened 6 years ago
fixed by #1195
This still does not work properly.
Given unsorted SAM input grouped by QNAME (SO=unsorted GO=query
), MarkDuplicates throws the following exception:
Exception in thread "main" picard.PicardException: This program requires input that are either coordinate or query sorted (according to the header, or at least ASSUME_SORT_ORDER and the content.) Found ASSUME_SORT_ORDER=null and header sortorder=unsorted
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:262)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:277)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
This can be worked around by using ASSUME_SORT_ORDER=queryname
, but then MarkDuplicates changes the SAM file header to SO=unknown GO=query
.
Correct behavior would be:
How is that sort-order wrong? (SO=unknown GO=query
)
I have struggled for years to comprehend the details of the SAM format specification document, and I admit that its opacity and lack of precision is still a perennial source of frustration and amusement. Nevertheless, my understanding of the difference between "unsorted" and "unknown" is simply that "unsorted" means "sorted neither by POS nor QNAME" whereas "unknown" means "unknown." In the case where the output contains alignment records grouped by QNAME, the sort order is obviously not "unknown".
In any event:
ASSUME_SORT_ORDER=queryname
in order for it to emit an unduplicated SAM file without throwing an exception. SO=unsorted GO=query
(i.e., grouped by QNAME, not globally sorted) should suffice.SO=unsorted GO=query
is no longer true, it should not be fooling around with the SAM header record, regardless of how one might interpret "unknown" versus "unsorted" in this context.@RWilton The truth is that htsjdk (on which Picard is based) doesn't have good support for group orders. This is the main reason for the awkward invocation of MarkDuplicates.
That said, I think it should be possible to accept GO=query
in the header invoke the MarkDuplicates behavior as ASSUME_SORT_ORDER=queryname
.
(I'm not volunteering to making that change....would you like to submit a PR?)
Sorry but I'm unfamiliar with the process. What is a PR and what do I do to submit it?
I get the same error when using the snakemake wrapper for Picard MarkDuplicates
rule alignment:
input:
reads=["trimming/trimmomatic/{sample}_R1_trim_paired.fq.gz", "trimming/trimmomatic/{sample}_R2_trim_paired.fq.gz"]
output:
os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.bam")
log: os.path.join(dirs_dict["LOG_DIR"],config["ALIGN_TOOL"],"{sample}.log")
message: """--- Alignment with BWA."""
threads: 8
resources:
mem = 2500,
time = 100
params:
index=os.path.join(dirs_dict["REF_DIR"], config["REF_GENOME"]),
extra=r"-R '@RG\tID:{sample}\tPL:ILLUMINA\tSM:{sample}'",
sort="samtools",
sort_order="queryname",
sort_extra=""
wrapper:
"0.64.0/bio/bwa/mem"
rule samtools_index_bam:
input: lambda wildcards: getSortedBams(wildcards.sample)
output: os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.bam.bai")
log: os.path.join(dirs_dict["LOG_DIR"],config["ALIGN_TOOL"],"{sample}_index.log")
resources:
mem = 2000,
time = 60
message: """--- Index BAM files."""
params:
""
wrapper:
"0.64.0/bio/samtools/index"
rule mark_duplicates:
input:
lambda wildcards: getSortedBams(wildcards.sample),
output:
bam = os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.dedupped.bam"),
metrics = os.path.join(dirs_dict["ALIGN_DIR"],config["ALIGN_TOOL"],"{sample}.sorted.dedupped.txt")
log: os.path.join(dirs_dict["LOG_DIR"],config["ALIGN_TOOL"],"{sample}_markDuplicates.log")
resources:
mem = 2000,
time = 60
message: """--- Mark duplicates."""
threads: 1
params:
"REMOVE_DUPLICATES=true"
wrapper:
"0.64.0/bio/picard/markduplicates"
INFO 2020-09-02 20:57:41 MarkDuplicates
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** MarkDuplicates -REMOVE_DUPLICATES true -INPUT alignment/bwa/MTG325.sorted.bam -OUTPUT alignment/bwa/MTG325.sorted.dedupped.bam -METRICS_FILE alignment/bwa/MTG325.sorted.dedupped.txt
**********
20:57:42.172 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/home/foo/wrappers/.snakemake/conda/684ad37b/share/picard-2.22.1-0/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Wed Sep 02 20:57:42 MDT 2020] MarkDuplicates INPUT=[alignment/bwa/MTG325.sorted.bam] OUTPUT=alignment/bwa/MTG325.sorted.dedupped.bam METRICS_FILE=alignment/bwa/MTG325.sorted.dedupped.txt REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 TAG_DUPLICATE_SET_MEMBERS=false REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag CLEAR_DT=true DUPLEX_UMI=false ADD_PG_TAG_TO_READS=true ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 MAX_OPTICAL_DUPLICATE_SET_SIZE=300000 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Wed Sep 02 20:57:42 MDT 2020] Executing as FOO on Linux 3.10.0-1127.18.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 11.0.8-internal+0-adhoc..src; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.22.1
INFO 2020-09-02 20:57:42 MarkDuplicates Start of doWork freeMemory: 510708504; totalMemory: 519110656; maxMemory: 2075918336
INFO 2020-09-02 20:57:42 MarkDuplicates Reading input file and constructing read end information.
INFO 2020-09-02 20:57:42 MarkDuplicates Will retain up to 7521443 data points before spilling to disk.
INFO 2020-09-02 20:57:42 MarkDuplicates Seen many non-increasing record positions. Printing Read-names as well.
INFO 2020-09-02 20:57:47 MarkDuplicates Read 1,000,000 records. Elapsed time: 00:00:04s. Time for last 1,000,000: 4s. Last read position: V:6,942,473. Last read name: A00987:53:HLKMJDRXX:1:1114:23131:7232
INFO 2020-09-02 20:57:47 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:57:50 MarkDuplicates Read 2,000,000 records. Elapsed time: 00:00:08s. Time for last 1,000,000: 3s. Last read position: III:9,865,842. Last read name: A00987:53:HLKMJDRXX:1:1127:31268:17973
INFO 2020-09-02 20:57:50 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:57:53 MarkDuplicates Read 3,000,000 records. Elapsed time: 00:00:11s. Time for last 1,000,000: 2s. Last read position: II:13,162,966. Last read name: A00987:53:HLKMJDRXX:1:1140:29930:27837
INFO 2020-09-02 20:57:53 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:57:56 MarkDuplicates Read 4,000,000 records. Elapsed time: 00:00:14s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:1:1153:28456:24314
INFO 2020-09-02 20:57:56 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:57:59 MarkDuplicates Read 5,000,000 records. Elapsed time: 00:00:17s. Time for last 1,000,000: 3s. Last read position: V:13,503,507. Last read name: A00987:53:HLKMJDRXX:1:1166:32271:14074
INFO 2020-09-02 20:57:59 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:02 MarkDuplicates Read 6,000,000 records. Elapsed time: 00:00:20s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:1:1202:7943:36385
INFO 2020-09-02 20:58:02 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:05 MarkDuplicates Read 7,000,000 records. Elapsed time: 00:00:23s. Time for last 1,000,000: 2s. Last read position: X:10,199,421. Last read name: A00987:53:HLKMJDRXX:1:1215:15393:32346
INFO 2020-09-02 20:58:05 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:58:09 MarkDuplicates Read 8,000,000 records. Elapsed time: 00:00:26s. Time for last 1,000,000: 3s. Last read position: I:11,309,476. Last read name: A00987:53:HLKMJDRXX:1:1228:10520:5400
INFO 2020-09-02 20:58:09 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:12 MarkDuplicates Read 9,000,000 records. Elapsed time: 00:00:29s. Time for last 1,000,000: 3s. Last read position: X:13,127,423. Last read name: A00987:53:HLKMJDRXX:1:1240:30572:4460
INFO 2020-09-02 20:58:12 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:15 MarkDuplicates Read 10,000,000 records. Elapsed time: 00:00:32s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:1:1253:22408:16501
INFO 2020-09-02 20:58:15 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:18 MarkDuplicates Read 11,000,000 records. Elapsed time: 00:00:35s. Time for last 1,000,000: 2s. Last read position: III:10,380,120. Last read name: A00987:53:HLKMJDRXX:1:1266:16568:6104
INFO 2020-09-02 20:58:18 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:29 MarkDuplicates Read 12,000,000 records. Elapsed time: 00:00:47s. Time for last 1,000,000: 11s. Last read position: III:9,820,195. Last read name: A00987:53:HLKMJDRXX:1:2101:18078:31704
INFO 2020-09-02 20:58:29 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:32 MarkDuplicates Read 13,000,000 records. Elapsed time: 00:00:50s. Time for last 1,000,000: 2s. Last read position: II:10,121,616. Last read name: A00987:53:HLKMJDRXX:1:2115:17092:19946
INFO 2020-09-02 20:58:32 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:58:35 MarkDuplicates Read 14,000,000 records. Elapsed time: 00:00:53s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:1:2129:30300:15170
INFO 2020-09-02 20:58:35 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:38 MarkDuplicates Read 15,000,000 records. Elapsed time: 00:00:56s. Time for last 1,000,000: 3s. Last read position: IV:1,814,853. Last read name: A00987:53:HLKMJDRXX:1:2144:10565:33348
INFO 2020-09-02 20:58:38 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:58:41 MarkDuplicates Read 16,000,000 records. Elapsed time: 00:00:59s. Time for last 1,000,000: 2s. Last read position: IV:8,208,722. Last read name: A00987:53:HLKMJDRXX:1:2158:27778:17096
INFO 2020-09-02 20:58:41 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:58:44 MarkDuplicates Read 17,000,000 records. Elapsed time: 00:01:02s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:1:2172:19768:9706
INFO 2020-09-02 20:58:44 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:47 MarkDuplicates Read 18,000,000 records. Elapsed time: 00:01:05s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:1:2207:29279:5572
INFO 2020-09-02 20:58:47 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:50 MarkDuplicates Read 19,000,000 records. Elapsed time: 00:01:08s. Time for last 1,000,000: 2s. Last read position: X:14,146,489. Last read name: A00987:53:HLKMJDRXX:1:2221:27489:8202
INFO 2020-09-02 20:58:50 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:53 MarkDuplicates Read 20,000,000 records. Elapsed time: 00:01:11s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:1:2235:29387:3944
INFO 2020-09-02 20:58:53 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:58:56 MarkDuplicates Read 21,000,000 records. Elapsed time: 00:01:14s. Time for last 1,000,000: 2s. Last read position: X:8,058,975. Last read name: A00987:53:HLKMJDRXX:1:2250:3568:24737
INFO 2020-09-02 20:58:56 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:59:00 MarkDuplicates Read 22,000,000 records. Elapsed time: 00:01:18s. Time for last 1,000,000: 3s. Last read position: II:60,274. Last read name: A00987:53:HLKMJDRXX:1:2264:5547:14011
INFO 2020-09-02 20:59:00 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:59:18 MarkDuplicates Read 23,000,000 records. Elapsed time: 00:01:36s. Time for last 1,000,000: 18s. Last read position: V:4,110,005. Last read name: A00987:53:HLKMJDRXX:1:2277:14714:25817
INFO 2020-09-02 20:59:18 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:21 MarkDuplicates Read 24,000,000 records. Elapsed time: 00:01:39s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:2:1113:10140:6089
INFO 2020-09-02 20:59:21 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:24 MarkDuplicates Read 25,000,000 records. Elapsed time: 00:01:42s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:2:1126:20256:2221
INFO 2020-09-02 20:59:24 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:27 MarkDuplicates Read 26,000,000 records. Elapsed time: 00:01:45s. Time for last 1,000,000: 3s. Last read position: X:3,480,962. Last read name: A00987:53:HLKMJDRXX:2:1139:32054:29606
INFO 2020-09-02 20:59:27 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:30 MarkDuplicates Read 27,000,000 records. Elapsed time: 00:01:48s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:2:1153:3992:12132
INFO 2020-09-02 20:59:30 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:33 MarkDuplicates Read 28,000,000 records. Elapsed time: 00:01:51s. Time for last 1,000,000: 2s. Last read position: III:1,898,012. Last read name: A00987:53:HLKMJDRXX:2:1166:3296:5885
INFO 2020-09-02 20:59:33 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:37 MarkDuplicates Read 29,000,000 records. Elapsed time: 00:01:54s. Time for last 1,000,000: 3s. Last read position: II:11,122,617. Last read name: A00987:53:HLKMJDRXX:2:1201:8874:20431
INFO 2020-09-02 20:59:37 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:59:39 MarkDuplicates Read 30,000,000 records. Elapsed time: 00:01:57s. Time for last 1,000,000: 2s. Last read position: V:19,878,808. Last read name: A00987:53:HLKMJDRXX:2:1214:21233:28494
INFO 2020-09-02 20:59:39 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 20:59:42 MarkDuplicates Read 31,000,000 records. Elapsed time: 00:02:00s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:2:1227:25012:4006
INFO 2020-09-02 20:59:42 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:45 MarkDuplicates Read 32,000,000 records. Elapsed time: 00:02:03s. Time for last 1,000,000: 2s. Last read position: X:16,334,136. Last read name: A00987:53:HLKMJDRXX:2:1240:31991:9987
INFO 2020-09-02 20:59:45 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:48 MarkDuplicates Read 33,000,000 records. Elapsed time: 00:02:06s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:2:1254:6867:5682
INFO 2020-09-02 20:59:48 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 20:59:51 MarkDuplicates Read 34,000,000 records. Elapsed time: 00:02:09s. Time for last 1,000,000: 2s. Last read position: II:1,300,008. Last read name: A00987:53:HLKMJDRXX:2:1267:7509:8860
INFO 2020-09-02 20:59:51 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:01 MarkDuplicates Read 35,000,000 records. Elapsed time: 00:02:19s. Time for last 1,000,000: 9s. Last read position: IV:7,401,409. Last read name: A00987:53:HLKMJDRXX:2:2102:23972:4085
INFO 2020-09-02 21:00:01 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 21:00:04 MarkDuplicates Read 36,000,000 records. Elapsed time: 00:02:22s. Time for last 1,000,000: 2s. Last read position: III:10,225,368. Last read name: A00987:53:HLKMJDRXX:2:2117:11686:35853
INFO 2020-09-02 21:00:04 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:07 MarkDuplicates Read 37,000,000 records. Elapsed time: 00:02:25s. Time for last 1,000,000: 2s. Last read position: X:12,551,293. Last read name: A00987:53:HLKMJDRXX:2:2132:13132:14403
INFO 2020-09-02 21:00:07 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 21:00:10 MarkDuplicates Read 38,000,000 records. Elapsed time: 00:02:27s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:2:2147:21513:4648
INFO 2020-09-02 21:00:10 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:13 MarkDuplicates Read 39,000,000 records. Elapsed time: 00:02:30s. Time for last 1,000,000: 2s. Last read position: IV:601,530. Last read name: A00987:53:HLKMJDRXX:2:2162:4878:4272
INFO 2020-09-02 21:00:13 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:16 MarkDuplicates Read 40,000,000 records. Elapsed time: 00:02:33s. Time for last 1,000,000: 2s. Last read position: IV:16,184,593. Last read name: A00987:53:HLKMJDRXX:2:2176:3748:32597
INFO 2020-09-02 21:00:16 MarkDuplicates Tracking 1 as yet unmatched pairs. 1 records in RAM.
INFO 2020-09-02 21:00:19 MarkDuplicates Read 41,000,000 records. Elapsed time: 00:02:36s. Time for last 1,000,000: 2s. Last read position: X:17,334,102. Last read name: A00987:53:HLKMJDRXX:2:2212:5999:33646
INFO 2020-09-02 21:00:19 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:21 MarkDuplicates Read 42,000,000 records. Elapsed time: 00:02:39s. Time for last 1,000,000: 2s. Last read position: I:7,023,334. Last read name: A00987:53:HLKMJDRXX:2:2226:32687:19617
INFO 2020-09-02 21:00:21 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:24 MarkDuplicates Read 43,000,000 records. Elapsed time: 00:02:42s. Time for last 1,000,000: 2s. Last read position: X:3,011,110. Last read name: A00987:53:HLKMJDRXX:2:2242:20537:5995
INFO 2020-09-02 21:00:24 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:28 MarkDuplicates Read 44,000,000 records. Elapsed time: 00:02:46s. Time for last 1,000,000: 4s. Last read position: III:7,428,654. Last read name: A00987:53:HLKMJDRXX:2:2257:18756:14967
INFO 2020-09-02 21:00:28 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:31 MarkDuplicates Read 45,000,000 records. Elapsed time: 00:02:49s. Time for last 1,000,000: 2s. Last read position: */*. Last read name: A00987:53:HLKMJDRXX:2:2271:25672:20682
INFO 2020-09-02 21:00:31 MarkDuplicates Tracking 0 as yet unmatched pairs. 0 records in RAM.
INFO 2020-09-02 21:00:33 MarkDuplicates Read 45528707 records. 0 pairs never matched.
INFO 2020-09-02 21:00:49 MarkDuplicates After buildSortedReadEndLists freeMemory: 2058495896; totalMemory: 2075918336; maxMemory: 2075918336
INFO 2020-09-02 21:00:49 MarkDuplicates Will retain up to 64872448 duplicate indices before spilling to disk.
INFO 2020-09-02 21:00:49 MarkDuplicates Traversing read pair information and detecting duplicates.
INFO 2020-09-02 21:00:55 MarkDuplicates Traversing fragment information and detecting duplicates.
INFO 2020-09-02 21:01:03 MarkDuplicates Sorting list of duplicate records.
INFO 2020-09-02 21:01:03 MarkDuplicates After generateDuplicateIndexes freeMemory: 1539430992; totalMemory: 2075918336; maxMemory: 2075918336
INFO 2020-09-02 21:01:03 MarkDuplicates Marking 914823 records as duplicates.
INFO 2020-09-02 21:01:03 MarkDuplicates Found 89855 optical duplicate clusters.
INFO 2020-09-02 21:01:03 MarkDuplicates Reads are assumed to be ordered by: queryname
[Wed Sep 02 21:01:03 MDT 2020] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 3.36 minutes.
Runtime.totalMemory()=2075918336
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///home/foo/wrappers/alignment/bwa/MTG325.sorted.dedupped.bam. Sort order is queryname. Offending records are at [A00987:53:HLKMJDRXX:1:1101:1018:7326] and [A00987:53:HLKMJDRXX:1:1101:1018:14278]
at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:197)
at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:184)
at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:425)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:305)
at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
Also getting this error from trying to run picard AddOrReplaceReadGroups
: https://github.com/zlskidmore/docker-picard/issues/2
are you sure that the input files are, in fact, queryname sorted? do they pass Picard's ValidateSamFile?
Looks like I am getting this error too, but with MarkIlluminaAdapters using unmapped BAMs.
Looks like I am getting this error too, but with MarkIlluminaAdapters using unmapped BAMs.
I was able to get around this error by renaming my reads like "000000001", "000000002", "000000003", etc.
Old issue, but got here from a google search. I got this error as well, also in MarkDuplicates. I solved with a different workaround. For some reason, lexicographical query sort (samtools sort -N) passes ValidateSamFile and works in MarkDuplicates, but alpha-numeric query sort (samtools sort -n) does not. I liked this workaround a lot better than re-naming the reads. FWIW, in 2020, the lexicographical sort may not even have been implemented yet.
Bug Report
Affected tool(s)
MarkDuplicates
Affected version(s)
Description
With SAM input having records clustered by QNAME (per the MarkDuplicates documentation) but not sorted as strings, duplicate analysis proceeds successfully but the program then halts with a java exception:
Steps to reproduce
This can occur in two straightforward use cases:
@HD SO=unsorted GO=query
.@HD SO=queryname
; the above exception illustrates this.Expected behavior
Program should run to completion
Actual behavior
Program throws a java exception
Additional comments
This problem could probably by hacked out by not doing any sort-order validation when SAM output is generated, but a more principled solution would be to perform only the validation required to support MarkDuplicates' unduplication functionality and to ignore the global sort order of the input (or, at most, issue a warning if an inconsistency were detected).
In the worst case, of course, MarkDuplicates can continue to accept only QNAME-ordered input as generated by the Picard SortSam tool, but a requirement that the input be re-sorted in order to satisfy MarkDuplicates becomes increasingly burdensome as data set sizes increase.