broadinstitute / picard

A set of command line tools (in Java) for manipulating high-throughput sequencing (HTS) data and formats such as SAM/BAM/CRAM and VCF.
https://broadinstitute.github.io/picard/
MIT License

Picard MarkDuplicates with dovetail reads #416

Closed Phillip-a-richmond closed 8 years ago

Phillip-a-richmond commented 8 years ago

Hello, I'm running Picard's MarkDuplicates tool on my paired-end exome sequencing data. The tool is definitely removing duplicate reads (1-4% of the library total), but it is having trouble with reads that are "dovetailed", or where the R1 and R2 overlap.

With paired-end exome sequencing we expect these dovetail cases, but I don't want duplicates among them, because PCR-based errors then show up and look like real variants (supported by reads on both strands, no less).

My understanding is that R1/R2 pairs are removed if the 5'-ends of both reads are the same as another pair in the same file.

So I have 2 questions:

  1. Is that how duplicate removal works?
  2. Is there a reason these aren't being removed?

Attached are some screenshots from IGV to help.

[Screenshots: "screen shot 2016-01-26 at 2 55 05 pm" and "screen shot 2016-01-26 at 2 54 52 pm"]

yfarjoun commented 8 years ago

Can you create a tiny BAM/SAM with the 6 reads that manifest this problem?

  0. MarkDuplicates doesn't "remove" reads (by default); it simply marks them. Are you sure you have your settings right in IGV so that it doesn't show duplicate reads?
  1. Yes (it also looks at orientation information, but that's irrelevant in this case as they are "innies").
  2. No. They should be removed unless there's some clipping I don't see in your screenshot (that's why I want to see the BAM).

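
The rule discussed above (pairs are duplicates when the 5' ends of both reads match, with orientation taken into account) can be sketched roughly as follows. This is a simplified illustration, not Picard's actual implementation: the function names `unclipped_five_prime` and `pair_key` are hypothetical, and details such as library membership and duplicate scoring are left out. It also shows why clipping matters: clipped bases shift the aligned start, so the key is built from the unclipped 5' coordinate.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def unclipped_five_prime(pos, cigar, flag):
    """1-based reference coordinate of the read's 5' end, ignoring clipping."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not flag & 0x10:
        # Forward strand: the 5' end is the leftmost base, so subtract leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the rightmost base, so add the
    # reference span of the alignment and any trailing clips.
    ref_span = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_span - 1 + trail


def pair_key(chrom, read_a, read_b):
    """Illustrative duplicate key for a pair; each read is (pos, cigar, flag)."""
    ends = sorted(
        (unclipped_five_prime(pos, cigar, flag), "-" if flag & 0x10 else "+")
        for pos, cigar, flag in (read_a, read_b)
    )
    return (chrom, tuple(ends))


# Coordinates, CIGARs and flags taken from the reads pasted later in this thread.
key = pair_key("chr13", (60686196, "125M", 163), (60686229, "125M", 83))
print(key)
```

Under this sketch, all three pairs pasted later in the thread share one key, so two of them would be expected to be flagged as duplicates once the reads are recognized as pairs.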
Phillip-a-richmond commented 8 years ago

So there definitely is no soft clipping. I've attached the reads in SAM format. It could be a function of the "innies" or "outies"? The reads that are highlighted in the picture are the ones that exist in the file.

Thanks, -Phil

On Thu, Jan 28, 2016 at 2:42 PM, Phillip Richmond < phillip.a.richmond@gmail.com> wrote:

We actually have the REMOVE_DUPLICATES=true option set.

I'm asking about whether or not I can share a few reads (clinical standards and whatnot). I'd imagine that's the case.

This is my command:

/opt/tools/jdk1.7.0_79/bin/java -jar /opt/tools/picard-tools-1.139/picard.jar MarkDuplicates I=$WORKING_DIR$SAMPLE_ID'_bowtie2.sorted.bam' O=$WORKING_DIR$SAMPLE_ID'_bowtie2_dupremoved.sorted.bam' REMOVE_DUPLICATES=true M=$WORKING_DIR$SAMPLE_ID'_bowtie2_DuplicateResults.txt'

This is the output metrics file:

htsjdk.samtools.metrics.StringHeader

picard.sam.markduplicates.MarkDuplicates

INPUT=[/mnt/data/Process/G017/G017-1_bowtie2.sorted.bam] OUTPUT=/mnt/data/Process/G017/G017-1_bowtie2_dupremoved.sorted.bam METRICS_FILE=/mnt/data/Process/G017/G017-1_bowtie2_DuplicateResults.txt REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json

htsjdk.samtools.metrics.StringHeader

Started on: Thu Dec 03 03:50:21 PST 2015

METRICS CLASS picard.sam.DuplicationMetrics

LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
0 0 0 0 0 0 ?
Unknown Library 403550 34602694 0 166028 0 0 0.002385
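
As a quick sanity check on the metrics above, the reported PERCENT_DUPLICATION can be reproduced from the other columns. The sketch below assumes the usual definition (duplicate reads divided by examined reads, with each pair counted as two reads); the function name `percent_duplication` is hypothetical, and the column assignment assumes the "Unknown Library" row lines up with the header in order.

```python
def percent_duplication(unpaired_examined, pairs_examined, unpaired_dups, pair_dups):
    """Fraction of examined reads flagged as duplicates; a pair counts as two reads."""
    examined = unpaired_examined + 2 * pairs_examined
    duplicates = unpaired_dups + 2 * pair_dups
    return duplicates / examined


# Values from the "Unknown Library" row of the metrics file.
rate = percent_duplication(
    unpaired_examined=403550,
    pairs_examined=34602694,
    unpaired_dups=166028,
    pair_dups=0,
)
print(f"{rate:.6f}")
```

With these inputs the result rounds to the reported 0.002385. Note that READ_PAIR_DUPLICATES is 0 in this row, so every duplicate counted came from the unpaired column, which is at least consistent with read pairs not being matched up as duplicates.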

-Phil

yfarjoun commented 8 years ago

GitHub doesn't take attachments. Can you post this online somewhere (Dropbox, a gist, etc.)?

Phillip-a-richmond commented 8 years ago

It's a short file, I can just paste it here:

@HD VN:1.5 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571
@SQ SN:chr1_gl000191_random LN:106433
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chr4_gl000193_random LN:189789
@SQ SN:chr4_gl000194_random LN:191469
@SQ SN:chr7_gl000195_random LN:182896
@SQ SN:chr8_gl000196_random LN:38914
@SQ SN:chr8_gl000197_random LN:37175
@SQ SN:chr9_gl000198_random LN:90085
@SQ SN:chr9_gl000199_random LN:169874
@SQ SN:chr9_gl000200_random LN:187035
@SQ SN:chr9_gl000201_random LN:36148
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr17_gl000203_random LN:37498
@SQ SN:chr17_gl000204_random LN:81310
@SQ SN:chr17_gl000205_random LN:174588
@SQ SN:chr17_gl000206_random LN:41001
@SQ SN:chr18_gl000207_random LN:4262
@SQ SN:chr19_gl000208_random LN:92689
@SQ SN:chr19_gl000209_random LN:159169
@SQ SN:chr21_gl000210_random LN:27682
@SQ SN:chrUn_gl000211 LN:166566
@SQ SN:chrUn_gl000212 LN:186858
@SQ SN:chrUn_gl000213 LN:164239
@SQ SN:chrUn_gl000214 LN:137718
@SQ SN:chrUn_gl000215 LN:172545
@SQ SN:chrUn_gl000216 LN:172294
@SQ SN:chrUn_gl000217 LN:172149
@SQ SN:chrUn_gl000218 LN:161147
@SQ SN:chrUn_gl000219 LN:179198
@SQ SN:chrUn_gl000220 LN:161802
@SQ SN:chrUn_gl000221 LN:155397
@SQ SN:chrUn_gl000222 LN:186861
@SQ SN:chrUn_gl000223 LN:180455
@SQ SN:chrUn_gl000224 LN:179693
@SQ SN:chrUn_gl000225 LN:211173
@SQ SN:chrUn_gl000226 LN:15008
@SQ SN:chrUn_gl000227 LN:128374
@SQ SN:chrUn_gl000228 LN:129120
@SQ SN:chrUn_gl000229 LN:19913
@SQ SN:chrUn_gl000230 LN:43691
@SQ SN:chrUn_gl000231 LN:27386
@SQ SN:chrUn_gl000232 LN:40652
@SQ SN:chrUn_gl000233 LN:45941
@SQ SN:chrUn_gl000234 LN:40531
@SQ SN:chrUn_gl000235 LN:34474
@SQ SN:chrUn_gl000236 LN:41934
@SQ SN:chrUn_gl000237 LN:45867
@SQ SN:chrUn_gl000238 LN:39939
@SQ SN:chrUn_gl000239 LN:33824
@SQ SN:chrUn_gl000240 LN:41933
@SQ SN:chrUn_gl000241 LN:42152
@SQ SN:chrUn_gl000242 LN:43523
@SQ SN:chrUn_gl000243 LN:43341
@SQ SN:chrUn_gl000244 LN:39929
@SQ SN:chrUn_gl000245 LN:36651
@SQ SN:chrUn_gl000246 LN:38154
@SQ SN:chrUn_gl000247 LN:36422
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrUn_gl000249 LN:38502
@RG ID:G0-1 SM:G0-1
@PG ID:MarkDuplicates VN:1.139(8ceee52414e8ab9d13e350ff9cd86d48825dd64d_1442240108) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/mnt/data/Process/G0/G0-1_bowtie2.sorted.bam] OUTPUT=/mnt/data/Process/G0/G0-1_bowtie2_dupremoved.sorted.bam METRICS_FILE=/mnt/data/Process/G0/G0-1_bowtie2_DuplicateResults.txt REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates
@PG ID:bowtie2 PN:bowtie2 VN:2.2.6 CL:"/opt/tools/bowtie2-2.2.6/bowtie2-align-s --wrapper basic-0 -x /mnt/data/GENOMES/hg19/hg19 -S /mnt/data/Process/G0/G0-1_bowtie2.sam -p 16 --very-sensitive -X 1000 --met-stderr --rg-id G0-1 --rg SM:G021-1 -1 /mnt/data/Process/G0/G-0-1_R1_chastitypassed.fastq -2 /mnt/data/Process/G0/G-0-1_R2_chastitypassed.fastq"
@PG ID:GATK IndelRealigner VN:3.4-46-gbc02625 CL:knownAlleles=[] targetIntervals=/mnt/data/Process/G021/G021-1_bowtie2_indelsites.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
HS22_154:7:1310:19496:8739/2 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCBCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGFEGGGGGGFGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGFG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427/2 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGFGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168/2 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG BBBB?DCDGGGGGGGGGGG>FFCGGGGFC@CCG>EGGG@GGGGGGGGGGGGGGGGGGGG1FGGGGEDG01CFGEGGGGGGGGGGGGGGGGGGGFGGGGFEFGGEGGGGGGGGBG0FGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1310:19496:8739/1 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGGGFBEGGGGGGGGGGEGEGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGGGGGGGGGGCGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCB MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427/1 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168/1 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGCGGGGEGEGGBD0GF0GGCGEGF:@GGGGF@<1GGGF>F=FGGGGGGGGGGFGGGGGGEGGGGGGEGFDEGGFFFGGGGGGGGGDGGGEGGGGGGGGGGGEGGGGGGEGGGGGGGGGBCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP

yfarjoun commented 8 years ago

Hi @Phillip-a-richmond

I took a look at your SAM records. First of all, the SAM file doesn't validate (using ValidateSamFile) for two reasons:

  1. The @RG field is incorrect.
  2. The read names have a /1 or /2 at their end, and so they are all "unpaired", though marked as paired.
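
To illustrate the second point: mates are matched by read name, so names that differ only by a trailing /1 or /2 never pair up. The sketch below groups SAM records by name with and without stripping that suffix. It is an illustration only; `strip_mate_suffix` and `group_mates` are hypothetical helpers, not part of Picard or htsjdk, and the cleaner fix is to have the aligner emit matching names in the first place.

```python
import re
from collections import defaultdict

# A trailing "/1" or "/2" on a read name.
MATE_SUFFIX = re.compile(r"/[12]$")


def strip_mate_suffix(qname):
    """Remove a trailing /1 or /2 so both mates share one read name."""
    return MATE_SUFFIX.sub("", qname)


def group_mates(sam_lines, normalize=lambda name: name):
    """Map each (optionally normalized) read name to the flags of its records."""
    groups = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line, not an alignment record
        fields = line.split("\t")
        groups[normalize(fields[0])].append(int(fields[1]))
    return dict(groups)


# Two mates from this thread, truncated to the first four SAM columns.
records = [
    "HS22_154:7:1310:19496:8739/2\t163\tchr13\t60686196",
    "HS22_154:7:1310:19496:8739/1\t83\tchr13\t60686229",
]

as_is = group_mates(records)                        # two names, one record each
fixed = group_mates(records, strip_mate_suffix)     # one name, two records
print(len(as_is), len(fixed))
```

With the suffixes left in place each name has a single record, which matches the "unpaired, though marked as paired" diagnosis; after stripping, the two mates fall under one name.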

After fixing both of these issues manually on the records you gave me, things look fine. Below please find the output of MarkDuplicates on the manually doctored SAM:

@HD VN:1.5  GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10    LN:135534747
@SQ SN:chr11    LN:135006516
@SQ SN:chr12    LN:133851895
@SQ SN:chr13    LN:115169878
@SQ SN:chr14    LN:107349540
@SQ SN:chr15    LN:102531392
@SQ SN:chr16    LN:90354753
@SQ SN:chr17    LN:81195210
@SQ SN:chr18    LN:78077248
@SQ SN:chr19    LN:59128983
@SQ SN:chr20    LN:63025520
@SQ SN:chr21    LN:48129895
@SQ SN:chr22    LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571
@SQ SN:chr1_gl000191_random LN:106433
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chr4_gl000193_random LN:189789
@SQ SN:chr4_gl000194_random LN:191469
@SQ SN:chr7_gl000195_random LN:182896
@SQ SN:chr8_gl000196_random LN:38914
@SQ SN:chr8_gl000197_random LN:37175
@SQ SN:chr9_gl000198_random LN:90085
@SQ SN:chr9_gl000199_random LN:169874
@SQ SN:chr9_gl000200_random LN:187035
@SQ SN:chr9_gl000201_random LN:36148
@SQ SN:chr11_gl000202_random    LN:40103
@SQ SN:chr17_gl000203_random    LN:37498
@SQ SN:chr17_gl000204_random    LN:81310
@SQ SN:chr17_gl000205_random    LN:174588
@SQ SN:chr17_gl000206_random    LN:41001
@SQ SN:chr18_gl000207_random    LN:4262
@SQ SN:chr19_gl000208_random    LN:92689
@SQ SN:chr19_gl000209_random    LN:159169
@SQ SN:chr21_gl000210_random    LN:27682
@SQ SN:chrUn_gl000211   LN:166566
@SQ SN:chrUn_gl000212   LN:186858
@SQ SN:chrUn_gl000213   LN:164239
@SQ SN:chrUn_gl000214   LN:137718
@SQ SN:chrUn_gl000215   LN:172545
@SQ SN:chrUn_gl000216   LN:172294
@SQ SN:chrUn_gl000217   LN:172149
@SQ SN:chrUn_gl000218   LN:161147
@SQ SN:chrUn_gl000219   LN:179198
@SQ SN:chrUn_gl000220   LN:161802
@SQ SN:chrUn_gl000221   LN:155397
@SQ SN:chrUn_gl000222   LN:186861
@SQ SN:chrUn_gl000223   LN:180455
@SQ SN:chrUn_gl000224   LN:179693
@SQ SN:chrUn_gl000225   LN:211173
@SQ SN:chrUn_gl000226   LN:15008
@SQ SN:chrUn_gl000227   LN:128374
@SQ SN:chrUn_gl000228   LN:129120
@SQ SN:chrUn_gl000229   LN:19913
@SQ SN:chrUn_gl000230   LN:43691
@SQ SN:chrUn_gl000231   LN:27386
@SQ SN:chrUn_gl000232   LN:40652
@SQ SN:chrUn_gl000233   LN:45941
@SQ SN:chrUn_gl000234   LN:40531
@SQ SN:chrUn_gl000235   LN:34474
@SQ SN:chrUn_gl000236   LN:41934
@SQ SN:chrUn_gl000237   LN:45867
@SQ SN:chrUn_gl000238   LN:39939
@SQ SN:chrUn_gl000239   LN:33824
@SQ SN:chrUn_gl000240   LN:41933
@SQ SN:chrUn_gl000241   LN:42152
@SQ SN:chrUn_gl000242   LN:43523
@SQ SN:chrUn_gl000243   LN:43341
@SQ SN:chrUn_gl000244   LN:39929
@SQ SN:chrUn_gl000245   LN:36651
@SQ SN:chrUn_gl000246   LN:38154
@SQ SN:chrUn_gl000247   LN:36422
@SQ SN:chrUn_gl000248   LN:39786
@SQ SN:chrUn_gl000249   LN:38502
@RG ID:G0-1 SM:G0-1 PL:illumina
@RG ID:G021-1   SM:G0-1 PL:illumina
@PG ID:MarkDuplicates VN:1.139(8ceee52414e8ab9d13e350ff9cd86d48825dd64d_1442240108) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/mnt/data/Process/G0/G0-1_bowtie2.sorted.bam] OUTPUT=/mnt/data/Process/G0/G0-1_bowtie2_dupremoved.sorted.bam METRICS_FILE=/mnt/data/Process/G0/G0-1_bowtie2_DuplicateResults.txt REMOVE_DUPLICATES=true    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json    PN:MarkDuplicates
@PG ID:bowtie2    PN:bowtie2    VN:2.2.6 CL:"/opt/tools/bowtie2-2.2.6/bowtie2-align-s --wrapper basic-0 -x /mnt/data/GENOMES/hg19/hg19 -S /mnt/data/Process/G0/G0-1_bowtie2.sam -p 16 --very-sensitive -X 1000 --met-stderr --rg-id G0-1 --rg SM:G021-1 -1 /mnt/data/Process/G0/G-0-1_R1_chastitypassed.fastq -2 /mnt/data/Process/G0/G-0-1_R2_chastitypassed.fastq"
@PG ID:GATK IndelRealigner    VN:3.4-46-gbc02625    CL:knownAlleles=[] targetIntervals=/mnt/data/Process/G021/G021-1_bowtie2_indelsites.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
@PG ID:MarkDuplicates   VN:2.0.1(524567f601de8e6274b322f6fbc6fd4daef218cc_1453655240)   CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/Users/farjoun/Documents/test_416.bam] OUTPUT=/Users/farjoun/Documents/test_416.dups_marked.sam METRICS_FILE=/Users/farjoun/Documents/test_416.dups_metrics.txt    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json   PN:MarkDuplicates   PP:MarkDuplicates
HS22_154:7:1310:19496:8739  1187    chr13   60686196    42  125M    =   60686229    158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG   CCBCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGFEGGGGGGFGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGFG   MD:Z:56C68  PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427 163 chr13   60686196    42  125M    =   60686229    158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG   CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGFGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG   MD:Z:56C68  PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168 1187    chr13   60686196    42  125M    =   60686229    158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG   BBBB?DCDGGGGGGGGGGG>FFCGGGGFC@CCG>EGGG@GGGGGGGGGGGGGGGGGGGG1FGGGGEDG01CFGEGGGGGGGGGGGGGGGGGGGFGGGGFEFGGEGGGGGGGGBG0FGGGGGGGGG   MD:Z:56C68  PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1310:19496:8739  1107    chr13   60686229    42  125M    =   60686196    -158    CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG   GGGGGFBEGGGGGGGGGGEGEGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGGGGGGGGGGCGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCB   MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427 83  chr13   60686229    42  125M    =   60686196    -158    CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG   GCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC   MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168 1107    chr13   60686229    42  125M    =   60686196    -158    CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG   GGGCGGGGEGEGGBD0GF0GGCGEGF:@GGGGF@<1GGGF>F=FGGGGGGGGGGFGGGGGGEGGGGGGEGFDEGGFFFGGGGGGGGGDGGGEGGGGGGGGGGGEGGGGGGEGGGGGGGGGBCCCC   MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0  NM:i:1  XM:i:1  XN:i:0  XO:i:0  AS:i:-5 YS:i:-5 YT:Z:C
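
The effect of MarkDuplicates in the output above is visible in the FLAG column: flags 163 and 83 become 1187 and 1107 for the reads marked as duplicates, a difference of 1024 (0x400, the duplicate bit in the SAM specification). A minimal sketch for decoding these values follows; the helper names `decode_flag` and `is_duplicate` and the short bit labels are made up for this example.

```python
# SAM FLAG bits as defined in the SAM specification; labels are informal.
SAM_FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in SAM_FLAG_BITS.items() if flag & bit]


def is_duplicate(flag):
    """True when the duplicate bit (0x400) is set."""
    return bool(flag & 0x400)


for flag in (163, 1187, 83, 1107):
    print(flag, is_duplicate(flag), decode_flag(flag))
```

In the doctored output, one pair keeps flags 163/83 and the other two pairs carry 1187/1107, that is, one representative pair is kept and the other two are marked as duplicates.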

Phillip-a-richmond commented 8 years ago

The @RG line you mentioned was altered manually by me before I sent it to you, so that's not the problem. As for the /1 /2 problem: is that something I need to fix manually in my raw FASTQ files before mapping the reads? I'm using -1 and -2 in my bowtie2 command, so are they being mapped as pairs but not recognized as pairs?

Thanks, Phil

yfarjoun commented 8 years ago

Bowtie 2's release notes (http://bowtie-bio.sourceforge.net/bowtie2/news.shtml) indicate that "When input reads are unpaired, Bowtie 2 no longer removes the trailing /1 or /2 from the read name." So you are probably invoking Bowtie 2 in a way that it doesn't realize the reads are paired.

Good luck!

On Sun, Jan 31, 2016 at 6:59 PM, Phillip Richmond notifications@github.com wrote:

So the @RG line you mentioned was altered manually by me before sending to you, so that's not the problem, but the /1 /2 problem, is that something that I need to fix manually before mapping the reads on my raw fastq file? I mean I'm using the -1 -2 in my bowtie2 command. So they're being mapped as pairs but not recognized as pairs?

Thanks, Phil

On Sun, Jan 31, 2016 at 3:43 PM, Yossi Farjoun notifications@github.com wrote:

Hi @Phillip-a-richmond https://github.com/Phillip-a-richmond

I took a look at your sam records. First of all, the SAM file doesn't validate (using ValidateSamFile) for two reasons:

  1. The @RG https://github.com/RG fieid is incorrect
  2. The readnames have a /1 /2 at their end, and so they are all "unpaired", though marked as paired.

After fixing both of these issues manually on the records you gave me, things looks fine. Below please find the output of MarkDuplictes on the manually doctored SAM:

@HD VN:1.5 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571
@SQ SN:chr1_gl000191_random LN:106433
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chr4_gl000193_random LN:189789
@SQ SN:chr4_gl000194_random LN:191469
@SQ SN:chr7_gl000195_random LN:182896
@SQ SN:chr8_gl000196_random LN:38914
@SQ SN:chr8_gl000197_random LN:37175
@SQ SN:chr9_gl000198_random LN:90085
@SQ SN:chr9_gl000199_random LN:169874
@SQ SN:chr9_gl000200_random LN:187035
@SQ SN:chr9_gl000201_random LN:36148
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr17_gl000203_random LN:37498
@SQ SN:chr17_gl000204_random LN:81310
@SQ SN:chr17_gl000205_random LN:174588
@SQ SN:chr17_gl000206_random LN:41001
@SQ SN:chr18_gl000207_random LN:4262
@SQ SN:chr19_gl000208_random LN:92689
@SQ SN:chr19_gl000209_random LN:159169
@SQ SN:chr21_gl000210_random LN:27682
@SQ SN:chrUn_gl000211 LN:166566
@SQ SN:chrUn_gl000212 LN:186858
@SQ SN:chrUn_gl000213 LN:164239
@SQ SN:chrUn_gl000214 LN:137718
@SQ SN:chrUn_gl000215 LN:172545
@SQ SN:chrUn_gl000216 LN:172294
@SQ SN:chrUn_gl000217 LN:172149
@SQ SN:chrUn_gl000218 LN:161147
@SQ SN:chrUn_gl000219 LN:179198
@SQ SN:chrUn_gl000220 LN:161802
@SQ SN:chrUn_gl000221 LN:155397
@SQ SN:chrUn_gl000222 LN:186861
@SQ SN:chrUn_gl000223 LN:180455
@SQ SN:chrUn_gl000224 LN:179693
@SQ SN:chrUn_gl000225 LN:211173
@SQ SN:chrUn_gl000226 LN:15008
@SQ SN:chrUn_gl000227 LN:128374
@SQ SN:chrUn_gl000228 LN:129120
@SQ SN:chrUn_gl000229 LN:19913
@SQ SN:chrUn_gl000230 LN:43691
@SQ SN:chrUn_gl000231 LN:27386
@SQ SN:chrUn_gl000232 LN:40652
@SQ SN:chrUn_gl000233 LN:45941
@SQ SN:chrUn_gl000234 LN:40531
@SQ SN:chrUn_gl000235 LN:34474
@SQ SN:chrUn_gl000236 LN:41934
@SQ SN:chrUn_gl000237 LN:45867
@SQ SN:chrUn_gl000238 LN:39939
@SQ SN:chrUn_gl000239 LN:33824
@SQ SN:chrUn_gl000240 LN:41933
@SQ SN:chrUn_gl000241 LN:42152
@SQ SN:chrUn_gl000242 LN:43523
@SQ SN:chrUn_gl000243 LN:43341
@SQ SN:chrUn_gl000244 LN:39929
@SQ SN:chrUn_gl000245 LN:36651
@SQ SN:chrUn_gl000246 LN:38154
@SQ SN:chrUn_gl000247 LN:36422
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrUn_gl000249 LN:38502
@RG ID:G0-1 SM:G0-1 PL:illumina
@RG ID:G021-1 SM:G0-1 PL:illumina
@PG ID:MarkDuplicates VN:1.139(8ceee52414e8ab9d13e350ff9cd86d48825dd64d_1442240108) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/mnt/data/Process/G0/G0-1_bowtie2.sorted.bam] OUTPUT=/mnt/data/Process/G0/G0-1_bowtie2_dupremoved.sorted.bam METRICS_FILE=/mnt/data/Process/G0/G0-1_bowtie2_DuplicateResults.txt REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates
@PG ID:bowtie2 PN:bowtie2 VN:2.2.6 CL:"/opt/tools/bowtie2-2.2.6/bowtie2-align-s --wrapper basic-0 -x /mnt/data/GENOMES/hg19/hg19 -S /mnt/data/Process/G0/G0-1_bowtie2.sam -p 16 --very-sensitive -X 1000 --met-stderr --rg-id G0-1 --rg SM:G021-1 -1 /mnt/data/Process/G0/G-0-1_R1_chastitypassed.fastq -2 /mnt/data/Process/G0/G-0-1_R2_chastitypassed.fastq"
@PG ID:GATK IndelRealigner VN:3.4-46-gbc02625 CL:knownAlleles=[] targetIntervals=/mnt/data/Process/G021/G021-1_bowtie2_indelsites.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
@PG ID:MarkDuplicates VN:2.0.1(524567f601de8e6274b322f6fbc6fd4daef218cc_1453655240) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/Users/farjoun/Documents/test_416.bam] OUTPUT=/Users/farjoun/Documents/test_416.dups_marked.sam METRICS_FILE=/Users/farjoun/Documents/test_416.dups_metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates PP:MarkDuplicates
HS22_154:7:1310:19496:8739 1187 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCBCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGFEGGGGGGFGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGFG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGFGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:COMPRESSION_LEVEL
HS22_154:7:2309:20164:53168 1187 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG BBBB?DCDGGGGGGGGGGG>FFCGGGGFC@CCG EGGG@GGGGGGGGGGGGGGGGGGGG1FGGGGEDG01CFGEGGGGGGGGGGGGGGGGGGGFGGGGFEFGGEGGGGGGGGBG0FGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1310:19496:8739 1107 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGGGFBEGGGGGGGGGGEGEGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGGGGGGGGGGCGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCB MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168 1107 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGCGGGGEGEGGBD0GF0GGCGEGF:@GGGGF@<1GGGF>F=FGGGGGGGGGGFGGGGGGEGGGGGGEGFDEGGFFFGGGGGGGGGDGGGEGGGGGGGGGGGEGGGGGGEGGGGGGGGGBCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:C
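
To check the result in the records above, look at the FLAG column (the second field): MarkDuplicates sets the 0x400 bit on reads it considers duplicates. Here is a minimal sketch in Python that decodes those flags, assuming the standard FLAG bit definitions from the SAM specification. The helper names `decode_flag` and `is_duplicate` and the short bit labels are illustrative only and are not part of Picard.

```python
# SAM FLAG bits as defined in the SAM specification.
# The short labels are illustrative names chosen for this sketch.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def is_duplicate(flag):
    """True if the duplicate bit (0x400) is set."""
    return bool(flag & 0x400)


# (read name, FLAG) pairs copied from the six records in the output above.
records = [
    ("HS22_154:7:1310:19496:8739", 1187),
    ("HS22_154:7:1312:14938:85427", 163),
    ("HS22_154:7:2309:20164:53168", 1187),
    ("HS22_154:7:1310:19496:8739", 1107),
    ("HS22_154:7:1312:14938:85427", 83),
    ("HS22_154:7:2309:20164:53168", 1107),
]

for name, flag in records:
    status = "duplicate" if is_duplicate(flag) else "not marked"
    print(name, flag, status, ",".join(decode_flag(flag)))
```

Read this way, both reads of pairs 1310:19496:8739 and 2309:20164:53168 carry the duplicate bit (1187 = 163 + 1024, 1107 = 83 + 1024), while pair 1312:14938:85427 (flags 163 and 83) is left unmarked.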
