Closed Phillip-a-richmond closed 8 years ago
Can you create a tiny bam/sam, with the 6 reads that manifests this problem?
- MarkDuplicates doesn't "remove" reads (by default), it simply marks the reads. Are you sure you have your settings right in IGV so that it doesn't show duplicate reads?
So there definitely is no soft clipping. I've attached the reads in SAM format. It could be a function of the "innies" or "outies"? The reads that are highlighted in the picture are the ones that exist in the file.
Thanks, -Phil
On Thu, Jan 28, 2016 at 2:42 PM, Phillip Richmond < phillip.a.richmond@gmail.com> wrote:
We actually have the REMOVE_DUPLICATES=true option set.
I'm asking about whether or not I can share a few reads (clinical standards and whatnot). I'd imagine that's the case.
This is my command:
/opt/tools/jdk1.7.0_79/bin/java -jar /opt/tools/picard-tools-1.139/picard.jar MarkDuplicates I=$WORKING_DIR$SAMPLE_ID'_bowtie2.sorted.bam' O=$WORKING_DIR$SAMPLE_ID'_bowtie2_dupremoved.sorted.bam' REMOVE_DUPLICATES=true M=$WORKING_DIR$SAMPLE_ID'_bowtie2_DuplicateResults.txt'
This is the output metrics file:
htsjdk.samtools.metrics.StringHeader
picard.sam.markduplicates.MarkDuplicates
INPUT=[/mnt/data/Process/G017/G017-1_bowtie2.sorted.bam] OUTPUT=/mnt/data/Process/G017/G017-1_bowtie2_dupremoved.sorted.bam METRICS_FILE=/mnt/data/Process/G017/G017-1_bowtie2_DuplicateResults.txt REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
htsjdk.samtools.metrics.StringHeader
Started on: Thu Dec 03 03:50:21 PST 2015
METRICS CLASS picard.sam.DuplicationMetrics
LIBRARY UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED UNMAPPED_READS UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES READ_PAIR_OPTICAL_DUPLICATES PERCENT_DUPLICATION ESTIMATED_LIBRARY_SIZE
0 0 0 0 0 0 ?
Unknown Library 403550 34602694 0 166028 0 0 0.002385
-Phil
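The PERCENT_DUPLICATION value in the metrics above can be checked by hand. A minimal sketch, assuming the definition in Picard's DuplicationMetrics documentation (duplicate reads divided by examined reads, with each pair counted as two reads); the function name is made up for illustration:

```python
def percent_duplication(unpaired_examined, pairs_examined,
                        unpaired_dups, pair_dups):
    """Fraction of examined reads flagged as duplicates.

    Each read pair contributes two reads to both numerator and denominator.
    """
    examined = unpaired_examined + 2 * pairs_examined
    duplicates = unpaired_dups + 2 * pair_dups
    return duplicates / examined if examined else 0.0


# Values copied from the metrics row above.
value = percent_duplication(
    unpaired_examined=403550,
    pairs_examined=34602694,
    unpaired_dups=166028,
    pair_dups=0,
)
print(f"{value:.6f}")
```

The result agrees with the reported 0.002385. Note that READ_PAIR_DUPLICATES is 0 in this run, so every read counted as a duplicate came from the unpaired category.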
On Tue, Jan 26, 2016 at 6:35 PM, Yossi Farjoun notifications@github.com wrote:
Can you create a tiny bam/sam, with the 6 reads that manifests this problem?
- MarkDuplicates doesn't "remove" reads (by default), it simply marks the reads. Are you sure you have your settings right in IGV so that it doesn't show duplicate reads?
- Yes (though also looks at orientation information, though that's irrelevant in this case as they are "innies")
- No. They should be removed unless there's some clipping I don't see in your screenshot (that's why I want to see the bam)
— Reply to this email directly or view it on GitHub https://github.com/broadinstitute/picard/issues/416#issuecomment-175354177 .
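The exchange above turns on what makes two read pairs duplicates of one another. A rough sketch, assuming a simplified rule (same reference, same unclipped 5' coordinate for both ends, same orientations). This is not Picard's code, it ignores library and other details, and `five_prime_unclipped` and `pair_key` are hypothetical helper names. The coordinates are those of the reads pasted later in this thread:

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def five_prime_unclipped(pos, cigar, reverse):
    """Unclipped 5' coordinate (1-based) of one aligned read."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    if not reverse:
        # Forward strand: the 5' end is POS, moved left by any leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: the 5' end is the alignment end, moved right by trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail


def pair_key(chrom, pos1, cigar1, rev1, pos2, cigar2, rev2):
    """Simplified grouping key: pairs with equal keys are duplicate candidates."""
    return (chrom,
            five_prime_unclipped(pos1, cigar1, rev1), rev1,
            five_prime_unclipped(pos2, cigar2, rev2), rev2)


# Three pairs from the thread: forward mate at 60686196, reverse mate at 60686229.
names = ["HS22_154:7:1310:19496:8739",
         "HS22_154:7:1312:14938:85427",
         "HS22_154:7:2309:20164:53168"]
keys = {n: pair_key("chr13", 60686196, "125M", False, 60686229, "125M", True)
        for n in names}
print(len(set(keys.values())))  # one shared key means one duplicate set
```

Soft clipping matters because it shifts the unclipped 5' coordinate, which is why the reply asks to see the actual records rather than a screenshot.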
github doesn't take attachments. can you post this online somewhere? drop-box, gist, etc?
It's a short file, I can just paste it here:
@HD VN:1.5 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571
@SQ SN:chr1_gl000191_random LN:106433
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chr4_gl000193_random LN:189789
@SQ SN:chr4_gl000194_random LN:191469
@SQ SN:chr7_gl000195_random LN:182896
@SQ SN:chr8_gl000196_random LN:38914
@SQ SN:chr8_gl000197_random LN:37175
@SQ SN:chr9_gl000198_random LN:90085
@SQ SN:chr9_gl000199_random LN:169874
@SQ SN:chr9_gl000200_random LN:187035
@SQ SN:chr9_gl000201_random LN:36148
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr17_gl000203_random LN:37498
@SQ SN:chr17_gl000204_random LN:81310
@SQ SN:chr17_gl000205_random LN:174588
@SQ SN:chr17_gl000206_random LN:41001
@SQ SN:chr18_gl000207_random LN:4262
@SQ SN:chr19_gl000208_random LN:92689
@SQ SN:chr19_gl000209_random LN:159169
@SQ SN:chr21_gl000210_random LN:27682
@SQ SN:chrUn_gl000211 LN:166566
@SQ SN:chrUn_gl000212 LN:186858
@SQ SN:chrUn_gl000213 LN:164239
@SQ SN:chrUn_gl000214 LN:137718
@SQ SN:chrUn_gl000215 LN:172545
@SQ SN:chrUn_gl000216 LN:172294
@SQ SN:chrUn_gl000217 LN:172149
@SQ SN:chrUn_gl000218 LN:161147
@SQ SN:chrUn_gl000219 LN:179198
@SQ SN:chrUn_gl000220 LN:161802
@SQ SN:chrUn_gl000221 LN:155397
@SQ SN:chrUn_gl000222 LN:186861
@SQ SN:chrUn_gl000223 LN:180455
@SQ SN:chrUn_gl000224 LN:179693
@SQ SN:chrUn_gl000225 LN:211173
@SQ SN:chrUn_gl000226 LN:15008
@SQ SN:chrUn_gl000227 LN:128374
@SQ SN:chrUn_gl000228 LN:129120
@SQ SN:chrUn_gl000229 LN:19913
@SQ SN:chrUn_gl000230 LN:43691
@SQ SN:chrUn_gl000231 LN:27386
@SQ SN:chrUn_gl000232 LN:40652
@SQ SN:chrUn_gl000233 LN:45941
@SQ SN:chrUn_gl000234 LN:40531
@SQ SN:chrUn_gl000235 LN:34474
@SQ SN:chrUn_gl000236 LN:41934
@SQ SN:chrUn_gl000237 LN:45867
@SQ SN:chrUn_gl000238 LN:39939
@SQ SN:chrUn_gl000239 LN:33824
@SQ SN:chrUn_gl000240 LN:41933
@SQ SN:chrUn_gl000241 LN:42152
@SQ SN:chrUn_gl000242 LN:43523
@SQ SN:chrUn_gl000243 LN:43341
@SQ SN:chrUn_gl000244 LN:39929
@SQ SN:chrUn_gl000245 LN:36651
@SQ SN:chrUn_gl000246 LN:38154
@SQ SN:chrUn_gl000247 LN:36422
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrUn_gl000249 LN:38502
@RG ID:G0-1 SM:G0-1
@PG ID:MarkDuplicates VN:1.139(8ceee52414e8ab9d13e350ff9cd86d48825dd64d_1442240108) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/mnt/data/Process/G0/G0-1_bowtie2.sorted.bam] OUTPUT=/mnt/data/Process/G0/G0-1_bowtie2_dupremoved.sorted.bam METRICS_FILE=/mnt/data/Process/G0/G0-1_bowtie2_DuplicateResults.txt REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates
@PG ID:bowtie2 PN:bowtie2 VN:2.2.6 CL:"/opt/tools/bowtie2-2.2.6/bowtie2-align-s --wrapper basic-0 -x /mnt/data/GENOMES/hg19/hg19 -S /mnt/data/Process/G0/G0-1_bowtie2.sam -p 16 --very-sensitive -X 1000 --met-stderr --rg-id G0-1 --rg SM:G021-1 -1 /mnt/data/Process/G0/G-0-1_R1_chastitypassed.fastq -2 /mnt/data/Process/G0/G-0-1_R2_chastitypassed.fastq"
@PG ID:GATK IndelRealigner VN:3.4-46-gbc02625 CL:knownAlleles=[] targetIntervals=/mnt/data/Process/G021/G021-1_bowtie2_indelsites.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
HS22_154:7:1310:19496:8739/2 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCBCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGFEGGGGGGFGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGFG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427/2 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGFGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168/2 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG BBBB?DCDGGGGGGGGGGG>FFCGGGGFC@CCG>EGGG@GGGGGGGGGGGGGGGGGGGG1FGGGGEDG01CFGEGGGGGGGGGGGGGGGGGGGFGGGGFEFGGEGGGGGGGGBG0FGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1310:19496:8739/1 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGGGFBEGGGGGGGGGGEGEGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGGGGGGGGGGCGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCB MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427/1 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168/1 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGCGGGGEGEGGBD0GF0GGCGEGF:@GGGGF@<1GGGF>F=FGGGGGGGGGGFGGGGGGEGGGGGGEGFDEGGFFFGGGGGGGGGDGGGEGGGGGGGGGGGEGGGGGGEGGGGGGGGGBCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
Hi @Phillip-a-richmond
I took a look at your sam records. First of all, the SAM file doesn't validate (using ValidateSamFile) for two reasons:
- The @RG field is incorrect
- The readnames have a /1 /2 at their end, and so they are all "unpaired", though marked as paired.
After fixing both of these issues manually on the records you gave me, things look fine. Below please find the output of MarkDuplicates on the manually doctored SAM:
@HD VN:1.5 GO:none SO:coordinate
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
@SQ SN:chr4 LN:191154276
@SQ SN:chr5 LN:180915260
@SQ SN:chr6 LN:171115067
@SQ SN:chr7 LN:159138663
@SQ SN:chr8 LN:146364022
@SQ SN:chr9 LN:141213431
@SQ SN:chr10 LN:135534747
@SQ SN:chr11 LN:135006516
@SQ SN:chr12 LN:133851895
@SQ SN:chr13 LN:115169878
@SQ SN:chr14 LN:107349540
@SQ SN:chr15 LN:102531392
@SQ SN:chr16 LN:90354753
@SQ SN:chr17 LN:81195210
@SQ SN:chr18 LN:78077248
@SQ SN:chr19 LN:59128983
@SQ SN:chr20 LN:63025520
@SQ SN:chr21 LN:48129895
@SQ SN:chr22 LN:51304566
@SQ SN:chrX LN:155270560
@SQ SN:chrY LN:59373566
@SQ SN:chrM LN:16571
@SQ SN:chr1_gl000191_random LN:106433
@SQ SN:chr1_gl000192_random LN:547496
@SQ SN:chr4_gl000193_random LN:189789
@SQ SN:chr4_gl000194_random LN:191469
@SQ SN:chr7_gl000195_random LN:182896
@SQ SN:chr8_gl000196_random LN:38914
@SQ SN:chr8_gl000197_random LN:37175
@SQ SN:chr9_gl000198_random LN:90085
@SQ SN:chr9_gl000199_random LN:169874
@SQ SN:chr9_gl000200_random LN:187035
@SQ SN:chr9_gl000201_random LN:36148
@SQ SN:chr11_gl000202_random LN:40103
@SQ SN:chr17_gl000203_random LN:37498
@SQ SN:chr17_gl000204_random LN:81310
@SQ SN:chr17_gl000205_random LN:174588
@SQ SN:chr17_gl000206_random LN:41001
@SQ SN:chr18_gl000207_random LN:4262
@SQ SN:chr19_gl000208_random LN:92689
@SQ SN:chr19_gl000209_random LN:159169
@SQ SN:chr21_gl000210_random LN:27682
@SQ SN:chrUn_gl000211 LN:166566
@SQ SN:chrUn_gl000212 LN:186858
@SQ SN:chrUn_gl000213 LN:164239
@SQ SN:chrUn_gl000214 LN:137718
@SQ SN:chrUn_gl000215 LN:172545
@SQ SN:chrUn_gl000216 LN:172294
@SQ SN:chrUn_gl000217 LN:172149
@SQ SN:chrUn_gl000218 LN:161147
@SQ SN:chrUn_gl000219 LN:179198
@SQ SN:chrUn_gl000220 LN:161802
@SQ SN:chrUn_gl000221 LN:155397
@SQ SN:chrUn_gl000222 LN:186861
@SQ SN:chrUn_gl000223 LN:180455
@SQ SN:chrUn_gl000224 LN:179693
@SQ SN:chrUn_gl000225 LN:211173
@SQ SN:chrUn_gl000226 LN:15008
@SQ SN:chrUn_gl000227 LN:128374
@SQ SN:chrUn_gl000228 LN:129120
@SQ SN:chrUn_gl000229 LN:19913
@SQ SN:chrUn_gl000230 LN:43691
@SQ SN:chrUn_gl000231 LN:27386
@SQ SN:chrUn_gl000232 LN:40652
@SQ SN:chrUn_gl000233 LN:45941
@SQ SN:chrUn_gl000234 LN:40531
@SQ SN:chrUn_gl000235 LN:34474
@SQ SN:chrUn_gl000236 LN:41934
@SQ SN:chrUn_gl000237 LN:45867
@SQ SN:chrUn_gl000238 LN:39939
@SQ SN:chrUn_gl000239 LN:33824
@SQ SN:chrUn_gl000240 LN:41933
@SQ SN:chrUn_gl000241 LN:42152
@SQ SN:chrUn_gl000242 LN:43523
@SQ SN:chrUn_gl000243 LN:43341
@SQ SN:chrUn_gl000244 LN:39929
@SQ SN:chrUn_gl000245 LN:36651
@SQ SN:chrUn_gl000246 LN:38154
@SQ SN:chrUn_gl000247 LN:36422
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrUn_gl000249 LN:38502
@RG ID:G0-1 SM:G0-1 PL:illumina
@RG ID:G021-1 SM:G0-1 PL:illumina
@PG ID:MarkDuplicates VN:1.139(8ceee52414e8ab9d13e350ff9cd86d48825dd64d_1442240108) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/mnt/data/Process/G0/G0-1_bowtie2.sorted.bam] OUTPUT=/mnt/data/Process/G0/G0-1_bowtie2_dupremoved.sorted.bam METRICS_FILE=/mnt/data/Process/G0/G0-1_bowtie2_DuplicateResults.txt REMOVE_DUPLICATES=true MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates
@PG ID:bowtie2 PN:bowtie2 VN:2.2.6 CL:"/opt/tools/bowtie2-2.2.6/bowtie2-align-s --wrapper basic-0 -x /mnt/data/GENOMES/hg19/hg19 -S /mnt/data/Process/G0/G0-1_bowtie2.sam -p 16 --very-sensitive -X 1000 --met-stderr --rg-id G0-1 --rg SM:G021-1 -1 /mnt/data/Process/G0/G-0-1_R1_chastitypassed.fastq -2 /mnt/data/Process/G0/G-0-1_R2_chastitypassed.fastq"
@PG ID:GATK IndelRealigner VN:3.4-46-gbc02625 CL:knownAlleles=[] targetIntervals=/mnt/data/Process/G021/G021-1_bowtie2_indelsites.intervals LODThresholdForCleaning=5.0 consensusDeterminationModel=USE_READS entropyThreshold=0.15 maxReadsInMemory=150000 maxIsizeForMovement=3000 maxPositionalMoveAllowed=200 maxConsensuses=30 maxReadsForConsensuses=120 maxReadsForRealignment=20000 noOriginalAlignmentTags=false nWayOut=null generate_nWayOut_md5s=false check_early=false noPGTag=false keepPGTags=false indelsFileForDebugging=null statisticsFileForDebugging=null SNPsFileForDebugging=null
@PG ID:MarkDuplicates VN:2.0.1(524567f601de8e6274b322f6fbc6fd4daef218cc_1453655240) CL:picard.sam.markduplicates.MarkDuplicates INPUT=[/Users/farjoun/Documents/test_416.bam] OUTPUT=/Users/farjoun/Documents/test_416.dups_marked.sam METRICS_FILE=/Users/farjoun/Documents/test_416.dups_metrics.txt MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json PN:MarkDuplicates PP:MarkDuplicates
HS22_154:7:1310:19496:8739 1187 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCBCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGFEGGGGGGFGGGGGGGGGGGGGGGGGGGFFGGGGGGGGGGGFGFGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGFG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427 163 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG CCCCCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCGGGGGGGGGFGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:COMPRESSION_LEVEL
HS22_154:7:2309:20164:53168 1187 chr13 60686196 42 125M = 60686229 158 AAGTTCTCCATCATCTCTAAAGGTGCTGCTGAGCAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAG BBBB?DCDGGGGGGGGGGG>FFCGGGGFC@CCG>EGGG@GGGGGGGGGGGGGGGGGGGG1FGGGGEDG01CFGEGGGGGGGGGGGGGGGGGGGFGGGGFEFGGEGGGGGGGGBG0FGGGGGGGGG MD:Z:56C68 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1310:19496:8739 1107 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGGGFBEGGGGGGGGGGEGEGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGGGGGGGGGGCGGGGGGGEGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCB MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:1312:14938:85427 83 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGBGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGCCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:CP
HS22_154:7:2309:20164:53168 1107 chr13 60686229 42 125M = 60686196 -158 CAATCACTGCTTGCAAATGCAGTATTCAGGTTGGGAAGTGGAGGTCTCTCTTTCTTGCTCCCTGGAATTCTTATGCTGGCAAATTTGTCCAGCTACAAAGAAAGACAGAATGACACACGTGTAAG GGGCGGGGEGEGGBD0GF0GGCGEGF:@GGGGF@<1GGGF>F=FGGGGGGGGGGFGGGGGGEGGGGGGEGFDEGGFFFGGGGGGGGGDGGGEGGGGGGGGGGGEGGGGGGEGGGGGGGGGBCCCC MD:Z:23C101 PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:C
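In the listing above, two of the three pairs now carry flags 1187 and 1107 instead of 163 and 83. The difference is the SAM duplicate bit (0x400), which is how MarkDuplicates marks reads when REMOVE_DUPLICATES=false, as in this run. A small sketch for decoding the flags; the bit values come from the SAM specification, while the `decode` helper and its label strings are only illustrative:

```python
# SAM flag bits relevant to the records in this thread.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x400: "duplicate",
}


def decode(flag):
    """Return the names of the known bits that are set in a SAM flag."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


for flag in (163, 1187, 83, 1107):
    print(flag, decode(flag))
```

Flags 1187 and 1107 decode to the same properties as 163 and 83 plus `duplicate`, so the 1312 pair is the one kept as the representative and the other two pairs are marked.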
I altered the @RG line manually before sending it to you, so that's not the problem. But the /1 /2 problem: is that something I need to fix manually in my raw fastq files before mapping the reads? I'm using -1 and -2 in my bowtie2 command, so are they being mapped as pairs but not recognized as pairs?
Thanks, Phil
On Sun, Jan 31, 2016 at 3:43 PM, Yossi Farjoun notifications@github.com wrote:
Hi @Phillip-a-richmond
I took a look at your sam records. First of all, the SAM file doesn't validate (using ValidateSamFile) for two reasons:
- The @RG field is incorrect
- The readnames have a /1 /2 at their end, and so they are all "unpaired", though marked as paired.
bowtie's release notes (http://bowtie-bio.sourceforge.net/bowtie2/news.shtml) indicate that "When input reads are unpaired, Bowtie 2 no longer removes the trailing /1 or /2 from the read name." So probably you are invoking bowtie in a way that it doesn't realize that the reads are paired.
good luck!
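To illustrate why the trailing /1 and /2 matter: with the suffixes in place the two mates of a pair have different read names, so a tool that matches mates by name cannot join them. A minimal sketch of the name comparison only, not a recommendation for how to repair an existing BAM; `strip_mate_suffix` is a hypothetical helper:

```python
import re

# Matches a trailing "/1" or "/2" at the very end of a read name.
MATE_SUFFIX = re.compile(r"/[12]$")


def strip_mate_suffix(read_name):
    """Return the read name without a trailing /1 or /2, if present."""
    return MATE_SUFFIX.sub("", read_name)


mates = ["HS22_154:7:1310:19496:8739/2", "HS22_154:7:1310:19496:8739/1"]

print(len(set(mates)))                                # names as posted
print(len({strip_mate_suffix(n) for n in mates}))     # names after stripping
```

As posted, the two mates count as two distinct names; once the suffix is stripped they share one, which matches the manually doctored records earlier in the thread.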
PG:Z:MarkDuplicates RG:Z:G021-1 XG:i:0 NM:i:1 XM:i:1 XN:i:0 XO:i:0 AS:i:-5 YS:i:-5 YT:Z:C
Hello, I'm running Picard's MarkDuplicates tool on my paired-end exome sequencing data. The tool is definitely removing duplicate reads (1-4% of the library total), but it seems to have trouble with "dovetailed" reads, i.e. pairs where R1 and R2 overlap.
With paired-end exome sequencing we expect these dovetail cases, but I don't want duplicates left among them, because PCR errors then get propagated and look like real variants (supported by reads on both strands, no less).
My understanding is that an R1/R2 pair is removed as a duplicate if the 5' ends of both reads match those of another pair in the same file.
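To make that understanding concrete, here is a rough sketch of grouping pairs by the 5' ends of both mates. It is not Picard's actual code: the function names `five_prime_end` and `pair_key` are made up for illustration, and it ignores library, optical-duplicate detection, and which pair gets kept. The coordinates are the three pairs from the SAM excerpt in this thread.

```python
import re
from collections import defaultdict

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def five_prime_end(pos, cigar, reverse):
    """Unclipped 5' reference coordinate (1-based) of one alignment."""
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    if not reverse:
        # Forward strand: 5' end is the leftmost base, moved left by leading clips.
        lead = 0
        for n, op in ops:
            if op not in "SH":
                break
            lead += n
        return pos - lead
    # Reverse strand: 5' end is the rightmost base, moved right by trailing clips.
    ref_len = sum(n for n, op in ops if op in "MDN=X")
    trail = 0
    for n, op in reversed(ops):
        if op not in "SH":
            break
        trail += n
    return pos + ref_len - 1 + trail

def pair_key(chrom, mate_a, mate_b):
    """Hypothetical duplicate key: both 5' ends plus strands, order-independent."""
    ends = sorted((five_prime_end(*m), m[2]) for m in (mate_a, mate_b))
    return (chrom, tuple(ends))

# (pos, cigar, is_reverse) for each mate, taken from the attached reads.
pairs = {
    "HS22_154:7:1310:19496:8739": ((60686196, "125M", False), (60686229, "125M", True)),
    "HS22_154:7:1312:14938:85427": ((60686196, "125M", False), (60686229, "125M", True)),
    "HS22_154:7:2309:20164:53168": ((60686196, "125M", False), (60686229, "125M", True)),
}

groups = defaultdict(list)
for name, (mate_a, mate_b) in pairs.items():
    groups[pair_key("chr13", mate_a, mate_b)].append(name)

for key, names in groups.items():
    print(key, len(names), "pair(s)")
```

Under this simplified rule all three pairs land in one group, which matches the attached SAM, where two of the three pairs carry the duplicate flag (1187 and 1107 both include 1024).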
So I have 2 questions:
Attached are some screenshots from IGV to help