brentp / smoove

structural variant calling and genotyping with existing tools, but, smoothly.
Apache License 2.0

panic: sam: duplicate program name: line 3430: "@PG\tID:SAMBLASTER\tVN:0.1.24\tCL:samblaster -i dummy.cram.readsorted.sam -o dummy.cram.samblster.sam --addMateTags" #197

Open AlecBayliff opened 2 years ago

AlecBayliff commented 2 years ago

[smoove] 2022/04/27 15:54:53 starting with version 0.2.8
panic: sam: duplicate program name: line 3430: "@PG\tID:SAMBLASTER\tVN:0.1.24\tCL:samblaster -i dummy.cram.readsorted.sam -o dummy.cram.samblster.sam --addMateTags"

goroutine 1 [running]:
github.com/brentp/smoove/lumpy.check(...)
        /home/brentp/src/smoove/lumpy/lumpy.go:54
github.com/brentp/smoove/lumpy.lumpy_filter_cmd(0x7ffe572f8dc8, 0x11, 0x7ffe572f8dba, 0x3, 0x7ffe572f8d58, 0x58, 0x0, 0x0, 0x0, 0x0, ...)
        /home/brentp/src/smoove/lumpy/lumpy.go:83 +0xfc5
github.com/brentp/smoove/lumpy.Lumpy(0x7ffe572f8dc5, 0x2, 0x7ffe572f8d58, 0x58, 0x7ffe572f8dba, 0x3, 0xc00016a7b0, 0x1, 0x1, 0xc0000c79a0, ...)
        /home/brentp/src/smoove/lumpy/lumpy.go:133 +0x131
github.com/brentp/smoove/lumpy.Main()
        /home/brentp/src/smoove/lumpy/lumpy.go:351 +0x29b
main.main()
        /home/brentp/src/smoove/cmd/smoove/smoove.go:121 +0x1c4

I'm running into this error when attempting to use smoove. I'm not sure how our data was generated, since it predates my time working here (and the generation process is new to me), but the error appears to be related to SAMBLASTER. Any clues? I can submit a sample bam file via email if necessary.

brentp commented 2 years ago

Can you show the header of that bam/cram, excluding the @SQ lines?

AlecBayliff commented 2 years ago

@HD VN:1.5 SO:coordinate
@PG ID:2898894460 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898894506 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898894581 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898894717 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898959235 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898959389 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898959443 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898959663 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898959786 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898960658 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898960793 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898960945 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898961247 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898961350 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898961797 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898961979 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898961980 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898962153 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898962164 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:2898962566 PN:bwa CL:/usr/local/bin/bwa mem -K 100000000 -t 8 -Y -p -R @RG VN:0.7.15-r1140
@PG ID:MarkDuplicates PN:MarkDuplicates CL:picard.sam.markduplicates.MarkDuplicates INPUT=[NameSorted.bam] OUTPUT=/dev/stdout METRICS_FILE=mark_dups_metrics.txt ASSUME_SORT_ORDER=queryname QUIET=true VALIDATION_STRINGENCY=LENIENT COMPRESSION_LEVEL=0 MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates READ_NAME_REGEX=<optimized capture of last three ':' separated fields as numeric values> OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json VN:2.4.1()
@PG ID:SAMBLASTER CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-1B8F8516 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-20A67EE1 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-2B71A7F7 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-439AB4F4 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-450A1372 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-48FFF10A CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-4A937C63 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-4B7BF877 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-566C21FA CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-5B497C48 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-61D1DB43 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-742E787B CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-7598406A CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-77A583B3 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-7AA4766C CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-7C99F9A4 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-9AE688F CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-A1CDAB6 CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:SAMBLASTER-C93BF CL:samblaster -i stdin -o stdout --acceptDupMarks --addMateTags VN:0.1.24
@PG ID:GATK PrintReads VN:3.6-0-gf185a75 CL:readGroup=null platform=null number=-1 sample_file=[] sample_name=[] simplify=false no_pg_tag=false
@RG ID:2898894460 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF772CCXY.6.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898894506 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF772CCXY.5.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898894581 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF772CCXY.7.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898894717 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF772CCXY.8.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898959235 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.5.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898959389 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.1.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898959443 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.4.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898959663 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.3.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898959786 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.6.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898960658 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.7.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898960793 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.2.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898960945 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HFFJTCCXY.8.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898961247 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.5.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898961350 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.4.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898961797 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.1.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898961979 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.8.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898961980 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.2.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898962153 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.7.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898962164 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.6.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@RG ID:2898962566 CN:WUGSC LB:H_ZK-11476752-lib1 PL:Illumina PU:HF5GWCCXY.3.ATACGGCG-ATACGGCG SM:H_ZK-11476752
@PG ID:SAMBLASTER VN:0.1.24 CL:samblaster -i H_ZK-11476752.cram.readsorted.sam -o H_ZK-11476752.cram.samblster.sam --addMateTags

brentp commented 2 years ago

Hi, you have 2 @PG lines with ID:SAMBLASTER. biogo/hts, which smoove uses to parse bam/sam/cram files, is very strict about headers. You can run samtools reheader on your bam file to remove one of those lines, and then smoove should work fine.

AlecBayliff commented 2 years ago

Thanks, this fixed the issue. At least, I was able to get it working by converting from bam > sam > bam and removing the lines in the sam file. Are you aware of a better way to do this without conversion? samtools reheader -c "grep -v ^@PG" test.bam > reheader.bam is giving me an empty file.
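One possible way to avoid the bam > sam > bam round trip is to edit only the header text and hand the result back to samtools reheader. The sketch below is not from smoove or samtools. It assumes the header is available as plain text with tab-separated fields (as printed by samtools view -H), and the function name dedupe_pg_lines is made up for illustration. It keeps the first @PG line for each ID and drops later repeats, so the second ID:SAMBLASTER line goes away while the other @PG records remain (unlike grep -v ^@PG, which strips all of them).

```python
def dedupe_pg_lines(header_text: str) -> str:
    """Return the SAM header with repeated @PG IDs removed.

    Assumption: fields are tab-separated TAG:VALUE pairs, as in a real
    SAM header. The first @PG line seen for a given ID is kept; any
    later @PG line with the same ID is dropped. All other lines pass
    through unchanged and in their original order.
    """
    seen_ids = set()
    kept = []
    for line in header_text.splitlines():
        if line.startswith("@PG"):
            fields = line.split("\t")[1:]
            # Pull out the value of the ID: field, if there is one.
            pg_id = next((f[3:] for f in fields if f.startswith("ID:")), None)
            if pg_id is not None and pg_id in seen_ids:
                continue  # duplicate program ID: skip this line
            seen_ids.add(pg_id)
        kept.append(line)
    return "".join(kept_line + "\n" for kept_line in kept)
```

If the output is saved to a file (say new_header.sam, a hypothetical name), it can be applied with samtools reheader new_header.sam test.bam > reheader.bam. Note that dropping a @PG record can leave a PP: reference in another record pointing at nothing, so it is worth checking the result before running smoove.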