fulcrumgenomics / fgbio

Tools for working with genomic and high throughput sequencing data.
http://fulcrumgenomics.github.io/fgbio/
MIT License

GroupReadsByUmi duplicate marking may fail when secondary and supplementary alignments are included #961

Open nh13 opened 7 months ago

nh13 commented 7 months ago

See: https://github.com/samtools/hts-specs/issues/755

msto commented 3 months ago

Adding some color:

When marking duplicates in a BAM that contains supplementary alignments, fgbio raises an exception. The cause appears to be that the primary alignment in the template was removed (or possibly was not sorted together with its associated supplementary alignment?).

$ fgbio GroupReadsByUmi --strategy=Adjacency --input=input.bam --output=output.bam --mark-duplicates --include-supplementary=true
[2024/05/21 13:38:41 | FgBioMain | Info] Executing GroupReadsByUmi from fgbio version 2.2.1 as msto@Matts-MBP on JRE 22.0.1+8 with snappy, JdkInflater, and JdkDeflater
[2024/05/21 13:38:41 | GroupReadsByUmi | Info] Filtering the input.
[2024/05/21 13:38:41 | GroupReadsByUmi | Info] Sorting the input to TemplateCoordinate order.
[2024/05/21 13:38:41 | GroupReadsByUmi | Info] Seen many non-increasing record positions. Printing Read-names as well.
[2024/05/21 13:38:42 | GroupReadsByUmi | Info] Sorted       432,775 records.  Elapsed time: 00:00:01s.  Time for last 432,775:    1s.  Last read position: chr20:42,368,210.  Last read name: FS10002716:9:BTR99611-1426:1:1116:15810:4090
[2024/05/21 13:38:43 | GroupReadsByUmi | Info] Accepted 432,775 reads for grouping.
[2024/05/21 13:38:43 | GroupReadsByUmi | Info] Filtered out 604 reads due to mapping issues.
[2024/05/21 13:38:43 | GroupReadsByUmi | Info] Filtered out 0 reads that contained one or more Ns in their UMIs.
[2024/05/21 13:38:43 | GroupReadsByUmi | Info] Assigning reads to UMIs and outputting.
[2024/05/21 13:38:43 | FgBioMain | Info] GroupReadsByUmi failed. Elapsed time: 0.09 minutes.
Exception in thread "main" java.lang.IllegalStateException: FS10002716:9:BTR99611-1426:1:1103:7210:2350 did not have a primary R1 record.
        at com.fulcrumgenomics.umi.GroupReadsByUmi$ReadInfo$.$anonfun$apply$3(GroupReadsByUmi.scala:118)
        at scala.Option.getOrElse(Option.scala:201)
        at com.fulcrumgenomics.umi.GroupReadsByUmi$ReadInfo$.apply(GroupReadsByUmi.scala:118)
        at com.fulcrumgenomics.umi.GroupReadsByUmi.takeNextGroup(GroupReadsByUmi.scala:765)
        at com.fulcrumgenomics.umi.GroupReadsByUmi.execute(GroupReadsByUmi.scala:710)
        at com.fulcrumgenomics.cmdline.FgBioMain.makeItSo(FgBioMain.scala:124)
        at com.fulcrumgenomics.cmdline.FgBioMain.makeItSoAndExit(FgBioMain.scala:99)
        at com.fulcrumgenomics.cmdline.FgBioMain$.main(FgBioMain.scala:50)
        at com.fulcrumgenomics.cmdline.FgBioMain.main(FgBioMain.scala)

Setting --include-supplementary=False eliminates the exception, but I haven't examined the contents of the resulting BAM.
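
One way to check the first hypothesis (a primary record missing from the template) is to scan the input for read names that have records but no primary R1. The following is a minimal sketch, not part of fgbio: the function name `templates_missing_primary_r1` is hypothetical, it works on SAM text lines (for example, the output of `samtools view`), and it assumes an unpaired primary record counts as R1. The flag bits come from the SAM specification.

```python
from collections import defaultdict

# FLAG bits as defined in the SAM specification.
PAIRED = 0x1
FIRST_IN_PAIR = 0x40
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def templates_missing_primary_r1(sam_lines):
    """Return the sorted read names that have no primary R1 record.

    Hypothetical helper for illustration; `sam_lines` is any iterable of
    SAM text lines. Header lines (starting with '@') are skipped.
    """
    has_primary_r1 = defaultdict(bool)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        name, flag = fields[0], int(fields[1])
        # A record is primary if it is neither secondary nor supplementary.
        is_primary = not (flag & (SECONDARY | SUPPLEMENTARY))
        # Assumption: an unpaired record is treated as R1.
        is_r1 = not (flag & PAIRED) or bool(flag & FIRST_IN_PAIR)
        has_primary_r1[name] |= is_primary and is_r1
    return sorted(name for name, ok in has_primary_r1.items() if not ok)
```

Passing `sys.stdin` to the function while piping in `samtools view input.bam` would list the affected read names; if the read name from the exception appears there, the primary R1 was likely dropped before grouping rather than mis-sorted.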

nh13 commented 3 months ago

@msto want to give https://github.com/fulcrumgenomics/fgbio/pull/964 a go?