broadinstitute / gatk

Official code repository for GATK versions 4 and up
https://software.broadinstitute.org/gatk

Support for UMIs in PathSeq #6655

Open wir963 opened 4 years ago

wir963 commented 4 years ago

Feature request

Tool(s) or class(es) involved

PathSeq

Description

I would like the PathSeq scoring approach to be able to consider UMIs, such as those used in scRNA-seq experiments like 10x. UMIs are generally stored as a BAM tag, and reads that share the same UMI should be counted only once.

mwalker174 commented 4 years ago

Thank you for your suggestion @wir963. This would certainly be a useful feature. If you are waiting for this capability, I think you could probably write a small pipeline to modify the current output:

  1. Merge UMI read tags into the output PathSeq BAM (involves traversing the input BAM)
  2. Filter duplicate UMI reads (perhaps retaining the one with the best alignment score)
  3. Split BAM into paired and unpaired read BAMs
  4. Feed these back into PathSeqScoreSpark
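Steps 2 and 3 above could be sketched roughly as follows. This is only an illustration under stated assumptions, not PathSeq code: the `Read` record is a hypothetical stand-in for a BAM record, duplicates are assumed to be reads sharing the same (cell barcode, UMI) pair, and `score` is assumed to hold whatever alignment score is available for the read. Reads without a UMI are passed through unchanged, and both mates of a retained read name are kept.

```python
from collections import namedtuple

# Hypothetical stand-in for a BAM record. With pysam, equivalent fields
# would come from an AlignedSegment (query_name, is_paired, tags).
Read = namedtuple("Read", ["name", "umi", "barcode", "score", "is_paired"])


def dedup_by_umi(reads):
    """Keep one read name per (barcode, umi), preferring the highest score.

    Assumption: (barcode, umi) defines a duplicate group. Reads with no UMI
    are not grouped and are always kept.
    """
    reads = list(reads)
    best = {}
    for r in reads:
        if r.umi is None:
            continue
        key = (r.barcode, r.umi)
        if key not in best or r.score > best[key].score:
            best[key] = r
    # Keep by read name so that both mates of a winning pair survive.
    kept_names = {r.name for r in best.values()}
    return [r for r in reads if r.umi is None or r.name in kept_names]


def split_paired_unpaired(reads):
    """Split reads into (paired, unpaired) lists based on the paired flag."""
    paired = [r for r in reads if r.is_paired]
    unpaired = [r for r in reads if not r.is_paired]
    return paired, unpaired
```

The two lists returned by `split_paired_unpaired` correspond to the paired and unpaired BAMs in step 3, which would then be written out and passed to PathSeqScoreSpark in step 4. Whether duplicates should also be grouped by the organism a read maps to is a design choice this sketch leaves open.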

PathSeq currently removes all read tags - I have received requests in the past to fix this. I'm not sure when I'll have a chance to address this, but I will keep the ticket open since it's currently the only feature request.

wir963 commented 4 years ago

Thanks for your suggestion @mwalker174. I'd definitely like to do this soon so I'll implement that suggestion and update this thread with any questions.

wir963 commented 4 years ago

Hey @mwalker174 ,

Do you have any suggestions about how to perform step 1? I naively tried Picard's MergeBamAlignment, using the PathSeq output BAM as the aligned BAM and the PathSeq input BAM as the unmapped BAM, but I get the following error message:

IllegalArgumentException: Do not use this function to merge dictionaries with different sequences in them. Sequences must be in the same order as well. Found [NZ_DS990135.1, NZ_AJSY01000035.1, ...

I tried sorting both BAM files by queryname and removing the alignments from the input BAM using RevertSam, but neither of these worked. I suspect the PathSeq output BAM is the cause, given its references to the microbial sequences. Do you have any suggestions?

wir963 commented 4 years ago

FYI, I'm just using pysam and doing the merge manually, and it's working. I'll keep you posted.
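For anyone attempting the same thing, a manual merge with pysam might look something like the sketch below. It is not the code used in this thread. It assumes the UMI and cell barcode live in the `UB` and `CB` tags (the 10x Cell Ranger convention) and that read names are unchanged between the PathSeq input and output BAMs. The function names and file path parameters are hypothetical.

```python
def collect_tags(reads, tags=("CB", "UB")):
    """Map read name -> {tag: value} for the requested tags that are present."""
    lookup = {}
    for read in reads:
        found = {t: read.get_tag(t) for t in tags if read.has_tag(t)}
        if found:
            lookup[read.query_name] = found
    return lookup


def apply_tags(reads, lookup):
    """Yield each read after copying over any tags recorded for its name."""
    for read in reads:
        for tag, value in lookup.get(read.query_name, {}).items():
            read.set_tag(tag, value)
        yield read


def merge_umi_tags(input_bam, pathseq_bam, output_bam, tags=("CB", "UB")):
    """Copy tags from the original input BAM onto the PathSeq output BAM."""
    import pysam  # imported here so the helpers above work without pysam

    # check_sq=False lets pysam open BAMs with no reference sequences
    # in the header; until_eof=True reads every record without an index.
    with pysam.AlignmentFile(input_bam, "rb", check_sq=False) as src:
        lookup = collect_tags(src.fetch(until_eof=True), tags)

    with pysam.AlignmentFile(pathseq_bam, "rb", check_sq=False) as bam, \
         pysam.AlignmentFile(output_bam, "wb", template=bam) as out:
        for read in apply_tags(bam.fetch(until_eof=True), lookup):
            out.write(read)
```

Because the lookup is keyed on every tagged read in the input BAM, memory use grows with the size of that file. If that becomes a problem, one option is to collect the read names present in the PathSeq output first and store tags only for those names.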