Closed mschilli87 closed 9 months ago
Hi @mschilli87
you're correct about replacing CB -> XC.
As for replacing "UB", I need to start with an explanation. UB is a cellranger-specific tag, and demuxalot also uses alignment scoring tag from cellranger (to drop poor alignments. Usually mapq is used for this purpose, but some terrible alignments can have maximal mapq).
All cellranger-specific logic is encapsulated in a single function here: https://github.com/herophilus/demuxalot/blob/master/demuxalot/cellranger_specific.py
You can pass an alternative implementation of parse_read to all functions. Or just edit existing parse_read
in-place. LMK if you need any guidance on this
@arogozhnikov: Thx for your response. I think a simple generic parse_read
implementation with a UMI tag parameter (if you want, defaulting to "UB"
) would be nice to have. Would you be willing to provide one as part of demuxalot? I could compose a PR if that helps but I'd appreciate any form or guidance/feedback.
@mschilli87 This is a good generic starting point:
from pysam import AlignedRead
from typing import Optional, Tuple
from demuxalot.utils import hash_string
class CustomReadFilter:
def __init__(self, umi_tag='UB', min_mapq=20):
self.umi_tag = umi_tag
self.min_mapq = min_mapq
def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
if not read.has_tag(self.umi_tag):
# does not have molecule barcode
return None
if read.mapq < self.min_mapq:
# this one should not be triggered because of NH, but just in case
return None
p_misaligned = 0.01 # default value
ub = hash_string(read.get_tag("UB"))
return p_misaligned, ub
my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument
PS. I'd still recommend checking if there are some available tags that can help to discard non-reliable alignments
@arogozhnikov: Thx, but shouldn't it be
- ub = hash_string(read.get_tag("UB"))
+ ub = hash_string(read.get_tag(self.umi_tag))
?
I'll check the available BAM flags regarding p_misaligned
.
Would you be up for including such a class in a demuxalot release, so other projects could make use of it without the need to (re-)implement it there?
My files do have the AS
and NH
tags. So I guess I could go with
from pysam import AlignedRead
from typing import Optional, Tuple
from demuxalot.utils import hash_string
class CustomReadFilter:
def __init__(self, umi_tag='UB', min_mapq=20):
self.umi_tag = umi_tag
self.min_mapq = min_mapq
def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
if read.get_tag("AS") <= len(read.seq) - 8:
# more than 2 edits
return None
if read.get_tag("NH") > 1:
# multi-mapped
return None
if not read.has_tag(self.umi_tag):
# does not have molecule barcode
return None
if read.mapq < self.min_mapq:
# this one should not be triggered because of NH, but just in case
return None
p_misaligned = 0.01 # default value
ub = hash_string(read.get_tag(self.umi_tag))
return p_misaligned, ub
my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument
@arogozhnikov: Does this look good to you? If so, is there any reason not to have this be the default implementation rather than hard-coding the UMI tag to UB
?
I guess what I am trying to pitch here is
def parse_read(read: AlignedRead, umi_tag="UB") -> Optional[Tuple[float, int]]:
"""
returns None if read should be ignored.
Read still can be ignored if it is not in the barcode list
"""
if read.get_tag("AS") <= len(read.seq) - 8:
# more than 2 edits
return None
if read.get_tag("NH") > 1:
# multi-mapped
return None
if not read.has_tag(umi_tag):
# does not have molecule barcode
return None
if read.mapq < 20:
# this one should not be triggered because of NH, but just in case
return None
p_misaligned = 0.01 # default value
ub = hash_string(read.get_tag(umi_tag))
return p_misaligned, ub
Wouldn't this keep working the same as the current implementation with the added benefit of allowing to change the UMI tag without the need to copy/paste all that boilerplate code needed for the custom class?
shouldn't it be
sure, you're correct
My files do have the AS and NH tags ... Does this look good to you? ...
if you have illumina-style reads, this should work. AS/NH are likely to be produced by STAR aligner internally, and I'd expect their meaning is the same as in cellranger.
If you have long reads, number of tolerated errors should be somewhat proportional to length.
I guess what I am trying to pitch here is ...
I guess for user it would be easier to select across available configs, rather than across flags, e.g. cellranger_short
, cellranger_pacbio
, pipseeker_short
etc - so that parse_read could use right setting and the right tags... That's the perfect case, but this unlikely to happen.
Your suggestion would work for STAR-aligned short reads, which is probably enough. So back to your question - PR with either setup is welcome, just make sure it actually works on your data.
Yes, that's exactly my situtation: drop-seq derived short reads sequenced with Illumina and mapped with STAR. I'll fork the repo, make the changes, test them out and open a PR once I confirm it works on my data. I'll leave this issue open until the PR is in.
@arogozhnikov: While working on (or rather testing, actually) my PR I encountered an issue not with the UB
tag parameter I am adding, but with the already existing CB
one:
While BarcodeHandler.__init__
has the tag
parameter I need to adjust to run Demuxalot on my data, BarcodeHandler.from_file
does not and returns BarcodeHandler(barcodes)
without any chance to modify that flag.
As my Python is a bit rusty, I figured I check with you I got this right before trying to hack around more. My best guess would be to add a kwargs
argument to the static function and pass it down to the __init__
function.
Any feedback would be appreciated.
good catch, that's why things need to be tested 😉
My best guess would be to add a kwargs argument to the static function and pass it down to the init function.
yes, that's the best approach
@arogozhnikov: I have implemented everything and opened three PRs:
UB
part only,CB
part only, andI have tested #26
master
, andBarcodeHandler.from_file
and count_snps
, respectively and verified that it runs and yield results in reasonable agreement with running vireo
on the same dataset.If you were to merge those changes and include them in a release, it would greatly simplify my workflow.
Thank you for your guidance so far and you feedback yet to come.
awesome, thank you! Please see my remark on #24
thank you @mschilli87 we just sorted out things with repo, and I've merged both your PRs 🎉
If you have time to make a separate version of this https://github.com/arogozhnikov/demuxalot/blob/master/examples/1-plain_demultiplexing.py
for custom tags, I think that would help others with non-cellranger tags
@arogozhnikov: Thank you. I'll try to make some time to add an example as per your suggestion. Is there any chance once I have done so those changes will make it into a (Pypi) release anytime soon? I have drneavin/Demultiplexing_Doublet_Detecting_Docs#48 open which depends on that new API which would significantly simplify deployment for me.
sure, it should be already live on pypi:
pip install demuxalot==0.4.1
@arogozhnikov: Perfect. Thank you! I opened PR #30 as per your suggestion.
I understand the by default Demuxalot assumes the UMI to be in the
UB
BAM tag and the cell barcode in theCB
BAM tag as they are in cellranger output. I have a BAM input file usingXM
andXC
instead.I guess that by passing
tag="XC"
when creating theBarcodeHandler
, I can take care of one of those but I have not found any way to specify an alternative to theUB
default.Could anyone please give me a pointer on how to make Demuxalot run on my BAM file?
Thank you in advance for your help.