arogozhnikov / demuxalot

Reliable, scalable, efficient demultiplexing for single-cell RNA sequencing
MIT License
24 stars 3 forks source link

How do I define which BAM tags contain the UMI / cell barcode? #23

Closed mschilli87 closed 9 months ago

mschilli87 commented 9 months ago

I understand the by default Demuxalot assumes the UMI to be in the UB BAM tag and the cell barcode in the CB BAM tag as they are in cellranger output. I have a BAM input file using XM and XC instead.

I guess that by passing tag="XC" when creating the BarcodeHandler, I can take care of one of those but I have not found any way to specify an alternative to the UB default.

Could anyone please give me a pointer on how to make Demuxalot run on my BAM file?

Thank you in advance for your help.

arogozhnikov commented 9 months ago

Hi @mschilli87

you're correct about replacing CB -> XC.

As for replacing "UB", I need to start with an explanation. UB is a cellranger-specific tag, and demuxalot also uses alignment scoring tag from cellranger (to drop poor alignments. Usually mapq is used for this purpose, but some terrible alignments can have maximal mapq).

All cellranger-specific logic is encapsulated in a single function here: https://github.com/herophilus/demuxalot/blob/master/demuxalot/cellranger_specific.py

You can pass an alternative implementation of parse_read to all functions. Or just edit existing parse_read in-place. LMK if you need any guidance on this

mschilli87 commented 9 months ago

@arogozhnikov: Thx for your response. I think a simple generic parse_read implementation with a UMI tag parameter (if you want, defaulting to "UB") would be nice to have. Would you be willing to provide one as part of demuxalot? I could compose a PR if that helps but I'd appreciate any form or guidance/feedback.

arogozhnikov commented 9 months ago

@mschilli87 This is a good generic starting point:

from pysam import AlignedRead
from typing import Optional, Tuple

from demuxalot.utils import hash_string

class CustomReadFilter:
    def __init__(self, umi_tag='UB', min_mapq=20):
        self.umi_tag = umi_tag
        self.min_mapq = min_mapq

    def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
        if not read.has_tag(self.umi_tag):
            # does not have molecule barcode
            return None
        if read.mapq < self.min_mapq:
            # this one should not be triggered because of NH, but just in case
            return None

        p_misaligned = 0.01  # default value
        ub = hash_string(read.get_tag("UB"))
        return p_misaligned, ub

my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument

PS. I'd still recommend checking if there are some available tags that can help to discard non-reliable alignments

mschilli87 commented 9 months ago

@arogozhnikov: Thx, but shouldn't it be

-        ub = hash_string(read.get_tag("UB"))
+        ub = hash_string(read.get_tag(self.umi_tag))

?

I'll check the available BAM flags regarding p_misaligned.

Would you be up for including such a class in a demuxalot release, so other projects could make use of it without the need to (re-)implement it there?

mschilli87 commented 9 months ago

My files do have the AS and NH tags. So I guess I could go with

from pysam import AlignedRead
from typing import Optional, Tuple

from demuxalot.utils import hash_string

class CustomReadFilter:
    def __init__(self, umi_tag='UB', min_mapq=20):
        self.umi_tag = umi_tag
        self.min_mapq = min_mapq

    def __call__(self, read: AlignedRead) -> Optional[Tuple[float, int]]:
        if read.get_tag("AS") <= len(read.seq) - 8:
            # more than 2 edits
            return None
        if read.get_tag("NH") > 1:
            # multi-mapped
            return None
        if not read.has_tag(self.umi_tag):
            # does not have molecule barcode
            return None
        if read.mapq < self.min_mapq:
            # this one should not be triggered because of NH, but just in case
            return None

        p_misaligned = 0.01  # default value
        ub = hash_string(read.get_tag(self.umi_tag))
        return p_misaligned, ub

my_parse_read = CustomReadFilter(umi_tag="XM") # pass this as a parse_read argument

@arogozhnikov: Does this look good to you? If so, is there any reason not to have this be the default implementation rather than hard-coding the UMI tag to UB?

mschilli87 commented 9 months ago

I guess what I am trying to pitch here is

def parse_read(read: AlignedRead, umi_tag="UB") -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    if read.get_tag("AS") <= len(read.seq) - 8:
        # more than 2 edits
        return None
    if read.get_tag("NH") > 1:
        # multi-mapped
        return None
    if not read.has_tag(umi_tag):
        # does not have molecule barcode
        return None
    if read.mapq < 20:
        # this one should not be triggered because of NH, but just in case
        return None

    p_misaligned = 0.01  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

Wouldn't this keep working the same as the current implementation with the added benefit of allowing to change the UMI tag without the need to copy/paste all that boilerplate code needed for the custom class?

arogozhnikov commented 9 months ago

shouldn't it be

sure, you're correct

My files do have the AS and NH tags ... Does this look good to you? ...

if you have illumina-style reads, this should work. AS/NH are likely to be produced by STAR aligner internally, and I'd expect their meaning is the same as in cellranger.

If you have long reads, number of tolerated errors should be somewhat proportional to length.

I guess what I am trying to pitch here is ...

I guess for user it would be easier to select across available configs, rather than across flags, e.g. cellranger_short, cellranger_pacbio, pipseeker_short etc - so that parse_read could use right setting and the right tags... That's the perfect case, but this unlikely to happen.

Your suggestion would work for STAR-aligned short reads, which is probably enough. So back to your question - PR with either setup is welcome, just make sure it actually works on your data.

mschilli87 commented 9 months ago

Yes, that's exactly my situtation: drop-seq derived short reads sequenced with Illumina and mapped with STAR. I'll fork the repo, make the changes, test them out and open a PR once I confirm it works on my data. I'll leave this issue open until the PR is in.

mschilli87 commented 9 months ago

@arogozhnikov: While working on (or rather testing, actually) my PR I encountered an issue not with the UB tag parameter I am adding, but with the already existing CB one:

While BarcodeHandler.__init__ has the tag parameter I need to adjust to run Demuxalot on my data, BarcodeHandler.from_file does not and returns BarcodeHandler(barcodes) without any chance to modify that flag.

As my Python is a bit rusty, I figured I check with you I got this right before trying to hack around more. My best guess would be to add a kwargs argument to the static function and pass it down to the __init__ function.

Any feedback would be appreciated.

arogozhnikov commented 9 months ago

good catch, that's why things need to be tested 😉

My best guess would be to add a kwargs argument to the static function and pass it down to the init function.

yes, that's the best approach

mschilli87 commented 9 months ago

@arogozhnikov: I have implemented everything and opened three PRs:

  1. 24, addressing the UB part only,

  2. 25, addressing the CB part only, and

  3. 26, combining both of the above for your convenience

I have tested #26

  1. with cellranger-style data and a script not passing any of the new arguments and verified that is works exactly as on current master, and
  2. with my non-cellranger-style data and an adjusted script passing the corresponding parameters to BarcodeHandler.from_file and count_snps, respectively and verified that it runs and yield results in reasonable agreement with running vireo on the same dataset.

If you were to merge those changes and include them in a release, it would greatly simplify my workflow.

Thank you for your guidance so far and you feedback yet to come.

arogozhnikov commented 9 months ago

awesome, thank you! Please see my remark on #24

arogozhnikov commented 9 months ago

thank you @mschilli87 we just sorted out things with repo, and I've merged both your PRs 🎉

If you have time to make a separate version of this https://github.com/arogozhnikov/demuxalot/blob/master/examples/1-plain_demultiplexing.py

for custom tags, I think that would help others with non-cellranger tags

mschilli87 commented 9 months ago

@arogozhnikov: Thank you. I'll try to make some time to add an example as per your suggestion. Is there any chance once I have done so those changes will make it into a (Pypi) release anytime soon? I have drneavin/Demultiplexing_Doublet_Detecting_Docs#48 open which depends on that new API which would significantly simplify deployment for me.

arogozhnikov commented 9 months ago

sure, it should be already live on pypi:

pip install demuxalot==0.4.1
mschilli87 commented 9 months ago

@arogozhnikov: Perfect. Thank you! I opened PR #30 as per your suggestion.