arogozhnikov / demuxalot

Reliable, scalable, efficient demultiplexing for single-cell RNA sequencing
MIT License
24 stars 3 forks source link

[solved] empty output after count_snps() with BD Rhapsody bam files #32

Open leqi0001 opened 3 months ago

leqi0001 commented 3 months ago

Hi,

I'm having trouble finding SNPs overlapping with reads using BAM files generated with BD Rhapsody pipeline. I specified UMI tag (MA) manually and cell barcode is the same (CB). However, the count_snps() step found 0 snps anywhere. It's not due to chromosomal naming or anything like that. The biggest difference is that these bam files have integers as cell barcodes instead of bases. Your suggestions are appreciated!

Le

My code is like this:

import os
from pathlib import Path
import pandas as pd
import pysam

from demuxalot.utils import download_file
from demuxalot import BarcodeHandler, ProbabilisticGenotypes, Demultiplexer, count_snps, detect_snps_positions
custom_celltag="CB"
handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
genotype_names = ['C17','C32','C13','C11']
genotype_names.sort()
genotypes = ProbabilisticGenotypes(genotype_names=genotype_names)
genotypes.add_vcf(vcf_filename, prior_strength=100)
custom_umitag="MA"

from demuxalot.cellranger_specific import parse_read
parse_read_custom = lambda read: parse_read(read, umi_tag = "MA")
counts = count_snps(
    bamfile_location=bamfile_location,
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=handler,
    parse_read=parse_read_custom
)
arogozhnikov commented 3 months ago

Hi @leqi0001

As a next step, here is default parse_read that you use with custom UMI. Return None means this read will be ignored.

It is possible that some of filters (nhits tag maybe?) don't work for your data, in this case you should get rid of that check.

def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
               score_diff_max = 8, mapq_threshold = 20,
               # max. 2 edits --^
               p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    if read.get_tag(score_tag) <= len(read.seq) - score_diff_max:
        # too many edits
        return None
    if read.get_tag(nhits_tag) > 1:
        # multi-mapped
        return None
    if not read.has_tag(umi_tag):
        # does not have molecule barcode
        return None
    if read.mapq < mapq_threshold:
        # this one should not be triggered because of NH, but just in case
        return None

    p_misaligned = p_misaligned_default  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub
leqi0001 commented 3 months ago

Thanks for your reply!

I've checked the chromosomal names are consistent and the genotypes read are consistent with the vcf file I used. Parsed 5386856 SNPs, got 10773712 novel variants

I tried to deactivate all 4 filters like this:

from typing import Optional
from typing import Tuple
from pysam import AlignedRead
from demuxalot.utils import hash_string
def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
               score_diff_max = 8, mapq_threshold = 20,
               # max. 2 edits --^
               p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    p_misaligned = p_misaligned_default  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
print(f'Loaded barcodes: {handler}')
##Loaded barcodes: <BarcodeHandler with 89149 barcodes>

parse_read_custom = lambda read: parse_read(read, umi_tag = "MR")
counts = count_snps(
    bamfile_location=bamfile_location,
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=handler,
    parse_read=parse_read_custom
)

The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file

image

If I use the raw UMI tag "MR", it can actually run but still output 0 SNPs.

image
arogozhnikov commented 3 months ago

The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file

agree, that's weird.

How about iterating over the BAM file and checking that we can read 'CB' tag?

import pysam

handler = BarcodeHandler.from_file(barcode_filename, tag="CB")

samfile = pysam.AlignmentFile("path-to-your.bam", "rb")
for i, read in enumerate(samfile.fetch('chr1', 100, 120)):  # select some good region
    if i > 1000: break
    print(i, handler.get_barcode_index(read))
leqi0001 commented 3 months ago

It does have trouble reading the barcodes

image

I tried a regular 10x bam file and it works fine

image

I double checked that the barcodes in my barcode files can be found in the bam file

pysam is able to recognize both CB and MA tags:

image

I noticed that barcode handler read the barcodes as int instead of str:

image
leqi0001 commented 3 months ago

Can confirm that if I force the barcodes to be read as strings, it works without any errors.

arogozhnikov commented 3 months ago

Glad you found the reason and solution, good job!

Didn't cross my mind that barcodes could be read as integer type.

If you have a parse_read version you like, we can add it as rhapsody_specific.py, I think it could be helpful for others

leqi0001 commented 3 months ago

Thank you so much for your help! Let me make sure the whole pipeline works before I post the code.

leqi0001 commented 3 months ago

I just created a pull request with small modifications. It's the first time I made a pull request. Hopefully I didn't mess up anything.

arogozhnikov commented 3 months ago

Suggested solution for rhapsody is to use parse_read from PR #33 that was just merged.

I've renamed the issue and let's keep it open so that others could find this.