[solved] empty output after count_snps() with BD Rhapsody bam files

leqi0001 commented 3 months ago

Hi,

I'm having trouble finding SNPs overlapping with reads using BAM files generated with BD Rhapsody pipeline. I specified UMI tag (MA) manually and cell barcode is the same (CB). However, the count_snps() step found 0 snps anywhere. It's not due to chromosomal naming or anything like that. The biggest difference is that these bam files have integers as cell barcodes instead of bases. Your suggestions are appreciated!

Le

My code is like this:

import os
from pathlib import Path
import pandas as pd
import pysam

from demuxalot.utils import download_file
from demuxalot import BarcodeHandler, ProbabilisticGenotypes, Demultiplexer, count_snps, detect_snps_positions
custom_celltag="CB"
handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
genotype_names = ['C17','C32','C13','C11']
genotype_names.sort()
genotypes = ProbabilisticGenotypes(genotype_names=genotype_names)
genotypes.add_vcf(vcf_filename, prior_strength=100)
custom_umitag="MA"

from demuxalot.cellranger_specific import parse_read
parse_read_custom = lambda read: parse_read(read, umi_tag = "MA")
counts = count_snps(
    bamfile_location=bamfile_location,
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=handler,
    parse_read=parse_read_custom
)

arogozhnikov commented 3 months ago

Hi @leqi0001

i'd still double check chromosome naming (chr_X and X and alike)
then confirm that genotypes.n_genotypes and genotypes.n_variants are correct
integers cell barcodes should be ok. For sure it won't result in zero calls.

As a next step, here is default parse_read that you use with custom UMI. Return None means this read will be ignored.

It is possible that some of filters (nhits tag maybe?) don't work for your data, in this case you should get rid of that check.

def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
               score_diff_max = 8, mapq_threshold = 20,
               # max. 2 edits --^
               p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    if read.get_tag(score_tag) <= len(read.seq) - score_diff_max:
        # too many edits
        return None
    if read.get_tag(nhits_tag) > 1:
        # multi-mapped
        return None
    if not read.has_tag(umi_tag):
        # does not have molecule barcode
        return None
    if read.mapq < mapq_threshold:
        # this one should not be triggered because of NH, but just in case
        return None

    p_misaligned = p_misaligned_default  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

leqi0001 commented 3 months ago

Thanks for your reply!

I've checked the chromosomal names are consistent and the genotypes read are consistent with the vcf file I used. Parsed 5386856 SNPs, got 10773712 novel variants

I tried to deactivate all 4 filters like this:

from typing import Optional
from typing import Tuple
from pysam import AlignedRead
from demuxalot.utils import hash_string
def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
               score_diff_max = 8, mapq_threshold = 20,
               # max. 2 edits --^
               p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
    """
    returns None if read should be ignored.
    Read still can be ignored if it is not in the barcode list
    """
    p_misaligned = p_misaligned_default  # default value
    ub = hash_string(read.get_tag(umi_tag))
    return p_misaligned, ub

handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
print(f'Loaded barcodes: {handler}')
##Loaded barcodes: <BarcodeHandler with 89149 barcodes>

parse_read_custom = lambda read: parse_read(read, umi_tag = "MR")
counts = count_snps(
    bamfile_location=bamfile_location,
    chromosome2positions=genotypes.get_chromosome2positions(),
    barcode_handler=handler,
    parse_read=parse_read_custom
)

The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file

If I use the raw UMI tag "MR", it can actually run but still output 0 SNPs.

arogozhnikov commented 3 months ago

The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file

agree, that's weird.

How about iterating over the BAM file and checking that we can read 'CB' tag?

import pysam

handler = BarcodeHandler.from_file(barcode_filename, tag="CB")

samfile = pysam.AlignmentFile("path-to-your.bam", "rb")
for i, read in enumerate(samfile.fetch('chr1', 100, 120)):  # select some good region
    if i > 1000: break
    print(i, handler.get_barcode_index(read))

leqi0001 commented 3 months ago

It does have trouble reading the barcodes

I tried a regular 10x bam file and it works fine

I double checked that the barcodes in my barcode files can be found in the bam file

pysam is able to recognize both CB and MA tags:

I noticed that barcode handler read the barcodes as int instead of str:

leqi0001 commented 3 months ago

Can confirm that if I force the barcodes to be read as strings, it works without any errors.

arogozhnikov commented 3 months ago

Glad you found the reason and solution, good job!

Didn't cross my mind that barcodes could be read as integer type.

If you have a parse_read version you like, we can add it as rhapsody_specific.py, I think it could be helpful for others

leqi0001 commented 3 months ago

Thank you so much for your help! Let me make sure the whole pipeline works before I post the code.

leqi0001 commented 3 months ago

I just created a pull request with small modifications. It's the first time I made a pull request. Hopefully I didn't mess up anything.

arogozhnikov commented 3 months ago

Suggested solution for rhapsody is to use parse_read from PR #33 that was just merged.

I've renamed the issue and let's keep it open so that others could find this.

arogozhnikov / demuxalot

[solved] empty output after count_snps() with BD Rhapsody bam files #32

My code is like this: