Open leqi0001 opened 3 months ago
Hi @leqi0001
genotypes.n_genotypes
and genotypes.n_variants
are correctAs a next step, here is default parse_read that you use with custom UMI. Return None means this read will be ignored.
It is possible that some of filters (nhits tag maybe?) don't work for your data, in this case you should get rid of that check.
def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
score_diff_max = 8, mapq_threshold = 20,
# max. 2 edits --^
p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
"""
returns None if read should be ignored.
Read still can be ignored if it is not in the barcode list
"""
if read.get_tag(score_tag) <= len(read.seq) - score_diff_max:
# too many edits
return None
if read.get_tag(nhits_tag) > 1:
# multi-mapped
return None
if not read.has_tag(umi_tag):
# does not have molecule barcode
return None
if read.mapq < mapq_threshold:
# this one should not be triggered because of NH, but just in case
return None
p_misaligned = p_misaligned_default # default value
ub = hash_string(read.get_tag(umi_tag))
return p_misaligned, ub
Thanks for your reply!
I've checked the chromosomal names are consistent and the genotypes read are consistent with the vcf file I used. Parsed 5386856 SNPs, got 10773712 novel variants
I tried to deactivate all 4 filters like this:
from typing import Optional
from typing import Tuple
from pysam import AlignedRead
from demuxalot.utils import hash_string
def parse_read(read: AlignedRead, umi_tag="UB", nhits_tag="NH", score_tag="AS",
score_diff_max = 8, mapq_threshold = 20,
# max. 2 edits --^
p_misaligned_default = 0.01) -> Optional[Tuple[float, int]]:
"""
returns None if read should be ignored.
Read still can be ignored if it is not in the barcode list
"""
p_misaligned = p_misaligned_default # default value
ub = hash_string(read.get_tag(umi_tag))
return p_misaligned, ub
handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
print(f'Loaded barcodes: {handler}')
##Loaded barcodes: <BarcodeHandler with 89149 barcodes>
parse_read_custom = lambda read: parse_read(read, umi_tag = "MR")
counts = count_snps(
bamfile_location=bamfile_location,
chromosome2positions=genotypes.get_chromosome2positions(),
barcode_handler=handler,
parse_read=parse_read_custom
)
The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file
If I use the raw UMI tag "MR", it can actually run but still output 0 SNPs.
The weird thing is that if I used the real filtered UMI tag "MA", it errors out saying KeyError: "tag 'MA' not present", even though MA is present in the bam file
agree, that's weird.
How about iterating over the BAM file and checking that we can read 'CB' tag?
import pysam
handler = BarcodeHandler.from_file(barcode_filename, tag="CB")
samfile = pysam.AlignmentFile("path-to-your.bam", "rb")
for i, read in enumerate(samfile.fetch('chr1', 100, 120)): # select some good region
if i > 1000: break
print(i, handler.get_barcode_index(read))
It does have trouble reading the barcodes
I tried a regular 10x bam file and it works fine
I double checked that the barcodes in my barcode files can be found in the bam file
pysam is able to recognize both CB and MA tags:
I noticed that barcode handler read the barcodes as int instead of str:
Can confirm that if I force the barcodes to be read as strings, it works without any errors.
Glad you found the reason and solution, good job!
Didn't cross my mind that barcodes could be read as integer type.
If you have a parse_read
version you like, we can add it as rhapsody_specific.py
, I think it could be helpful for others
Thank you so much for your help! Let me make sure the whole pipeline works before I post the code.
I just created a pull request with small modifications. It's the first time I made a pull request. Hopefully I didn't mess up anything.
Suggested solution for rhapsody is to use parse_read from PR #33 that was just merged.
I've renamed the issue and let's keep it open so that others could find this.
Hi,
I'm having trouble finding SNPs overlapping with reads using BAM files generated with BD Rhapsody pipeline. I specified UMI tag (MA) manually and cell barcode is the same (CB). However, the count_snps() step found 0 snps anywhere. It's not due to chromosomal naming or anything like that. The biggest difference is that these bam files have integers as cell barcodes instead of bases. Your suggestions are appreciated!
Le
My code is like this: