HorvathLab / NGS

Next-Gen Sequencing tools from the Horvath Lab
https://horvathlab.github.io/NGS/
MIT License

CR:Z tag in the bam file #9

Closed eltonjrv closed 2 years ago

eltonjrv commented 2 years ago

Dear SCExecute team,

Thanks for developing such a tool, which appears to be quite useful for my current purpose (assessing several public scRNA-seq datasets). However, I'm encountering the following error message:

###############################################
[193647] Failed to execute script scExecute
Traceback (most recent call last):
  File "scExecute.py", line 514, in <module>
  File "split.py", line 37, in iterator
  File "pysam/libcalignmentfile.pyx", line 742, in pysam.libcalignmentfile.AlignmentFile.__cinit__
  File "pysam/libcalignmentfile.pyx", line 991, in pysam.libcalignmentfile.AlignmentFile._open
ValueError: file has no sequences defined (mode='rb') - is it SAM/BAM format? Consider opening with check_sq=False
###############################################

I've also noticed that the barcode tag in the BAM file is actually CR:Z rather than CB, which is what the program seems to expect:

scExecute Options:
  Read Files (-r): T06_TH_TOT_5GEX_1_S9.bam
  Read Groups (-G): CellRanger
  Description: Cell barcodes from the CB tag of aligned read - reads without a CB tag or with CB tag not in the accept list (default: file "barcodes.tsv" in the current directory) dropped.
  Specification: tag=CB acceptlist=barcodes.tsv

I hope you guys can shed some light on how to work around this. Thanks, Best, Elton

edwardsnj commented 2 years ago

CR is the "raw" cell-barcode, CB is the "cleaned-up" cell-barcode in Cell-Ranger. See: https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/output/bam

The Z just indicates a string value.
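To make the TAG:TYPE:VALUE layout concrete, here is a minimal sketch that pulls the optional fields out of one SAM text record (for example, a line printed by "samtools view"). The helper name and the example record are hypothetical and only for illustration; this is not how scExecute itself reads tags.

```python
def parse_optional_fields(sam_line):
    """Return {tag: (type_code, value)} for the optional fields of one SAM record."""
    columns = sam_line.rstrip("\n").split("\t")
    tags = {}
    # The first 11 columns are mandatory; every later column is TAG:TYPE:VALUE.
    for field in columns[11:]:
        # Split at most twice so that values containing ':' stay intact.
        tag, type_code, value = field.split(":", 2)
        tags[tag] = (type_code, value)
    return tags


# Hypothetical unaligned record that carries only a raw cell barcode (CR).
record = "read1\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF\tCR:Z:AAACCTGAGAAACCAT"
tags = parse_optional_fields(record)

print(tags.get("CR"))  # the 'Z' type code marks a string value
print("CB" in tags)    # no corrected barcode on this record
```

A record like this one would be dropped by a strategy that looks for CB, since only CR is present.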

If you really want the CR tag, you can define a new CellRanger cell-barcode strategy:

[CellRangerCR_CB]
Name: CellRanger-CR
Description: Cell barcodes from the CR tag of aligned read - reads without a CR tag or with CR tag not in the accept list (default: file "barcodes.tsv" in the current directory) dropped.
Type: CellBarcode
ReadTagValue: tag='CR' acceptlist='barcodes.tsv'

in the file groups.ini in the current working directory.
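As a quick sanity check before running scExecute, the stanza can be read back with Python's standard configparser to confirm that the section and keys come out as intended. This is only a sketch: whether scExecute parses groups.ini with configparser, and how it interprets the ReadTagValue string, are assumptions here, and the shlex-based splitting is purely illustrative.

```python
import configparser
import shlex

# The stanza from this comment, as it might appear in groups.ini.
GROUPS_INI = """\
[CellRangerCR_CB]
Name: CellRanger-CR
Description: Cell barcodes from the CR tag of aligned read - reads without a CR tag or with CR tag not in the accept list (default: file "barcodes.tsv" in the current directory) dropped.
Type: CellBarcode
ReadTagValue: tag='CR' acceptlist='barcodes.tsv'
"""

parser = configparser.ConfigParser()
parser.optionxform = str  # keep key names case-sensitive (default lowercases them)
parser.read_string(GROUPS_INI)
section = parser["CellRangerCR_CB"]

# Illustrative only: split "tag='CR' acceptlist='barcodes.tsv'" into key/value pairs.
params = dict(item.split("=", 1) for item in shlex.split(section["ReadTagValue"]))

print(section["Type"])
print(params)
```

If the section name or a key is misspelled, the lookup raises KeyError, which is an easy way to catch a malformed stanza early.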

However, all that said, it looks like you do not have a valid BAM file - the pysam module can't open it. Can you point me to where you got it from and how you prepared it for SCExecute?

Cheers!

edwardsnj commented 2 years ago

Here is the documentation for how scExecute handles barcodes...

https://horvathlab.github.io/NGS/SCExecute/docs/Barcodes.html

edwardsnj commented 2 years ago

I was able to find the BAM file online based on the filename you included in your issue. This BAM file has been processed by the tool fastq_pre_barcodes from the fastq_utils suite according to the BAM file header. I was also able to reproduce your error, so I should be able to offer more concrete advice shortly.

eltonjrv commented 2 years ago

Hi Edward, Thanks for your prompt support. Yes, it was indeed generated by fastq_pre_barcodes, as we can see through "samtools view -H". Apologies for my naivety, as I've only just started handling public scRNA-seq datasets. My previous scRNA-seq work used only FlyCellAtlas h5ad data, which can be easily loaded into Seurat for downstream analyses starting from the read-count matrix. I'll be eagerly awaiting your next advice for this particular BAM type. Thanks again, Best, Elton

edwardsnj commented 2 years ago

OK, what I've determined is that this BAM file is alignment-free: it is just a convenient way to store the reads (and the barcodes). I can tweak the way that scExecute works to avoid the need for BAM files with alignments in them, but realistically, the tasks you want to run on the single-cell partitioned BAMs will probably require alignments. In short, get these reads (whether from the BAM or fastq files) aligned to a reference genome before using scExecute.
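One way to spot an alignment-free BAM up front is to look for @SQ (reference sequence) lines in its header; the pysam error "file has no sequences defined" points at exactly this. The sketch below works on header text such as the output of "samtools view -H". The function names and both example headers are hypothetical.

```python
def reference_names(header_text):
    """Return the SN values of all @SQ lines in a SAM/BAM header."""
    names = []
    for line in header_text.splitlines():
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        for field in fields[1:]:
            if field.startswith("SN:"):
                names.append(field[3:])
    return names


def is_alignment_free(header_text):
    """True when the header defines no reference sequences at all."""
    return not reference_names(header_text)


# Hypothetical headers: one from a read-storage BAM, one from an aligned BAM.
unaligned_header = "@HD\tVN:1.6\tSO:unsorted\n@PG\tID:fastq_pre_barcodes\tPN:fastq_pre_barcodes\n"
aligned_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:248956422\n"

print(is_alignment_free(unaligned_header))
print(is_alignment_free(aligned_header))
```

A header with no @SQ lines is a strong hint that the reads still need to be aligned before any alignment-dependent analysis.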

eltonjrv commented 2 years ago

Hey Nathan, Thanks a lot for clarifying. I've converted those BAM files to FASTQ (preserving the barcode info) and then successfully ran STARsolo. Cheers, Elton

edwardsnj commented 2 years ago

I'm going to close this issue, which was really an issue with the BAM file. Nevertheless, I will make two changes to scExecute: first, I will tweak the code to permit unaligned BAM files as input; second, I will add a rule for fastq_pre_barcodes cell-barcode tags (CR) to the distribution.

Permitting unaligned BAM files is, most likely, not that useful, since the downstream analyses likely need aligned reads. But scExecute doesn't need them to be aligned itself, so it should not impose this restriction.

If you think there is more to discuss on this issue, please re-open it.

Cheers!