GWW / scsnv

scSNV Mapping tool for 10X Single Cell Data
MIT License

collapse using cellranger bam #16

Closed ttt-404 closed 1 year ago

ttt-404 commented 1 year ago

Hi,

Sorry to bother you again. Is it possible for scSNV collapse to work the same way on a Cell Ranger bam? I divided the cells into different types based on the Cell Ranger expression profile, so I want an SNV matrix with barcodes corrected by Cell Ranger (from the Cell Ranger collapsed bam, rather than directly using pileup).

Thanks a lot if you would help!

GWW commented 1 year ago

Hi,

Cell Ranger bam files are not currently supported. There are certain alignment artifacts emitted by Cell Ranger that need to be corrected for accurate collapsing. If I have time in the future I may try to create a program to collapse reads in an aligner-independent way. The SNV calls from scSNV should be fine to use with Cell Ranger gene expression data. The results from both tools are quite similar.

ttt-404 commented 1 year ago

Thank you for your timely reply! Actually, I find that the number of barcodes differs a bit between the two tools (1422 cells for scSNV and 1143 cells for Cell Ranger). Is it feasible to extract only the barcodes that intersect with Cell Ranger from the scSNV pileup results? I think the read count or expression assigned to each barcode may be a little different. Will that affect the cell type identification?

I also have another question. I am a bit confused about the output in pileup_barcodes.txt. Could you tell me the difference between the ‘bases_covered’ and ‘bases’ columns? Thank you so much for your help!

barcode molecules bases_covered bases
GATGAGGTCAGCGACC 70003 3409359 6160387
CCATTCGCAAAGAATC 67136 3340206 5578481

GWW commented 1 year ago

If you want to use a different barcode list you can just replace the passed_barcodes.txt.gz list with your own. It's just an optionally gzipped text file with the following format: a header line and then one barcode per line. The pileup command uses this list to choose which barcodes are piled up.

barcode
GAACGGACATCCAACA
ACTGAACAGGAGTAGA
ACAGCCGAGTTACGGG
TTGAACGCAATAGCAA
TCGGTAATCAAGCCTA
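As a rough illustration of the replacement described above, here is a minimal Python sketch that keeps only the barcodes present in both lists and writes them in the header-plus-one-barcode-per-line format. The helper names are hypothetical and not part of scSNV. It assumes the Cell Ranger list is a headerless file such as barcodes.tsv.gz whose entries may carry a GEM-well suffix like `-1`, and that the scSNV list holds bare barcodes as in the example above; check both assumptions against your own files.

```python
import gzip


def _open_text(path, mode):
    """Open a plain or gzipped text file based on its extension."""
    opener = gzip.open if str(path).endswith(".gz") else open
    return opener(path, mode + "t")


def normalize(barcode):
    """Drop a trailing suffix such as '-1' (assumed Cell Ranger style)."""
    return barcode.strip().split("-")[0]


def read_barcodes(path, has_header):
    """Read one barcode per line, optionally skipping a header line."""
    with _open_text(path, "r") as fh:
        lines = [line.strip() for line in fh if line.strip()]
    if has_header:
        lines = lines[1:]
    return [normalize(line) for line in lines]


def intersect(scsnv_barcodes, cellranger_barcodes):
    """Keep scSNV barcodes that Cell Ranger also called, in scSNV order."""
    keep = set(cellranger_barcodes)
    return [bc for bc in scsnv_barcodes if bc in keep]


def write_passed_barcodes(barcodes, path):
    """Write a 'barcode' header line followed by one barcode per line."""
    with _open_text(path, "w") as fh:
        fh.write("barcode\n")
        for bc in barcodes:
            fh.write(bc + "\n")
```

A typical use would be to read the original passed_barcodes.txt.gz with `has_header=True`, read the Cell Ranger list with `has_header=False`, and pass the result of `intersect` to `write_passed_barcodes`. Keeping a backup of the original list before overwriting it is probably wise.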

The output in pileup_barcodes.txt was mostly for debugging. The bases_covered is the total coverage across all of the bases in the sample and the bases is the total number of bases covered. I think the fields in the header are switched. I'll put that on my list of things to fix in my next update. For example, take your first line: there are 3.4M bases covered with 6.16M total coverage. If you divide the total coverage by the number of bases covered, you get an average coverage across all of the covered bases of about 1.8x, so approximately 2x.
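The division in that example can be sketched in a few lines of Python. This is only an illustration using the two rows posted earlier in the thread. Following the worked example, it treats the smaller number (the column currently labelled bases_covered) as the count of covered positions and the larger one (labelled bases) as the total coverage. The function name is hypothetical.

```python
# Rows copied from the pileup_barcodes.txt excerpt in this thread.
TABLE = """barcode molecules bases_covered bases
GATGAGGTCAGCGACC 70003 3409359 6160387
CCATTCGCAAAGAATC 67136 3340206 5578481
"""


def mean_coverage(table_text):
    """Return {barcode: total coverage / covered positions} per row."""
    lines = table_text.strip().splitlines()
    header = lines[0].split()
    result = {}
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        covered_positions = int(row["bases_covered"])  # ~3.4M in row one
        total_coverage = int(row["bases"])             # ~6.16M in row one
        result[row["barcode"]] = total_coverage / covered_positions
    return result


for barcode, depth in mean_coverage(TABLE).items():
    print(f"{barcode}\t{depth:.2f}x")
```

For the first barcode this gives roughly 1.8x, which matches the "approximately 2x" estimate above.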

ttt-404 commented 1 year ago

I think I understand what you mean now; you explained it in great detail. Thank you again for your time and your patient reply!