File formats for --dump-eqclasses

COMBINE-lab / alevin-fry

🐟 🔬🦀 alevin-fry is an efficient and flexible tool for processing single-cell sequencing data, currently focused on single-cell transcriptomics and feature barcoding.

https://alevin-fry.readthedocs.io

BSD 3-Clause "New" or "Revised" License


File formats for --dump-eqclasses #112

Closed davidaknowles closed 1 year ago

davidaknowles commented 1 year ago

I'm trying to understand gene_eqclass.txt.gz and geqc_counts.mtx. I mostly get it from skimming quant.rs and the comments therein: gene_eqclass.txt.gz has a line giving the number of genes (or the number of USA targets), then a line giving the number of ECs. Then there is one line per EC, where the first n-1 entries are gene IDs and the last entry is the EC idx (which does not follow the order of the lines in the file).

Then geqc_counts.mtx is cells x ECs, presumably with the row labels (cell barcodes) given by quants_mat_rows.txt. But what is the indexing for the columns, i.e. the ECs? Is that the EC idx (the last entry of each line in gene_eqclass.txt.gz) or the line number (minus 2) from gene_eqclass.txt.gz?

Thanks!

k3yavi commented 1 year ago

Hi @davidaknowles ,

Judging from this line and the way geqmap was populated here, I think it's the former, i.e., the indexing for the columns is the last entry of each line in gene_eqclass.txt.gz, not the line number.

@rob-p can correct me if I am wrong.

davidaknowles commented 1 year ago

Thanks Avi. @rob-p enjoy RECOMB, no rush.

rob-p commented 1 year ago

Hi @davidaknowles,

Sorry for taking so long to get back here. @k3yavi's interpretation is correct. What is written in the MatrixMarket file for each triplet is the row id (i.e. cell barcode), column id (equivalence class id), and UMI count.
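For anyone reading the triplets by hand, here is a minimal Python sketch of pulling (cell row, EC id, UMI count) out of a coordinate-format MatrixMarket file. The function name `read_mtx_triplets` is hypothetical and not part of alevin-fry. MatrixMarket indices are 1-based on disk, so the sketch shifts them to 0-based; that the shifted column index then equals the EC id from gene_eqclass.txt.gz is an assumption worth checking against your own output.

```python
def read_mtx_triplets(lines):
    """Yield (cell_row, ec_id, umi_count) from coordinate MatrixMarket text.

    `lines` is any iterable of text lines, e.g. an open file handle.
    """
    size_line_seen = False
    for line in lines:
        line = line.strip()
        # Skip blank lines and the '%'-prefixed banner/comment lines.
        if not line or line.startswith("%"):
            continue
        if not size_line_seen:
            # First data line is "n_rows n_cols n_nonzero"; not a triplet.
            size_line_seen = True
            continue
        row, col, value = line.split()
        # MatrixMarket is 1-based on disk. ASSUMPTION: EC ids are 0-based,
        # so col - 1 is the id to look up in gene_eqclass.txt.gz.
        yield int(row) - 1, int(col) - 1, float(value)
```

Usage would be along the lines of `with open("geqc_counts.mtx") as fh: triplets = list(read_mtx_triplets(fh))`, with the row index pointing into quants_mat_rows.txt.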

The equivalence class id is the unique, distinct id associated with each pattern of gene occurrences (or gene + splicing-status occurrences). For each equivalence class label in gene_eqclass.txt.gz, the equivalence class id is the last number of the line.
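To make the lookup concrete, here is a rough Python sketch that parses gene_eqclass.txt.gz into a dictionary keyed by equivalence class id rather than by line position. It follows the layout described in this thread (one line with the number of genes or USA targets, one line with the number of ECs, then one line per EC ending in the EC id). The function names are hypothetical, and whitespace-separated integer fields are an assumption.

```python
import gzip


def parse_gene_eqclasses(lines):
    """Return (num_targets, num_ecs, labels) from the lines of the file.

    `labels` maps EC id -> tuple of gene (or gene + splicing-status) ids.
    """
    stripped = (ln.strip() for ln in lines)
    non_empty = (ln for ln in stripped if ln)
    num_targets = int(next(non_empty))  # number of genes / USA targets
    num_ecs = int(next(non_empty))      # number of equivalence classes
    labels = {}
    for ln in non_empty:
        # All but the last field are gene ids; the last field is the EC id.
        *gene_ids, ec_id = (int(tok) for tok in ln.split())
        labels[ec_id] = tuple(gene_ids)
    return num_targets, num_ecs, labels


def load_gene_eqclasses(path):
    """Convenience wrapper that reads the gzipped text file at `path`."""
    with gzip.open(path, "rt") as fh:
        return parse_gene_eqclasses(fh)
```

Keying the dictionary on the last field means the order of lines in the file no longer matters: a column id taken from geqc_counts.mtx can be looked up directly in `labels`.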

Best, Rob

davidaknowles commented 1 year ago

Perfect, thank you!