AllenInstitute / coupledAE-patchseq

Multimodal data alignment and cell type analysis with coupled autoencoders.

scRNA-seq data processing #7

Closed: big-rain closed this issue 2 weeks ago

big-rain commented 3 weeks ago

Hi, while reading the data processing code you provided, I noticed that it uses several preprocessed files, such as data.feather, anno.feather, good_fenes_beta_score.csv, specimen_ids.txt, and color_ref.csv. However, I could not find these files at the scRNA-seq download address you gave; only gene count matrices, bam, and fastp files are available there.

I would like to know how to produce these preprocessed files from the raw scRNA-seq data. Can you provide more detailed code to help me with this analysis?

rhngla commented 2 weeks ago

1. data.feather + anno.feather

As you already know, the data.feather (expression values) and anno.feather (per-cell annotations, including the cell type labels used in the paper) files were used to produce the .mat file included in the repository. These files roughly correspond to the count data and metadata files available here. I included the notebooks only for reference; they are not intended to be run by users 😁

2. beta scores + gene selection

The .mat file contains log-cpm normalized expression values for a pre-selected gene set. The set of genes was obtained using the beta score (a description of this score can be found in the associated paper). I have now included the beta score assigned to each gene in genes_beta_score.csv, in case that is of interest. I expect that any reasonable strategy for reducing the set of genes (e.g. highly variable genes, differentially expressed genes, etc.) would provide roughly similar results, and the coupled autoencoder methodology is agnostic to that choice.

3. other links

Check this closed issue and Gouwens et al. 2020 for less processed versions of the data. This Allen Institute page may also be of interest to you.
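As a rough illustration of the gene selection alternative mentioned in point 2, here is a minimal sketch that keeps the genes with the highest variance across cells. It is a plain variance ranking swapped in for illustration, not the beta score from the paper. The function name `select_top_variable_genes`, the toy matrix, and the cells-by-genes orientation are all assumptions.

```python
import numpy as np

def select_top_variable_genes(log_expr, gene_names, n_genes):
    """Keep the n_genes columns with the highest variance across cells.

    log_expr is assumed to be a (cells x genes) array of log-normalized
    expression. This is a simple variance ranking, NOT the beta score.
    """
    log_expr = np.asarray(log_expr, dtype=float)
    variances = log_expr.var(axis=0)
    # Rank genes from most to least variable; stable sort keeps the
    # original order among ties.
    order = np.argsort(-variances, kind="stable")[:n_genes]
    keep = np.sort(order)  # restore the original column order
    return log_expr[:, keep], [gene_names[i] for i in keep]

# Toy data (hypothetical): 3 cells x 3 genes
toy = np.array([[0.0, 1.0, 5.0],
                [0.0, 1.0, 1.0],
                [0.0, 3.0, 9.0]])
subset, kept = select_top_variable_genes(
    toy, ["GeneA", "GeneB", "GeneC"], n_genes=2
)
```

In this toy case GeneA is constant across cells, so it is the gene that gets dropped. With real data, a library routine such as scanpy's `sc.pp.highly_variable_genes` is likely a better choice than raw variance.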

big-rain commented 2 weeks ago

Can you provide working scRNA-seq data preprocessing code, from count matrices to .mat files?

rhngla commented 2 weeks ago

Hi - please use any standard pipeline (e.g. through scanpy) to process the raw data in the data and metadata .csv files linked above; you can add the exon and intron data to get a single counts matrix and then perform log-cpm normalization.
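The two steps described above (summing exon and intron counts, then log-cpm normalization) can be sketched with NumPy as follows. This is only a sketch under stated assumptions: the matrices are taken to be cells by genes with matching row and column order, log-cpm is taken to mean the natural log of (CPM + 1), and the function names and the .mat key names (`log_cpm`, `gene_id`, `sample_id`) are hypothetical and may not match the file in the repository.

```python
import numpy as np

def log_cpm(exon_counts, intron_counts):
    """Sum exon and intron counts, then return log(CPM + 1) per cell.

    Both inputs are assumed to be (cells x genes) arrays with the same
    row and column order.
    """
    counts = (np.asarray(exon_counts, dtype=float)
              + np.asarray(intron_counts, dtype=float))
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid dividing by zero for empty cells
    cpm = counts / totals * 1e6
    return np.log1p(cpm)

def save_mat(path, log_expr, gene_names, cell_ids):
    """Write the matrix to a .mat file; the key names are hypothetical."""
    from scipy.io import savemat  # lazy import; requires SciPy
    savemat(path, {
        "log_cpm": log_expr,
        "gene_id": np.array(gene_names, dtype=object),
        "sample_id": np.array(cell_ids, dtype=object),
    })

# Toy data (hypothetical): 2 cells x 2 genes; the second cell is empty
exon = np.array([[1, 0], [0, 0]])
intron = np.array([[1, 2], [0, 0]])
normalized = log_cpm(exon, intron)
```

With scanpy, the equivalent normalization would be `sc.pp.normalize_total(adata, target_sum=1e6)` followed by `sc.pp.log1p(adata)` on an AnnData object built from the summed counts.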