ListerLab / pfc_development


full-raw count matrix #7

Closed haotian-zhuang closed 1 year ago

haotian-zhuang commented 1 year ago

Hi Chuck,

I noticed that the website says "the default count matrix in the RNA anndata objects are the full-raw counts", but the dimension of the default matrix is 154748 × 26747, which is the same as the dimension of the downsampled CPM counts. So I would like to check whether the default count matrix was produced before or after the downsampling step.

I downloaded the data from this link: https://storage.googleapis.com/neuro-dev/Processed_data/RNA-all_full-counts-and-downsampled-CPM.h5ad

Thanks, Haotian

herrinca commented 1 year ago

Hi Haotian,

In the default full count matrix I have removed barcodes with fewer than 1,000 UMIs due to poor quality. The UMI counts have not been changed. Within my notebooks, this was done within the same function as downsampling the counts, which may have caused confusion.

Best, Chuck
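A minimal sketch of the barcode filter described above, written in plain Python on a made-up toy matrix. The function name `filter_barcodes` and the example values are illustrative assumptions, not the notebook's code; only the 1,000-UMI threshold comes from the thread. On an AnnData object, scanpy's `sc.pp.filter_cells(adata, min_counts=1000)` applies a comparable filter, though the thread does not say which call was actually used.

```python
# Toy count matrix: rows are barcodes (cells), columns are genes.
# All values are made up for illustration.
counts = [
    [600, 300, 200],  # 1,100 UMIs -> kept
    [400, 100, 50],   # 550 UMIs   -> removed
    [0, 999, 1],      # 1,000 UMIs -> kept (not fewer than 1,000)
]

MIN_UMIS = 1000  # threshold stated in the thread


def filter_barcodes(matrix, min_umis):
    """Keep rows whose total UMI count is at least min_umis.

    Retained rows are returned unchanged: whole barcodes are dropped,
    but individual counts are never modified.
    """
    return [row for row in matrix if sum(row) >= min_umis]


filtered = filter_barcodes(counts, MIN_UMIS)
print(len(counts), "->", len(filtered), "barcodes")
```

The point of the sketch is that this step changes only the number of rows (barcodes); the gene dimension and the counts of the surviving barcodes are untouched.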

haotian-zhuang commented 1 year ago

Hi Chuck,

Thanks, I see. Just to clarify: in your notebook, the genes were filtered from 28954 to 26747 (by min_cells = 5) after downsampling. So if I downsample the default count matrix, which contains 26747 genes, the result might not be exactly consistent with the downsampled CPM counts.

Thanks, Haotian
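For readers unfamiliar with a `min_cells` gene filter, here is a rough plain-Python sketch of the idea: a gene is kept only if it is detected (count greater than zero) in at least that many cells. The helper name `genes_passing_min_cells` and the toy matrix are assumptions for illustration. scanpy exposes this as `sc.pp.filter_genes(adata, min_cells=5)`, but the thread does not confirm the exact call used in the notebook.

```python
MIN_CELLS = 5  # value quoted in the thread

# Toy matrix: 6 cells (rows) x 3 genes (columns); values are made up.
counts = [
    [3, 0, 1],
    [1, 0, 0],
    [2, 1, 0],
    [5, 0, 0],
    [1, 0, 2],
    [4, 0, 0],
]


def genes_passing_min_cells(matrix, min_cells):
    """Return column indices of genes detected in at least min_cells cells."""
    n_genes = len(matrix[0])
    keep = []
    for g in range(n_genes):
        # A gene counts as detected in a cell when its count is nonzero.
        n_detected = sum(1 for row in matrix if row[g] > 0)
        if n_detected >= min_cells:
            keep.append(g)
    return keep


kept = genes_passing_min_cells(counts, MIN_CELLS)
print("genes kept:", kept)
```

Because detection is evaluated on whatever matrix the filter sees, a sparsely expressed gene can pass on the raw counts yet fail once the counts have been downsampled.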

herrinca commented 1 year ago

That is correct. If you were to downsample the 26,747-gene full count matrix, it would not significantly change the results, but it would not be identical to the supplied downsampled CPM counts.
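To see why the order of operations matters, consider this toy sketch. It assumes downsampling means drawing a fixed number of UMIs per cell without replacement, which may differ in detail from the notebook's method; `downsample_cell`, the target depth, and all counts are hypothetical.

```python
import random


def downsample_cell(counts, target, rng):
    """Randomly draw `target` UMIs without replacement from one cell.

    Cells that already have `target` or fewer UMIs are returned unchanged.
    """
    total = sum(counts)
    if total <= target:
        return list(counts)
    # Expand the counts into a pool of gene indices, one entry per UMI.
    pool = [gene for gene, n in enumerate(counts) for _ in range(n)]
    drawn = rng.sample(pool, target)
    out = [0] * len(counts)
    for gene in drawn:
        out[gene] += 1
    return out


cell = [50, 30, 15, 5]  # made-up UMI counts for four genes
TARGET = 40             # hypothetical downsampling depth
DROPPED_GENE = 3        # stand-in for a gene removed by the later gene filter

# Path A: downsample over all genes, then remove the filtered gene.
a = downsample_cell(cell, TARGET, random.Random(0))
a = [n for g, n in enumerate(a) if g != DROPPED_GENE]

# Path B: start from the already-filtered matrix, then downsample.
b_input = [n for g, n in enumerate(cell) if g != DROPPED_GENE]
b = downsample_cell(b_input, TARGET, random.Random(0))

print("A:", a, "total", sum(a))
print("B:", b, "total", sum(b))
```

Path B always totals exactly the target depth, whereas path A loses whatever UMIs happened to be drawn for the removed gene. The retained genes therefore end up with similar but not identical counts.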

haotian-zhuang commented 1 year ago

Thanks Chuck. This reminds me of the question about the devDEG "RP11-452D21.1", which is missing from the default full count matrix. It might have been removed in the step that filtered 29030 genes down to 26747 genes, while the devDEGs were detected from the "real" raw counts, which include all 29030 genes.

herrinca commented 1 year ago

I believe you are correct. My guess is that is where it was filtered, and yes, the devDEGs were called on full counts pseudo-bulked by batch.
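As a rough illustration of what "pseudo-bulked by batch" means, the sketch below sums raw counts over all cells sharing a batch label. The batch labels, the function name `pseudobulk_by_batch`, and the counts are made up; splitting by trajectory and any later scaling or normalization are deliberately left out.

```python
from collections import defaultdict

# Made-up example: one row of gene counts per cell, plus a batch label per cell.
cell_counts = [
    [1, 0, 3],
    [2, 1, 0],
    [0, 4, 1],
    [5, 0, 0],
]
cell_batches = ["batch_A", "batch_A", "batch_B", "batch_B"]  # hypothetical labels


def pseudobulk_by_batch(counts, batches):
    """Sum raw counts over all cells that share a batch label."""
    n_genes = len(counts[0])
    bulk = defaultdict(lambda: [0] * n_genes)
    for row, batch in zip(counts, batches):
        for g, n in enumerate(row):
            bulk[batch][g] += n
    return dict(bulk)


bulk = pseudobulk_by_batch(cell_counts, cell_batches)
for batch, totals in bulk.items():
    print(batch, totals)
```

Each batch becomes one bulk-like sample, so a gene only needs nonzero counts in the raw matrix to contribute here, independent of any per-cell gene filter applied afterwards.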

haotian-zhuang commented 1 year ago

Thank you Chuck! Is there anywhere on your website where I can find the pseudo-bulked trajectory data, or do I have to process it from the full count matrix?

herrinca commented 1 year ago

Yeah, you can find the logTMM bulk data for each trajectory in its glimma interactive directory. For example, in the L2/3_CUX2 glimma directory, logTMM_cts.csv is the scaled and normalized bulked-by-batch data used to call devDEGs in the paper. Each trajectory has its own logTMM_cts.csv.

haotian-zhuang commented 1 year ago

Thank you Chuck! Your answer is very helpful.