czbiohub-sf / tabula-muris-senis

Tabula Muris Senis
http://tabula-muris-senis.ds.czbiohub.org
BSD 3-Clause "New" or "Revised" License

How was the droplet (and facs) data processed/normalized #19

Closed machlabd closed 3 years ago

machlabd commented 3 years ago

Hello,

Charlotte (@csoneson), Federico (@federicomarini) and I are trying to convert some of the h5ad files into objects that can be read into R and used with R/Bioconductor. We are looking in particular at these two files from the droplet data:

https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102?file=23938934
https://figshare.com/articles/dataset/Processed_files_to_use_with_scanpy_/8273102?file=23936684

After extracting the matrices from these files, it seems that one (23938934) contains the raw counts and the other (23936684) contains some processed form of the counts. Could you elaborate on which transformations and processing steps were applied to the counts? We are interested in using the processed h5ad file (23936684) because it also contains the reduced-dimension representations, which would be nice to include alongside the raw data.

Thank you and best, Dania

aopisco commented 3 years ago

Hi all, @machlabd @csoneson @federicomarini, the data was log-normalized to max 10e4 and then scaled (0,10). To convert from h5ad to Seurat, the best route is to go via loom, but that will lose the embeddings. I have not tried using this package, but it seems it might be helpful here. Let me know if there's anything I can help with, or if you find a nice solution, please share it back!
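For readers following along, here is a minimal NumPy sketch of one possible reading of the description above. It assumes that "scaled (0,10)" means per-gene scaling to unit variance with values clipped at 10 (as scanpy's `sc.pp.scale(adata, max_value=10)` does), rather than min-max rescaling into the range [0, 10]. That interpretation, the function name `log_normalize_and_scale`, and the `zero_center` default are assumptions made for illustration and are not confirmed in this thread.

```python
import numpy as np

def log_normalize_and_scale(counts, target_sum=1e4, max_value=10.0, zero_center=True):
    """Hypothetical helper: library-size normalize, log1p, then scale genes.

    counts is a cells x genes array of raw counts.
    """
    counts = np.asarray(counts, dtype=float)

    # 1. Scale each cell (row) so that its counts sum to target_sum.
    lib_size = counts.sum(axis=1, keepdims=True)
    lib_size[lib_size == 0] = 1.0  # avoid division by zero for empty cells
    x = counts / lib_size * target_sum

    # 2. Natural log of (1 + x).
    x = np.log1p(x)

    # 3. Scale each gene (column) to unit variance, optionally centering it,
    #    then clip large values at max_value (assumed meaning of "(0,10)").
    mean = x.mean(axis=0)
    std = x.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # leave constant genes unscaled
    if zero_center:
        x = x - mean
    x = x / std
    return np.minimum(x, max_value)

# Toy data only; not taken from the Tabula Muris Senis files.
rng = np.random.default_rng(0)
toy_counts = rng.poisson(5.0, size=(50, 5))
scaled = log_normalize_and_scale(toy_counts)
```

If the result of a min-max rescaling to [0, 10] does not match the processed matrix, comparing against this variance-based scaling (with and without `zero_center`) may be a useful next check.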

machlabd commented 3 years ago

Hello, thank you for your prompt reply! Based on my understanding of your description, I first scaled the counts in each cell to sum to 1e4 (this corrects for library size), then log-transformed them, and then scaled them to [0,10]. The resulting values are still very different from the normalized values in the h5ad file. Did I miss a step or misunderstand something?

So far we can load the count matrices from the loom files into R as DelayedArrays, which avoids loading the whole matrix into memory. The package you mention is indeed very helpful for getting a SingleCellExperiment object (similar to AnnData in Python) in R with all metadata and reduced-dimension representations available.

aopisco commented 3 years ago

@machlabd did you start with raw data?

machlabd commented 3 years ago

@aopisco yes, the raw count matrix.

machlabd commented 3 years ago

Hi @aopisco, we have managed to get the correct processed version starting from the raw counts, following your comment. Thank you for your help! I will now close this issue.

ceesu commented 3 years ago

Quick question about this: I noticed that the "official-annotations.h5ad" and "official-raw-obj.h5ad" objects from the figshare links above seem to have the same dimensions. Have both of these files been filtered as described in the methods, i.e. by removing "genes that were not expressed in at least 3 cells and then cells that did not have at least 250 detected genes"? I checked the "official-annotations.h5ad" version of the file, and in the metadata field "n_counts" (which I think refers to the unnormalized total count per cell), the smallest entry is 2500 rather than 250.
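To make the distinction in the question above concrete, here is a small NumPy sketch of the quoted filter, assuming a cells x genes raw count matrix. The function name `filter_counts` and the toy matrix are hypothetical. Note that the quoted threshold of 250 applies to detected genes per cell, whereas "n_counts" would be the total count per cell, so a minimum "n_counts" of 2500 does not by itself contradict the 250-gene filter; it may instead point to a separate threshold on total counts, which this thread does not confirm.

```python
import numpy as np

def filter_counts(counts, min_cells=3, min_genes=250):
    """Hypothetical helper applying the quoted filter, genes first, then cells.

    Returns the filtered matrix plus per-cell detected genes and total counts.
    """
    counts = np.asarray(counts)

    # Keep genes detected (count > 0) in at least min_cells cells.
    gene_keep = (counts > 0).sum(axis=0) >= min_cells
    counts = counts[:, gene_keep]

    # Keep cells with at least min_genes detected genes.
    n_genes = (counts > 0).sum(axis=1)
    cell_keep = n_genes >= min_genes
    counts = counts[cell_keep]

    n_counts = counts.sum(axis=1)  # total counts per remaining cell
    return counts, n_genes[cell_keep], n_counts

# Toy example with small thresholds; not real data.
toy = np.array([
    [5, 0, 1, 0],
    [3, 2, 0, 0],
    [0, 1, 4, 0],
    [1, 0, 0, 9],
])
filtered, n_genes, n_counts = filter_counts(toy, min_cells=2, min_genes=2)
```

Comparing the minimum of the recomputed `n_genes` and `n_counts` with the stored metadata columns is one way to check which thresholds a given file appears to satisfy.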

Thanks!