Facs sorted data - assayed genes

federicomarini commented 5 years ago

Hi,

We (Charlotte and I) are trying to convert the existing h5ad files into a merged SingleCellExperiment object for use in R/Bioconductor via the https://github.com/csoneson/TabulaMurisData package.

I noticed upon loading the files via scanpy that not all subsets share the same set of genes.

>>> adata_facs_bat.X
<1561x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 2728411 stored elements in Compressed Sparse Row format>
>>> adata_facs_Bladder.X
<1740x16553 sparse matrix of type '<class 'numpy.float32'>'
    with 8153961 stored elements in Compressed Sparse Row format>
>>> adata_facs_Brain_Myeloid.X
<8956x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 15979574 stored elements in Compressed Sparse Row format>
>>> adata_facs_Brain_NonMyeloid.X
<4614x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 12401088 stored elements in Compressed Sparse Row format>
>>> adata_facs_Diaphragm.X
<1608x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 2776493 stored elements in Compressed Sparse Row format>
>>> adata_facs_gat.X
<2531x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 7571811 stored elements in Compressed Sparse Row format>
>>> adata_facs_Heart.X
<3104x21190 sparse matrix of type '<class 'numpy.float32'>'
    with 10448154 stored elements in Compressed Sparse Row format>
>>> adata_facs_Kidney.X
<1400x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 2173284 stored elements in Compressed Sparse Row format>
>>> adata_facs_Large_Intestine.X
<5942x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 22461504 stored elements in Compressed Sparse Row format>
>>> adata_facs_Limb_Muscle.X
<2334x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 4077488 stored elements in Compressed Sparse Row format>
>>> adata_facs_Liver.X
<1679x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5148416 stored elements in Compressed Sparse Row format>
>>> adata_facs_Lung.X
<3532x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 7385690 stored elements in Compressed Sparse Row format>
>>> adata_facs_Mammary_Gland.X
<3132x17232 sparse matrix of type '<class 'numpy.float32'>'
    with 11823042 stored elements in Compressed Sparse Row format>
>>> adata_facs_Marrow.X
<9734x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 26197974 stored elements in Compressed Sparse Row format>
>>> adata_facs_mat.X
<1960x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5013970 stored elements in Compressed Sparse Row format>
>>> adata_facs_Pancreas.X
<2551x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 9105733 stored elements in Compressed Sparse Row format>
>>> adata_facs_scat.X
<2723x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 7491027 stored elements in Compressed Sparse Row format>
>>> adata_facs_Skin.X
<3468x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 10444947 stored elements in Compressed Sparse Row format>
>>> adata_facs_Spleen.X
<2812x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5185231 stored elements in Compressed Sparse Row format>
>>> adata_facs_Thymus.X
<2629x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 5565327 stored elements in Compressed Sparse Row format>
>>> adata_facs_Tongue.X
<2776x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 11608627 stored elements in Compressed Sparse Row format>
>>> adata_facs_Trachea.X
<2353x22899 sparse matrix of type '<class 'numpy.float32'>'
    with 6583927 stored elements in Compressed Sparse Row format>

I see the majority have 22899 genes, so I was wondering whether additional steps were applied to the files that have fewer. Ideally, genes were simply filtered out if they were not detected in any cell?
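To make the mismatch concrete, here is a minimal sketch of how one could list which genes each subset lacks relative to the union of all subsets. The helper name `missing_genes_per_subset` and the toy gene lists are made up for illustration; with the real files, the lists would presumably come from `list(adata.var_names)` on each loaded AnnData object.

```python
from typing import Dict, List, Set


def missing_genes_per_subset(gene_names: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """For each subset, return the genes found in the union of all subsets but absent from it."""
    union: Set[str] = set().union(*gene_names.values())
    return {name: union - set(genes) for name, genes in gene_names.items()}


# Toy, made-up gene lists (NOT the real data). With the real h5ad files one
# would pass e.g. {"Bladder": list(adata_facs_Bladder.var_names), ...}.
toy = {
    "bat": ["Actb", "Gapdh", "Xist", "Cd3e"],
    "Bladder": ["Actb", "Gapdh", "Cd3e"],
    "Heart": ["Actb", "Xist", "Cd3e"],
}

missing = missing_genes_per_subset(toy)
for name, genes in missing.items():
    # Print the subset name, its gene count, and the genes it lacks.
    print(name, len(toy[name]), sorted(genes))
```

If the subsets with fewer genes lack only genes that were undetected in every cell of that tissue, padding them with all-zero columns would be a safe way to merge; the sketch above only identifies the candidates and does not verify that.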

For the droplet data, this problem does not show up and all subsets have 19860 genes.

Would it be possible to have the "original data" uploaded also for Bladder, MammaryGland, and Heart? (assuming the genes are all in the same order)
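The same-order assumption can be checked rather than taken on trust. Below is a small sketch, again on made-up gene lists, that tests whether a subset's genes appear in the same relative order as in a reference list. The function name `keeps_reference_order` is hypothetical; on real data both arguments would presumably be `list(adata.var_names)` from two of the files.

```python
from typing import Sequence


def keeps_reference_order(reference: Sequence[str], subset: Sequence[str]) -> bool:
    """Return True if `subset` is an order-preserving subsequence of `reference`."""
    remaining = iter(reference)
    # `gene in remaining` advances the iterator until the gene is found,
    # so a gene that appears out of order (or not at all) yields False.
    return all(gene in remaining for gene in subset)


# Toy, made-up gene lists (NOT the real data).
reference = ["Actb", "Gapdh", "Xist", "Cd3e"]
same_order = keeps_reference_order(reference, ["Actb", "Xist", "Cd3e"])
shuffled = keeps_reference_order(reference, ["Xist", "Actb"])
print(same_order, shuffled)
```

When the check fails, matching columns by gene name instead of by position is the safer route for merging.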

I'm tagging @csoneson to follow up on this one.

Thanks in advance!

Federico

aopisco commented 5 years ago

Hi @federicomarini, all the raw data is now available from AWS and the code used for processing will be made available here really soon!

federicomarini commented 5 years ago

Excellent, thanks a lot @aopisco !

I will make sure to coordinate with @csoneson on the next steps.

Federico

aopisco commented 5 years ago

Please reach out if you need something, closing the issue for now!

czbiohub-sf / tabula-muris-senis