fmicompbio / TabulaMurisSenisData

Data from the Tabula Muris Senis project
https://fmicompbio.github.io/TabulaMurisSenisData/

Combine efforts #5

Closed · csoneson closed 3 years ago

csoneson commented 3 years ago

Hi @StefaniAstrologo (as well as co-developers @machlabd and @federicomarini) - as agreed, we'll move the discussion from https://github.com/LTLA/scRNAseq/pull/34 here instead.

So to summarize: it seems that we're all interested in including the Tabula Muris Senis data in Bioconductor, and I think it would make sense to combine things in a single location. Here is what we have done so far (please, @machlabd, correct me if I'm wrong):

All processing scripts are in the inst/scripts directory. We have not yet uploaded anything to ExperimentHub (we were just about to do it this week 😃).

@StefaniAstrologo - it would be great to hear from you if there are things in your processing pipeline that you think would be useful to add here, or to do differently!

StefaniAstrologo commented 3 years ago

Hi @csoneson! Sorry for this late reply! From the user's point of view, I think it would be useful to give the option to download the datasets separately. In my script I extract all the info from the h5ad files and save the counts, rowData and colData per tissue (I haven't saved the PCA and UMAP as you do). We could either:

  1. Save each of TabulaMurisSenisBulk/Droplet/FACS both as one full matrix and as individual per-tissue files (the accessor functions would then return n SCE objects, or a single SCE of reduced size containing only the tissues requested).
  2. Or add a tissue filter to the accessor functions and, again, return either a single SCE or one SCE per tissue (see the sketch below).
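
For illustration, here is a minimal sketch of option 2, assuming a full SCE with a colData column named `tissue` (the function and column names are hypothetical, not part of the package):

    library(SingleCellExperiment)

    # Hypothetical tissue filter for an accessor (option 2): returns one
    # SCE per requested tissue, or a single combined SCE
    subsetByTissue <- function(sce, tissues, combine = FALSE) {
      sces <- lapply(tissues, function(ts) sce[, sce$tissue == ts])
      names(sces) <- tissues
      if (combine) do.call(cbind, sces) else sces
    }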

In addition to this, I also gave the option to choose the ages of the samples. N.B. not all tissues have all ages, so I made a tibble in the accessor functions to visualise which ages are available for each tissue. For the Droplet data, it looks like this:

    library(tibble)

    # Age availability per tissue for the Droplet data:
    # "X" = samples available, "..." = no samples for that age
    Overview <- tribble(
      ~Tissue,            ~`1m`,  ~`3m`,  ~`18m`, ~`21m`, ~`24m`, ~`30m`,
      #------------------|-------|-------|-------|-------|-------|-------
      "Bladder",          "X",    "X",    "X",    "...",  "X",    "...",
      "Fat",              "...",  "...",  "X",    "X",    "...",  "X",
      "Heart_and_Aorta",  "X",    "X",    "X",    "X",    "X",    "X",
      "Kidney",           "X",    "X",    "X",    "X",    "X",    "X",
      "Large_Intestine",  "...",  "...",  "...",  "...",  "...",  "X",
      "Limb_Muscle",      "X",    "X",    "X",    "X",    "X",    "X",
      "Liver",            "X",    "X",    "X",    "X",    "X",    "X",
      "Lung",             "X",    "X",    "X",    "X",    "...",  "X",
      "Mammary_Gland",    "...",  "X",    "X",    "X",    "...",  "...",
      "Marrow",           "X",    "X",    "X",    "X",    "X",    "X",
      "Pancreas",         "...",  "...",  "X",    "X",    "X",    "X",
      "Skin",             "...",  "...",  "X",    "X",    "...",  "...",
      "Spleen",           "X",    "X",    "X",    "X",    "X",    "X",
      "Thymus",           "...",  "X",    "X",    "X",    "X",    "...",
      "Tongue",           "X",    "X",    "X",    "...",  "X",    "...",
      "Trachea",          "...",  "X",    "...",  "...",  "...",  "..."
    )
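
The tibble could then also be used programmatically, e.g. to look up which ages are available for a given tissue before any download:

    # Ages with available samples for the Lung Droplet data
    avail <- Overview[Overview$Tissue == "Lung", -1]
    names(avail)[avail == "X"]
    #> [1] "1m"  "3m"  "18m" "21m" "30m"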

Let me know what you think :)

csoneson commented 3 years ago

Thank you! We have been discussing a bit on our side, and we agree that it's useful for the user to be able to subset to a given tissue (or several). It's not yet fully clear to us how best to achieve this, mostly because the full count matrix is currently stored on disk (as an HDF5Array object) and accessed using delayed operations, so the question becomes what the most efficient way is of accessing subsets of it (or of combining subsets, if each tissue is stored separately). The other aspect we were considering was indeed the processed data and the reduced dimension representations: these are not necessarily meaningful if only a subset of the data is returned, since they were derived from (and are affected by) the full dataset (the same could be said for the cluster assignments).

My feeling right now is that including the full data set in the package would be the "easiest" option (and would also allow including these additional slots). Given that all the annotations are in the colData, it would also be easy for a user to subset the data manually without actually loading anything into memory (so the user could decide which tissue(s)/timepoint(s) to include, or subset on other things like cluster or subtissue). If the resulting dataset is small enough to hold in memory, the matrix can be realized at that point and there would be no issues with slow on-disk access. The main drawback is that you'd actually need to download the whole dataset (once) and keep it on disk: the full count matrix .h5 file is ~1GB, the normalized data a bit bigger. If you're only ever interested in one tissue, having the ability to download only that object may be advantageous (I don't think this is necessarily an issue for analysis, but it was pointed out that it may prevent use in workshops run in settings with little disk space available).
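
To illustrate the manual subsetting described above, here is a minimal sketch, assuming the full HDF5-backed SCE is already loaded as `sce` and the colData contains `tissue` and `age` columns (the column names are assumptions):

    library(SingleCellExperiment)

    # Subsetting by colData is a delayed operation: nothing is read from
    # the HDF5 file at this point
    sub <- sce[, sce$tissue == "Kidney" & sce$age == "3m"]

    # If the subset fits in memory, realize the counts (here as a sparse
    # in-memory matrix) so downstream access is no longer on-disk
    counts(sub) <- as(counts(sub), "dgCMatrix")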

I think we'd need to run some benchmarks to see whether the access/processing speed differs between objects generated by combining multiple subsets and the full object we have now (or, conversely, between the individual subsets and slices of the full object). It's also not totally clear to me what the most common use case would be: a single tissue, a subtissue, the whole data set, a subset of several tissues, a given time point, ...
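
A first pass at such a benchmark could simply time the competing access patterns; a sketch, assuming `full` is the current full HDF5-backed SCE and `per_tissue` a named list of per-tissue SCEs (both placeholders):

    tissues <- c("Kidney", "Liver")

    # Slicing the full object vs combining per-tissue objects
    system.time(slice <- full[, full$tissue %in% tissues])
    system.time(combined <- do.call(cbind, per_tissue[tissues]))

    # Realizing the counts is where the delayed operations actually pay
    # their cost, so time that too
    system.time(as.matrix(counts(slice)))
    system.time(as.matrix(counts(combined)))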

csoneson commented 3 years ago

👋🏻 @StefaniAstrologo - after closer consideration of the tissue datasets, we concluded that providing them individually as well would indeed be beneficial, since each comes with its own separately normalized data, QC metrics, PCA, tSNE and UMAP. We have included these data sets as well in #7, and they are ready to be uploaded. If you agree, we'd also like to add you as a package author in the DESCRIPTION file. Thanks!
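
With that in place, retrieving a single tissue with its own processed data and reduced dimensions might look like this (a sketch; the accessor name and `tissues` argument are assumptions here, not a documented interface):

    # Hypothetical call returning a named list of per-tissue SCEs
    kidney <- TabulaMurisSenisDroplet(tissues = "Kidney")$Kidney
    reducedDimNames(kidney)  # e.g. "PCA", "TSNE", "UMAP"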

StefaniAstrologo commented 3 years ago

Hi @csoneson! Thanks a lot for considering my suggestion! I am sincerely flattered that you want to include me as an author. Thanks a lot and nice "to meet you" :)

csoneson commented 3 years ago

Great - added here. Let me know if you'd like to include more information (email/ORCID/...).

csoneson commented 3 years ago

Alright - I'm gonna close this issue. Thanks again to everyone for the contributions!