asmagen / robustSingleCell

Robust single cell clustering and comparison of population compositions across tissues and experimental models via similarity analysis.
http://dx.doi.org/10.1016/j.celrep.2019.10.131
MIT License
13 stars 2 forks source link

Issue with read.data #30

Open asmagen opened 5 years ago

asmagen commented 5 years ago

Similarly to read.preclustered.datasets, the unique commands needs to be removed from all the following code (taken from read.data function). We have already don't that for read.preclustered.datasets previously. The structure of initialize.project is that each dataset has a value of origins and experiments which can be duplicated. Applying unique onto these vectors is problematic.

dataset.labels <- rep(paste(unique(environment$origins)[environment$datasets ==
                dataset], " (", unique(environment$experiments)[environment$datasets == dataset],
                ")", sep = ""), ncol(measurements))
            origins <- rep(unique(environment$origins)[environment$datasets == dataset],
                ncol(measurements))
            experiments <- rep(unique(environment$experiments)[environment$datasets == dataset],
                ncol(measurements))

See example of how I run initialize.project:

combined <- initialize.project(datasets = c("TbetWT_Exp.1","TbetWT_Exp.2","TbetKO_Exp.1","TbetKO_Exp.2"), 
    origins = c("Tbet NKp46 WT cells","Tbet NKp46 WT cells","Tbet NKp46 KO cells","Tbet NKp46 KO cells"),
    experiments = c("Replicate 1","Replicate 2","Replicate 1","Replicate 2"),
    data.path = '/data/magenay/ILC/data',
    work.path = '/data/magenay/ILC',
    marker.genes = markers)#,clear.history=T
asmagen commented 5 years ago

Here is a reproducible example (take any four datasets you have localy but don’t change the origins and experiments labels):

pooled_env <- initialize.project(datasets = c("LCMV1", "LCMV2","LCMV3", "LCMV4"), origins = c("LCMV WT", "LCMV WT", "LCMV KO", "LCMV KO"), experiments = c("Rep1", "Rep2","Rep1", "Rep2"), data.path = file.path(tempdir(), "LCMV"), work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))

pooled_env <- read.data(pooled_env)

table(combined$dataset.labels) NA (NA) LCMV KO (Replicate 2) LCMV WT (Replicate 1) 13242 8022 5368

Mamie commented 5 years ago

@asmagen The problem after removing the unique command is that rerunning cause the length of origins, experiments, and dataset.labels are modified again, which causes error in the subsequent workflow. See the following output.

> library(robustSingleCell)
> download_LCMV()
> LCMV1 <- initialize.project(datasets = "LCMV1", 
                          origins = "CD44+ cells",
                          experiments = "Rep1",
                          data.path = file.path(tempdir(), "LCMV"),
                          work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))
> LCMV1 <- read.data(LCMV1, subsample = 500)
> LCMV1 <- get.variable.genes(LCMV1)
> summary(LCMV1)
                   Length  Class      Mode     
datasets                 1 -none-     character
origins                483 -none-     character
experiments            483 -none-     character
labels                   1 -none-     function 
data.path                1 -none-     character
project                  1 -none-     character
work.path                1 -none-     character
baseline.work.path       1 -none-     character
res.data.path            1 -none-     character
baseline.data.path       1 -none-     character
marker.genes            67 -none-     character
genes.filter         27998 -none-     logical  
counts             5932689 -none-     numeric  
normalized         5932689 -none-     numeric  
genes                12283 -none-     character
dataset_ids            483 -none-     character
dataset.labels         483 -none-     character
criteria               500 -none-     logical  
confounders              2 data.frame list     
nsamples                 1 -none-     numeric  
HVG                    806 -none-     character

> LCMV1 <- read.data(LCMV1, subsample = 500, rerun = T)
> summary(LCMV1)
                   Length  Class      Mode     
datasets                 1 -none-     character
origins             233289 -none-     character
experiments         233289 -none-     character
labels                   1 -none-     function 
data.path                1 -none-     character
project                  1 -none-     character
work.path                1 -none-     character
baseline.work.path       1 -none-     character
res.data.path            1 -none-     character
baseline.data.path       1 -none-     character
marker.genes            67 -none-     character
genes.filter         27998 -none-     logical  
counts             5932689 -none-     numeric  
normalized         5932689 -none-     numeric  
genes                12283 -none-     character
dataset_ids            483 -none-     character
dataset.labels      233289 -none-     character
criteria               500 -none-     logical  
confounders              2 data.frame list     
nsamples                 1 -none-     numeric  
HVG                    806 -none-     character

Notice how the length of origins, experiments and dataset.labels are inconsistent across the two run.

Any suggestion on how this should be resolved? The easiest fix is to just disable rerun since we are not going to use this data structure for the next update anyway.

asmagen commented 5 years ago

Good catch, thanks for letting me know. Since we need to retain compatibility with earlier versions after the SCE update, I would suggest to set a couple of additional variables (baseline.origins, baseline.experiments) to which the initialize.project is storing origins and experiments. Then read.data reads from baseline.origins, baseline.experiments and saves to origins, experiments. Basically it prevents reading and writing to the same variables, which is what is causing the problem. Lmk if that makes sense.

asmagen commented 5 years ago

Found another issue with read.data - genes.filter used to filter genes from the expression matrix is used incorrectly: counts = counts[genes.filter,] normalized = normalized[genes.filter,] where the length of the genes.filter vector is inconsistent with the number of rows in the expression matrix in some instances while the right way to do it is: counts = counts[names(genes.filter)[genes.filter==T],] normalized = normalized[names(genes.filter)[genes.filter==T],] I am going to write an update shortly but you @Mamie should take this into account in the SCE code.