Open asmagen opened 5 years ago
Here is a reproducible example (take any four datasets you have localy but don’t change the origins and experiments labels):
pooled_env <- initialize.project(datasets = c("LCMV1", "LCMV2","LCMV3", "LCMV4"), origins = c("LCMV WT", "LCMV WT", "LCMV KO", "LCMV KO"), experiments = c("Rep1", "Rep2","Rep1", "Rep2"), data.path = file.path(tempdir(), "LCMV"), work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))
pooled_env <- read.data(pooled_env)
table(combined$dataset.labels) NA (NA) LCMV KO (Replicate 2) LCMV WT (Replicate 1) 13242 8022 5368
@asmagen The problem after removing the unique command is that rerunning cause the length of origins, experiments, and dataset.labels are modified again, which causes error in the subsequent workflow. See the following output.
> library(robustSingleCell)
> download_LCMV()
> LCMV1 <- initialize.project(datasets = "LCMV1",
origins = "CD44+ cells",
experiments = "Rep1",
data.path = file.path(tempdir(), "LCMV"),
work.path = file.path(tempdir(), "LCMV/LCMV_analysis"))
> LCMV1 <- read.data(LCMV1, subsample = 500)
> LCMV1 <- get.variable.genes(LCMV1)
> summary(LCMV1)
Length Class Mode
datasets 1 -none- character
origins 483 -none- character
experiments 483 -none- character
labels 1 -none- function
data.path 1 -none- character
project 1 -none- character
work.path 1 -none- character
baseline.work.path 1 -none- character
res.data.path 1 -none- character
baseline.data.path 1 -none- character
marker.genes 67 -none- character
genes.filter 27998 -none- logical
counts 5932689 -none- numeric
normalized 5932689 -none- numeric
genes 12283 -none- character
dataset_ids 483 -none- character
dataset.labels 483 -none- character
criteria 500 -none- logical
confounders 2 data.frame list
nsamples 1 -none- numeric
HVG 806 -none- character
> LCMV1 <- read.data(LCMV1, subsample = 500, rerun = T)
> summary(LCMV1)
Length Class Mode
datasets 1 -none- character
origins 233289 -none- character
experiments 233289 -none- character
labels 1 -none- function
data.path 1 -none- character
project 1 -none- character
work.path 1 -none- character
baseline.work.path 1 -none- character
res.data.path 1 -none- character
baseline.data.path 1 -none- character
marker.genes 67 -none- character
genes.filter 27998 -none- logical
counts 5932689 -none- numeric
normalized 5932689 -none- numeric
genes 12283 -none- character
dataset_ids 483 -none- character
dataset.labels 233289 -none- character
criteria 500 -none- logical
confounders 2 data.frame list
nsamples 1 -none- numeric
HVG 806 -none- character
Notice how the length of origins, experiments and dataset.labels are inconsistent across the two run.
Any suggestion on how this should be resolved? The easiest fix is to just disable rerun since we are not going to use this data structure for the next update anyway.
Good catch, thanks for letting me know. Since we need to retain compatibility with earlier versions after the SCE update, I would suggest to set a couple of additional variables (baseline.origins, baseline.experiments) to which the initialize.project is storing origins and experiments. Then read.data reads from baseline.origins, baseline.experiments and saves to origins, experiments. Basically it prevents reading and writing to the same variables, which is what is causing the problem. Lmk if that makes sense.
Found another issue with read.data - genes.filter used to filter genes from the expression matrix is used incorrectly: counts = counts[genes.filter,] normalized = normalized[genes.filter,] where the length of the genes.filter vector is inconsistent with the number of rows in the expression matrix in some instances while the right way to do it is: counts = counts[names(genes.filter)[genes.filter==T],] normalized = normalized[names(genes.filter)[genes.filter==T],] I am going to write an update shortly but you @Mamie should take this into account in the SCE code.
Similarly to read.preclustered.datasets, the
unique
commands needs to be removed from all the following code (taken from read.data function). We have already don't that for read.preclustered.datasets previously. The structure ofinitialize.project
is that each dataset has a value oforigins
andexperiments
which can be duplicated. Applying unique onto these vectors is problematic.See example of how I run initialize.project: