cole-trapnell-lab / garnett

Automated cell type classification
MIT License

Put non-count data in CDS objects #7

Open LuckyMD opened 5 years ago

LuckyMD commented 5 years ago

Hi,

I'm afraid this is more of a question than a bug report. Thus, I am deviating from the template.

I haven't been able to find out how to load pre-processed data into cds objects without getting errors in downstream analysis.

I have patient data integrated with MNN, on which I would like to run garnett. I have run this pre-processing in scanpy, and so am transferring data from python into R to use garnett. Following the monocle and garnett tutorials, I generate the CDS object with:

pd <- AnnotatedDataFrame(data = obs_mon)
fd <- AnnotatedDataFrame(data = var_mon)
colnames(data_mat_mon) <- rownames(pd)
rownames(data_mat_mon) <- rownames(fd)
npcs_cds <- newCellDataSet(cellData=data_mat_mon, phenoData=pd, featureData=fd)

Where obs_mon and var_mon are pandas DataFrames converted to R data frames, and data_mat_mon is a NumPy array converted to an R data frame.

As the data is pre-processed, I immediately load my marker gene file and check the markers, and then get the following error: Error: Must run estimateSizeFactors() on cds before calling check_markers

After running npcs_cds <- estimateSizeFactors(npcs_cds), I get the following error:

Error in quantile.default(newX[, i], ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE
Calls: <Anonymous> ... aggregate_positive_markers -> apply -> FUN -> quantile.default

I assume these errors have to do with certain slots in the cds object not being filled (e.g., reduced dimensions, size factors, dispersions, etc.).

It would be great if you could tell me which slots are missing and how I can assign data to these slots. As I have processed the data in scanpy, I should be able to easily transfer this across.

hpliner commented 5 years ago

I am actually guessing that this is because of NAs in your expression matrix rather than a slot issue. Can you check that sum(is.na(exprs(npcs_cds))) is equal to zero?
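For completeness, a slightly broader sanity check might look like the sketch below (assuming `npcs_cds` is the CellDataSet built above and Biobase is loaded). Note that `is.na()` in R already catches NaN entries, but not Inf values, which would also break the `quantile()` call in the traceback:

```r
# Pull the expression matrix out of the CellDataSet
expr_mat <- exprs(npcs_cds)

# is.na() counts both NA and NaN entries
sum(is.na(expr_mat))

# Inf values are not caught by is.na() but would also break quantile()
sum(is.infinite(expr_mat))
```

If both sums are zero, the matrix itself is clean and the problem lies elsewhere.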

LuckyMD commented 5 years ago

Thanks for the quick reply. I just checked and it is zero.

I've run quite a bit of analysis on this data already in scanpy, so I'm pretty sure it's not a data issue. Of course, something could have gone wrong during the conversion to R via anndata2ri, but that has worked with previous datasets.

I've previously tried using pre-processed data with monocle and ran into the same issues unfortunately.

Have you run garnett without doing the pre-processing in a cds object?

LuckyMD commented 5 years ago

Hi again. I've gotten garnett to work with count data input. But I was wondering if you could let me know how best to use pre-processed data. I have a strong batch effect in my data and I'm pretty sure the classifier would be better trained without this effect in the data.

Thanks in advance for your help

LuckyMD commented 5 years ago

I have found some instructions on how to load pre-processed data into a cds object here. The same page says the following:

estimateSizeFactors() and estimateDispersions() will only work, and are only needed, if you are working with a CellDataSet with a negbinomial() or negbinomial.size() expression family.

However, when I set expressionFamily=gaussianff() in my call to newCellDataSet(), check_markers() still throws the following error:

Error: Must run estimateSizeFactors() on cds before calling check_markers

Is this intentional? Is garnett currently not compatible with gaussianff()?

Here is the code I'm running:

#Set up the CellDataSet data structure
pd <- AnnotatedDataFrame(data = obs_mon)
fd <- AnnotatedDataFrame(data = var_mon)
colnames(data_mat_mon) <- rownames(pd)
rownames(data_mat_mon) <- rownames(fd)
npcs_cds_norm <- newCellDataSet(cellData=data_mat_mon, phenoData=pd, featureData=fd, expressionFamily=gaussianff())

marker_file_path <- "garnett_marker_genes.txt"

marker_check_norm <- check_markers(npcs_cds_norm, marker_file_path,
                                   db = org.Hs.eg.db,
                                   cds_gene_id_type = "SYMBOL",
                                   marker_file_gene_id_type = "SYMBOL")

Kind regards

hpliner commented 5 years ago

You should still run estimateSizeFactors(), which I think should still work.

I will note that Garnett has only been tested with count data... it has some internal normalization steps to help it be compatible across datasets that may hurt performance with a prenormalized dataset depending on how it was normalized. I would expect this mostly to be an issue when using pre-trained classifiers or when classifying other count data. But again, it’s gonna depend on your preprocessing steps I think.

Hopefully it will at least run once you have size factors (check that they’re not NAs - you can set NAs to 1 if you’re confident depth across cells is consistent).
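Putting the two suggestions together, a sketch of the workflow might look like this (assuming `npcs_cds` as built above, and assuming monocle's `Size_Factor` column name in pData):

```r
# Size factors are required by check_markers() even with gaussianff()
npcs_cds <- estimateSizeFactors(npcs_cds)

# Inspect the estimated size factors for NAs
sf <- pData(npcs_cds)$Size_Factor
summary(sf)

# If any are NA and sequencing depth is consistent across cells,
# fall back to a size factor of 1 for those cells
pData(npcs_cds)$Size_Factor <- ifelse(is.na(sf), 1, sf)
```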

Would be curious to hear how it works

LuckyMD commented 5 years ago

Thanks for the response!

If I run estimateSizeFactors() before, would Garnett then use the size factors for a further normalization step while training the classifier or classifying the cells? I would prefer avoiding a further normalization step if possible. I guess I could always set size factors to 1 to avoid this. Are the size factors just stored in pData(cds)$size_factors?

So far the classification unfortunately hasn't worked particularly well for my data with several different iterations of marker gene sets. One reason is probably due to a lack of pre-processing (apart from cell/gene QC and normalization via estimateSizeFactors()) when I run it directly from count data.

hpliner commented 5 years ago

I think it would be fine to set the size factors to 1. Yes, they are stored at pData(cds)$Size_Factor, so you can just set pData(cds)$Size_Factor <- 1
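In full, that would be (assuming `npcs_cds` as built earlier in the thread):

```r
# Override the estimated size factors; this recycles 1 across all cells
pData(npcs_cds)$Size_Factor <- 1

# Confirm every cell now has a size factor of exactly 1
stopifnot(all(pData(npcs_cds)$Size_Factor == 1))
```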

Curious if this helps your classification, let me know!

LuckyMD commented 5 years ago

I just tried what you suggested and ran estimateSizeFactors() prior to check_markers(). All the size factors are set to 1, but then check_markers() gives me the error:

Error in quantile.default(newX[, i], ...) : 
  missing values and NaN's not allowed if 'na.rm' is FALSE
Calls: <Anonymous> ... aggregate_positive_markers -> apply -> FUN -> quantile.default

any(is.na(pData(cds)$Size_Factors)), however, gives me 0. And none of the expression values are NA either (though they are very low). Annoyingly, I'm not sure the algorithm works with my MNN-corrected data.
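One thing worth double-checking in the snippet above: as noted earlier in the thread, monocle stores size factors under Size_Factor (singular). Indexing a data frame by a misspelled column name like Size_Factors silently returns NULL, and any(is.na(NULL)) is vacuously FALSE, so the check would pass even if the real column contained NAs. A hedged sanity check, assuming `cds` is the CellDataSet in question:

```r
# A misspelled pData column silently returns NULL, making the NA check vacuous
"Size_Factor" %in% colnames(pData(cds))  # should be TRUE

# The intended check, against the correctly spelled column
any(is.na(pData(cds)$Size_Factor))
```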