Issue using decontX - Githubissues

jmzvillarreal commented 4 years ago

Hi, I am using decontX to remove possible sources of contamination from a raw count matrix created with Read10X command in Seurat as follows:

WT36_decont <- decontX(counts = as.matrix(WT_36), z = NULL)

and I am getting this error:

Error in .colSumByGroup(counts, group = z, K = K) : INTEGER() can only be applied to a 'integer', not a 'double'

Any clue what might be wrong? Thanks in advance! Jaime.

joshua-d-campbell commented 4 years ago

Hi Jaime, Thanks for using the tool. The function expects a matrix that is storing integers. For now, you can use the code below:

mat <- as.matrix(WT_36)
storage.mode(mat) <- "integer"
WT36_decont <- decontX(counts = mat, z = NULL)

We will add a catch for this so it will be automatically done in the future.
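For readers who want to check the coercion before applying it, here is a minimal sketch in base R. The toy matrix stands in for the output of as.matrix(WT_36) and is an assumption, as is the whole-number guard, which is not part of the suggested workaround.

```r
# Toy stand-in for as.matrix(WT_36): whole numbers stored as doubles.
mat <- matrix(c(0, 3, 1, 0, 2, 5), nrow = 3,
              dimnames = list(c("geneA", "geneB", "geneC"),
                              c("cell1", "cell2")))
typeof(mat)

# Guard (an addition, not part of the workaround): coercing to integer
# drops fractional parts without a warning, so confirm every value is
# already a whole number first.
stopifnot(all(mat == round(mat)))

# Change only the storage type; dim and dimnames are kept.
storage.mode(mat) <- "integer"
typeof(mat)
```

The guard matters mainly if the matrix might hold normalized or otherwise non-integer values, in which case the coercion would silently change the data.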

jmzvillarreal commented 4 years ago

Hi Joshua, Thanks very much! It worked for one matrix of 6000 cells, but it took 21 hours. I am currently running it on more than 8000 cells and it is taking ages. Is that normal? Thanks in advance! Jaime.

joshua-d-campbell commented 4 years ago

Good question. We are working on ways to improve the speed, but that is a pretty long time. How many genes are you using? If you can filter the genes beforehand, that will greatly speed up the computation.

jmzvillarreal commented 4 years ago

I am using the whole raw count matrix: 31053 rows by 8177 columns. Should I use a normalized one with fewer genes? Thanks.

grasskind commented 4 years ago

I think decontX runs on raw counts (so no need to normalize), but it could be useful to remove lowly expressed genes before running it.

Irisapo commented 4 years ago

Hey @grasskind, Yes, indeed. decontX runs on raw counts. And it is useful to remove very lowly expressed genes that you would never care about before running it.
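One way to do that filtering on a raw count matrix, sketched in base R. The synthetic data and both thresholds (min_count, min_cells) are hypothetical and would need tuning for a real dataset; nothing in this thread prescribes specific cutoffs.

```r
set.seed(1)

# Synthetic raw counts: 200 genes (rows) x 50 cells (columns).
counts <- matrix(rpois(200 * 50, lambda = 0.3), nrow = 200,
                 dimnames = list(paste0("gene", 1:200),
                                 paste0("cell", 1:50)))

# Hypothetical thresholds, chosen only for illustration.
min_count <- 1L   # a gene counts as detected in a cell at >= this many reads
min_cells <- 10L  # keep genes detected in at least this many cells

# Number of cells in which each gene is detected.
n_cells_detected <- rowSums(counts >= min_count)

# Keep the raw counts, just with fewer rows.
keep <- n_cells_detected >= min_cells
filtered <- counts[keep, , drop = FALSE]

dim(counts)
dim(filtered)
```

The filtered matrix still holds raw integer counts, so it can go into decontX in place of the full matrix.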

jmzvillarreal commented 4 years ago

Hi all, It worked after filtering the lowly expressed genes! But the contamination seems to persist. I am working with pancreatic cells, so the main source of contamination is the overexpression of acinar enzymes. I see 'ectopic' expression of these genes in non-acinar clusters (i.e. immune cells). Thanks, Jaime.

Irisapo commented 4 years ago

Hey @jmzvillarreal,

Can you explain in more detail what you meant by "contamination seems to persist"?

jmzvillarreal commented 4 years ago

Hi Irisapo, I mean that once the downstream analysis is done (Seurat) and I check the expression of the top expressed acinar genes, I find that acinar clusters express those genes at the highest levels, but other clusters seem to express those very same genes at much lower levels (and they shouldn't, since they express, for instance, immune markers). Cheers, Jaime.

Irisapo commented 4 years ago

Hey @jmzvillarreal,

Thanks for the clarification. A few suggestions before I explain this situation. Use the raw count matrix for decontamination. Once you get the result, I am assuming you used the decontaminated counts for your downstream analysis. You can first round the decontaminated count matrix to integers, then start your downstream analysis using Seurat or whatever.

One thing to notice is that DecontX will remove most of the highly contaminated expression, but not all of it. This is both an advantage and a disadvantage: it is less likely to overcorrect, and since downstream analysis looks at the relative expression of genes, genes that still show some expression after decontamination are less likely to confound the main interest of your analysis.

Hope this makes sense.

Best, Shiyi
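The rounding step suggested above could look like the following base R sketch. The toy matrix stands in for the decontaminated counts and is an assumption; the real values would come from the decontX result.

```r
# Toy stand-in for the decontaminated (non-integer) count matrix.
est_native <- matrix(c(0.2, 3.7, 1.49, 0, 2.4, 5.9), nrow = 3,
                     dimnames = list(c("geneA", "geneB", "geneC"),
                                     c("cell1", "cell2")))

# Round to the nearest whole number, then store as integers.
rounded <- round(est_native)
storage.mode(rounded) <- "integer"

typeof(rounded)
rounded
```

Rounding first and coercing second avoids the truncation that a direct integer coercion would apply to values such as 3.7.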

joshua-d-campbell commented 4 years ago

Hi @jmzvillarreal, just to add to the suggestions already described. If you are using cell clusters from Seurat, you may want to use a lower resolution to obtain "broader" cell type clusters. For example, if there are several clusters of pancreatic cells, grouping them into a single cluster may help remove additional contamination. And as @Irisapo described, you will hopefully see a large reduction in the contamination, but it may not be completely removed. What is the median contamination level you are seeing for your dataset?
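As an alternative to re-clustering at a lower resolution, existing fine-grained clusters can be collapsed into broad cell types with a lookup vector. This is only a sketch: the cluster names and the mapping are hypothetical, and whether z should be integer or character in your installed version is worth checking in ?decontX.

```r
# Hypothetical fine-grained cluster labels, one per cell.
fine <- c("acinar_1", "acinar_2", "ductal", "Tcell", "Bcell", "acinar_1")
names(fine) <- paste0("cell", seq_along(fine))

# Hypothetical lookup from fine clusters to broad cell types.
broad_map <- c(acinar_1 = "acinar", acinar_2 = "acinar",
               ductal = "ductal", Tcell = "immune", Bcell = "immune")

# Every fine label must have a broad counterpart.
stopifnot(all(fine %in% names(broad_map)))

# Map each cell to its broad type, then to an integer label.
broad <- unname(broad_map[fine])
z <- as.integer(factor(broad))

table(fine, broad)

# The broad labels would then be passed as z (not run; requires celda):
# WT36_decont <- decontX(counts = mat, z = z)
```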

jmzvillarreal commented 4 years ago

Hi Shiyi and Joshua, I am using the raw counts as input for decontamination, and for the downstream Seurat analysis I am using the WT36_decont$resList$estNativeCounts data to generate the Seurat object. Is that correct? I am not including cell population labels (z = NULL). Would it help to include the cluster identities from the non-decontaminated analysis? The contamination of acinar genes in non-acinar clusters is approximately 30%. Thanks! Jaime.

Irisapo commented 4 years ago

Hi @jmzvillarreal

You were doing it right in terms of using DecontX. It is fine when you don't include cell population labels (z = NULL). But if you have cluster identity, you can include it as well, but it is better to use broad cell types as @joshua-d-campbell described.

Is there something that looks wrong to you about the average(?) contamination level of the acinar genes in non-acinar clusters being 30%?
Just in case, the value in res$resList$estConp is the contamination level for each cell across all genes.
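Because that value is per cell rather than per gene, one way to summarize it is a median overall and per cluster. In this sketch the contamination values and cluster labels are synthetic stand-ins; with a real result they would come from res$resList$estConp and your own cluster assignments.

```r
set.seed(42)

# Synthetic per-cell contamination fractions (stand-in for res$resList$estConp).
est_conp <- c(runif(20, min = 0.00, max = 0.10),
              runif(20, min = 0.20, max = 0.40))

# Synthetic cluster label for each cell, in the same order.
cluster <- rep(c("acinar", "immune"), each = 20)

# Median contamination over all cells.
overall <- median(est_conp)

# Median contamination within each cluster.
per_cluster <- tapply(est_conp, cluster, median)

overall
per_cluster
```

A per-cluster summary like this describes how contaminated each cell is overall; it does not say how much of a specific gene, such as an acinar enzyme, was attributed to contamination.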

joshua-d-campbell commented 4 years ago

We have released an updated version of DecontX which is much faster and resolves this issue. Reinstall the Celda package from Bioconductor or GitHub to get the newest version.

campbio / celda

Issue using decontX #217