Question about running the scDECAF

NianzhenGu commented 10 months ago

Hi! I'm working on a single-cell RNA project that compares single-cell transcriptomic data of embryonic and adult mouse colons to identify embryonic-specific gene signatures and use these genes to score colon cancer single-cell data. I find the scDECAF algorithm is suitable for this project.

I have some problems understanding the inputs to the algorithm in the Quick Start part.

For the variable x, can I put the SingleCellExperiment object?
I don't know the meaning of geneset and the HM_geneset. Does the HM_geneset represent the human geneset that can be downloaded online? How about the mouse geneset?

What I have now is the gene signature, a vector of gene ids, like "ENSMUSG00000031957" "ENSMUSG00000069893" "ENSMUSG00000055827"...; the data I want to score: a SingleCellExperiment object; a list of highly variable genes (hvg).

Very much appreciate it if you could give me some instructions! Thanks!

soroorh commented 10 months ago

Hi! Thank you for using the issue tracker!

We only support standard matrices at the moment, so you'd have to supply the matrix of normalised expression values of the HVGs.
HM_geneset, or the input to genesetlist in pruneGenesets() in general, is a list of gene sets, i.e. each element of the list is a vector of genes (names or IDs, depending on the row names of x). This is the set of all possible gene sets one is interested in, and it must consist of more than one gene set (say, the C2 collection from MSigDB). The name was chosen based on our application (we actually meant Hallmark gene sets) and does not imply a species. I encourage you to look up the documentation for any function of interest by installing the package and running ?functionname.
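
For the first point, here is a minimal sketch of preparing a plain matrix restricted to the HVGs. It uses a toy matrix in place of real data, and the names `expr`, `hvg`, `hvg_present` and `x` are illustrative only. With a SingleCellExperiment, the dense matrix would typically come from something like `as.matrix(logcounts(sce))`, assuming a logcounts assay has been computed.

```r
# Toy stand-in for a log-normalised expression matrix (genes x cells).
# With a SingleCellExperiment, as.matrix(logcounts(sce)) would typically
# play this role (assumption: a "logcounts" assay exists).
set.seed(1)
expr <- matrix(rexp(6 * 5), nrow = 6, ncol = 5,
               dimnames = list(paste0("gene", 1:6), paste0("cell", 1:5)))

# Illustrative HVG vector; keep only HVGs that are present as row names.
hvg <- c("gene2", "gene4", "gene5", "gene_missing")
hvg_present <- intersect(hvg, rownames(expr))

# Standard (dense) matrix of normalised expression for the HVGs only.
x <- expr[hvg_present, , drop = FALSE]
stopifnot(is.matrix(x), !is.null(rownames(x)), !is.null(colnames(x)))
dim(x)
```

Using `intersect()` before subsetting avoids a "subscript out of bounds" error when some HVGs are missing from the matrix.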

The vector space representation computed by scDECAF requires more than one gene signature. So I'd recommend you consider adding other gene sets to run the model. For your project, for example, the differentially expressed genes in differentially abundant neighbourhoods, which you can get from miloDE, will provide you with a sufficient number of gene sets to use as input to scDECAF.

Link to miloDE: https://github.com/MarioniLab/miloDE

Hope this helps.

NianzhenGu commented 10 months ago

So, for example, if I have two gene signatures, s1 = [a, b, c] and s2 = [d, e, f], I can create a geneset like [s1, s2]. Then I run pruneGenesets() and genesets2ids() before scDECAF(), right?

soroorh commented 10 months ago

So, the genesetlist has to be a named list. I suggest:

gslist = list()
gslist[['gs1']] <- s1
gslist[['gs2']] <- s2

As I mentioned, due to the nature of the model we generally need more than 2 gene sets. You got the order of running the functions right, but if your full gene set list has fewer than 10 gene sets, pruning via pruneGenesets() might not be required. That is why I suggested obtaining additional gene sets from a miloDE analysis, for example.
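
As a rough sketch of the named-list idea with more than two gene sets, the snippet below builds the list in one call and counts how many genes of each set appear in the data. The gene names, `s3`, `genes_in_data` and `n_found` are made up for illustration; real signatures would use the same ID type as the row names of the expression matrix.

```r
# Hypothetical signatures; real ones would use the same ID type as the
# row names of the expression matrix (e.g. Ensembl IDs).
s1 <- c("geneA", "geneB", "geneC")
s2 <- c("geneD", "geneE", "geneF")
s3 <- c("geneB", "geneE", "geneG")

# Named list: each element is one gene set, each name is its label.
gslist <- list(gs1 = s1, gs2 = s2, gs3 = s3)

# Row names of a toy expression matrix (only the names are needed here).
genes_in_data <- c("geneA", "geneB", "geneD", "geneE", "geneG", "geneH")

# Number of genes from each set that are found in the data.
n_found <- vapply(gslist, function(gs) sum(gs %in% genes_in_data), integer(1))
n_found
```

A gene set with zero or very few genes found in the data usually points to an ID mismatch (for example, symbols versus Ensembl IDs).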

I suggest you also check out our tutorials from the reproducibility repo:
https://github.com/DavisLaboratory/scDECAF-reproducibility/blob/master/kang_pbmc/kang_pbmc.ipynb
https://github.com/DavisLaboratory/scDECAF-reproducibility/blob/master/cite_pbmc/TotalVI_scDECAF_analysis-addMilo.ipynb

Hope this helps

NianzhenGu commented 10 months ago

Great! Thanks for your suggestion! I will try it in the coming days.

NianzhenGu commented 10 months ago

Hi! I still have a problem. The picture shows my command for running scDECAF. I'm not sure what I should use for the embedding.

The rest of the data: merged_counts is my original data, where rows are genes and columns are samples. The dim of this data is 25904 x 4.

target is the result obtained from genesets2ids() where rows are genes and columns are genesets. The dim of this data is 106 x 8.

hvg_union is the list of hvg, the length is 8698.

For the embedding = reducedDims(tumor_sce)[["UMAP"]], the tumor_sce is the SingleCellExperiment object. I'm not sure whether my inputs are correct for the data I showed above.

Appreciate it if you could find the problem! Thanks!

soroorh commented 10 months ago

Hi. So the error is suggesting that dim(embedding) != dim(merged_counts). Can you please verify that? Also, I suggest you use log-normalised gene expression rather than raw counts, i.e. scTransformed data.

For the embedding, you can use UMAP as you're doing here, but you can also consider any other embedding (PCA, PHATE, etc., with > 2 dimensions).

Hope this helps!

NianzhenGu commented 10 months ago

The merged_counts is log normalized. I tried the PCA but still got the same error:

soroorh commented 10 months ago

OK, thanks. Are the row names set for the embedding matrix? scDECAF at some point matches column names in merged_counts with row names in the embedding. Hopefully that fixes it?

NianzhenGu commented 10 months ago

Sorry, I'm not sure what you mean. The row names of the embedding are the gene names and the column names are PC1, PC2, PC3, PC4. The row names of merged_counts are the gene names in the same order, and the column names are the four sample names.

soroorh commented 10 months ago

Ah, then I see what's going wrong. The embedding is a cell embedding, i.e. it has dims n_cells x n_D, where n_D is the number of dimensions in the dimension reduction space, whereas you are providing a gene embedding. Your initial code was correct because you had reducedDims(tumor_sce)[["UMAP"]]. Please just check the row names there and verify nrow(reducedDims(tumor_sce)[["UMAP"]]) == ncol(merged_counts).

NianzhenGu commented 10 months ago

So merged_counts should be n_genes x n_cells and the embedding should be n_cells x n_D. But then dim(merged_counts) will not be equal to dim(embedding)? Also, the column names of merged_counts should be the same as the row names of the embedding, which are the cell names, right?

soroorh commented 10 months ago

Correct. dim(embedding) != dim(merged_counts) is always true; I actually meant the error is suggesting nrow(embedding) != ncol(merged_counts), or that row names are not set in the embedding. Apologies for the confusion.
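
The two conditions can be checked before calling scDECAF with a small sketch like the one below. `check_alignment()` is a hypothetical helper written for this example, not part of scDECAF, and the matrices are toy data.

```r
# Toy data: 5 genes x 4 cells, and a 4 cells x 2 dimensions embedding.
cells <- paste0("cell", 1:4)
merged_counts <- matrix(0, nrow = 5, ncol = 4,
                        dimnames = list(paste0("gene", 1:5), cells))
embedding <- matrix(0, nrow = 4, ncol = 2,
                    dimnames = list(cells, c("UMAP1", "UMAP2")))

# Hypothetical helper (not part of scDECAF): TRUE when the embedding has one
# named row per cell and those names match the expression column names.
check_alignment <- function(expr, emb) {
  !is.null(rownames(emb)) &&
    !is.null(colnames(expr)) &&
    nrow(emb) == ncol(expr) &&
    setequal(rownames(emb), colnames(expr))
}

# A cell embedding passes the check.
check_alignment(merged_counts, embedding)

# A gene embedding (rows are genes), as in the thread, does not.
gene_embedding <- matrix(0, nrow = 5, ncol = 2,
                         dimnames = list(rownames(merged_counts), c("PC1", "PC2")))
check_alignment(merged_counts, gene_embedding)
```

The helper compares names as sets rather than by position, since the comment above says scDECAF matches column names to row names.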

soroorh commented 10 months ago

Also, since you only have 8 gene sets, k should be < 8 (you have 10 now). I also updated the README with more specifications.
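
One simple way to guard against this is to derive k from the number of gene sets. This is only a sketch: `target` here is a toy matrix standing in for the output of genesets2ids() (rows are genes, columns are gene sets, as described earlier in the thread), and `k_requested` and `n_genesets` are illustrative names.

```r
# Toy gene-by-gene-set matrix standing in for the output of genesets2ids()
# (assumption: rows are genes, columns are gene sets).
target <- matrix(0, nrow = 106, ncol = 8,
                 dimnames = list(paste0("gene", 1:106), paste0("gs", 1:8)))

n_genesets <- ncol(target)

# Per the comment above, k has to be smaller than the number of gene sets,
# so cap the requested value at n_genesets - 1.
k_requested <- 10
k <- min(k_requested, n_genesets - 1)
stopifnot(k < n_genesets)
k
```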

NianzhenGu commented 10 months ago

Thanks! Will try it.

Jade0904 commented 10 months ago

Hi, I'm NianzhenGu's teammate. I still cannot run scDECAF successfully. Here's the error:

"merged_logcounts" was defined by "merged_logcounts <- logcounts(merged_sce)" and the logcounts assay was generated by "merged_sce <- logNormCounts(merged_sce)".

"target" was defined by "target <- genesets2ids(merged_logcounts, gene_signature)", where "gene_signature" was a list of geneset, as below:

hvg_union was a vector of highly variable genes we chose.

Reduced dimensions were generated by:
merged_sce <- runPCA(merged_sce)
merged_sce <- runUMAP(merged_sce, dimred = "PCA")
I tried both of them (UMAP and PCA) in scDECAF() but it threw the same error.

Do you have any idea about what could possibly be the problems? Thanks a lot :)

soroorh commented 10 months ago

Hey :). Since the data is log-transformed, please set standardize=FALSE, as per the example code in the README. Hope this helps.

Jade0904 commented 10 months ago

Yes, it worked! Thank you so much for your help!

soroorh commented 10 months ago

No worries. Please close the issue if this is done!
