Closed NianzhenGu closed 10 months ago
Hi! Thank you for using the issue tracker!
`HM_geneset`, or the input to `genesetlist` in general in `pruneGenesets()`, is a list of gene sets, i.e. each element of the list is a list of genes (names or IDs, depending on the row names of `x`). This is the set of all possible gene sets one is interested in, and it must consist of more than one gene set (say, the C2 collection from MSigDB). The name was chosen based on our application (we actually meant Hallmark gene sets), and it does not imply the species. I encourage you to look up the documentation for any of the functions of interest by installing the package and running `?functionname`.
The vector space representation computed by scDECAF requires more than one gene signature, so I'd recommend adding other gene sets before running the model. For your project, for example, the differentially expressed genes in differentially abundant neighbourhoods, which you can get from miloDE, will give you a sufficient number of gene sets to use as input to scDECAF.
Link to miloDE: https://github.com/MarioniLab/miloDE
Hope this helps.
So, for example, if I have two gene signatures, s1 = [a, b, c] and s2 = [d, e, f], I can create a gene set list like [s1, s2]. Then I run `pruneGenesets()` and `genesets2ids()` before `scDECAF()`, right?
So, the `genesetlist` has to be a named list. I suggest:
gslist = list()
gslist[['gs1']] <- s1
gslist[['gs2']] <- s2
As I mentioned, due to the nature of the model we generally need more than 2 gene sets. You got the order of running the functions right, but if your full gene set list has fewer than 10 gene sets, pruning via `pruneGenesets()` might not be required. That is why I suggested obtaining additional gene sets from a miloDE analysis, for example.
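To make the expected shape concrete, here is a minimal base R sketch of a named gene set list, plus a quick check of how many genes in each set appear among the row names of the expression matrix. The gene IDs and the matrix `x` are made up for illustration, and the overlap check is only a sanity check I am assuming is useful, not something scDECAF requires.

```r
# Toy signatures; the gene IDs below are made up for illustration.
s1 <- c("geneA", "geneB", "geneC")
s2 <- c("geneD", "geneE", "geneF")

# A named list: names are gene set labels, elements are character vectors.
gslist <- list(gs1 = s1, gs2 = s2)

# Hypothetical expression matrix (genes x cells), used only for its row names.
x <- matrix(0, nrow = 5, ncol = 3,
            dimnames = list(c("geneA", "geneB", "geneC", "geneD", "geneX"),
                            paste0("cell", 1:3)))

# Count how many genes of each set are present among the row names of x.
overlap <- vapply(gslist, function(gs) sum(gs %in% rownames(x)), integer(1))
print(overlap)
```

A set with very few matching genes usually points to an identifier mismatch (for example gene symbols in the list versus Ensembl IDs in the matrix).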
I suggest you also check out our tutorials in the reproducibility repo:
https://github.com/DavisLaboratory/scDECAF-reproducibility/blob/master/kang_pbmc/kang_pbmc.ipynb
https://github.com/DavisLaboratory/scDECAF-reproducibility/blob/master/cite_pbmc/TotalVI_scDECAF_analysis-addMilo.ipynb
Hope this helps
Great! Thanks for your suggestion! I will try it in the coming days.
Hi! I still have a problem. The picture shows my command for running `scDECAF`. I'm not sure what I should use for the `embedding`.
The rest of the data:
`merged_counts` is my original data, with genes as rows and samples as columns. Its dimensions are 25904 x 4.
`target` is the result obtained from `genesets2ids()`, with genes as rows and gene sets as columns. Its dimensions are 106 x 8.
`hvg_union` is the list of HVGs; its length is 8698.
For `embedding = reducedDims(tumor_sce)[["UMAP"]]`, `tumor_sce` is the SingleCellExperiment object. I'm not sure whether my inputs are correct for the data I showed above.
Appreciate it if you could find the problem! Thanks!
Hi. So the error is suggesting that `dim(embedding) != dim(merged_counts)`. Can you please verify that? Also, I suggest you use log-normalised gene expression rather than raw counts, i.e. scTransformed data.
For the embedding, you can use UMAP as you're doing here, but you can also consider any other embedding (PCA, PHATE, etc. with > 2 dimensions).
Hope this helps!
The `merged_counts` is log-normalized. I tried PCA but still got the same error:
OK, thanks. Are the row names set for the embedding matrix? scDECAF at some point matches the column names in `merged_counts` with the row names in the embedding. Hopefully that fixes it?
Sorry, I'm not sure what you mean. The row names of the embedding are the gene names, and the column names are PC1, PC2, PC3, PC4. The row names of `merged_counts` are the gene names in the same order, and the column names are the four sample names.
Ah, then I see what's going wrong. The embedding is a cell embedding, i.e. it has dims n_cells x n_D, where D is the number of dimensions in the dimension reduction space, whereas you are providing a gene embedding. Your initial code was correct because you had `reducedDims(tumor_sce)[["UMAP"]]`. Please just check the row names there and verify that `nrow(reducedDims(tumor_sce)[["UMAP"]]) == ncol(merged_counts)`.
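The check described above can be sketched in base R with toy matrices. The helper name `check_embedding()` and the toy data are hypothetical and only for illustration; they are not part of scDECAF.

```r
# Hypothetical helper: report whether an embedding is a cell embedding
# that lines up with a genes x cells expression matrix.
check_embedding <- function(expr, embedding) {
  if (is.null(rownames(embedding))) {
    return("embedding has no row names")
  }
  if (nrow(embedding) != ncol(expr)) {
    return("nrow(embedding) differs from ncol(expr)")
  }
  if (!setequal(rownames(embedding), colnames(expr))) {
    return("cell names differ between embedding and expr")
  }
  "ok"
}

# Toy expression matrix: 4 genes x 3 cells.
expr <- matrix(rnorm(12), nrow = 4,
               dimnames = list(paste0("g", 1:4), paste0("cell", 1:3)))

# A cell embedding (cells x dimensions) passes the check.
cell_emb <- matrix(rnorm(6), nrow = 3,
                   dimnames = list(paste0("cell", 1:3), c("UMAP1", "UMAP2")))

# A gene embedding (genes x dimensions) does not.
gene_emb <- matrix(rnorm(8), nrow = 4,
                   dimnames = list(paste0("g", 1:4), c("PC1", "PC2")))

print(check_embedding(expr, cell_emb))
print(check_embedding(expr, gene_emb))
```

With a real object, the same idea would apply to `reducedDims(tumor_sce)[["UMAP"]]` and `merged_counts`.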
So `merged_counts` should be n_genes x n_cells and the embedding should be n_cells x n_D. But then `dim(merged_counts)` will not be equal to `dim(embedding)`? Also, the column names of `merged_counts` should be the same as the row names of the embedding, which are the cell names, right?
Correct. `dim(embedding) != dim(merged_counts)` is always true; I actually meant that the error is suggesting `nrow(embedding) != ncol(merged_counts)`, or that row names are not set in `embedding`. Apologies for the confusion.
Also, since you only have 8 gene sets, `k` should be < 8 (you have 10 now). I also updated the README with more specifications.
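As a small illustration of that constraint, one way to keep `k` below the number of gene sets is to cap it before the call. This is only a sketch: the `target` matrix is a zero-filled stand-in with the 106 x 8 shape reported in this thread, `k_requested` is a hypothetical variable name, and the README remains the reference for what `k` means.

```r
# Stand-in for the genes x gene sets matrix returned by genesets2ids()
# (106 x 8 in this thread); filled with zeros purely for illustration.
target <- matrix(0, nrow = 106, ncol = 8)

k_requested <- 10          # the value currently being passed
n_sets <- ncol(target)     # number of gene sets

# Cap k so that it stays strictly below the number of gene sets.
k <- min(k_requested, n_sets - 1)
print(k)
```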
Thanks! Will try it.
Hi, I'm NianzhenGu's teammate. I still cannot run scDECAF successfully. Here's the error:
"merged_logcounts" was defined by "merged_logcounts <- logcounts(merged_sce)" and the logcounts assay was generated by "merged_sce <- logNormCounts(merged_sce)".
"target" was defined by "target <- genesets2ids(merged_logcounts, gene_signature)", where "gene_signature" was a list of gene sets, as below:
hvg_union was a vector of highly variable genes we chose.
Reduced dimensions were generated by:
merged_sce <- runPCA(merged_sce)
merged_sce <- runUMAP(merged_sce, dimred = "PCA")
I tried both of them (UMAP and PCA) in scDECAF(), but it threw the same error.
Do you have any idea about what could possibly be the problems? Thanks a lot :)
Hey :). Since the data is log-transformed, please set `standardize=FALSE`, as per the example code in the README. Hope this helps.
Yes, it worked! Thank you so much for your help!
No worries. Please close the issue if this is done!
Hi! I'm working on a single-cell RNA project that compares single-cell transcriptomic data of embryonic and adult mouse colons to identify embryonic-specific gene signatures and use these genes to score colon cancer single-cell data. I find the scDECAF algorithm is suitable for this project.
I have some problems understanding the inputs to the algorithm in the Quick Start part.
`geneset` and the `HM_geneset`. Does the `HM_geneset` represent the human gene set that can be downloaded online? How about the mouse gene set?
What I have now is: the gene signature, a vector of gene IDs like "ENSMUSG00000031957" "ENSMUSG00000069893" "ENSMUSG00000055827"...; the data I want to score, a SingleCellExperiment object; and a list of highly variable genes (HVGs).
Very much appreciate it if you could give me some instructions! Thanks!