Hello @corecontingency, I would like to help you in getting to the bottom of this. Thank you in advance for your detailed explanation and provided code. I would like to use your sample data so that I may try to reproduce your issue and investigate solutions and/or explanations; do you have a preferred sharing method?
Hi @mattobny. Thank you very much for your help! Do you have an email address I can share a link with? I would send the link in a private message, but I don't think GitHub has that functionality.
@corecontingency surely. My email is matthew.o'brien2@bmc.org
@corecontingency I received your email with the Giotto-related data, but I am missing Neuroblastoma_sc_Jansky.rds, which you use to create your sign matrix for runDWLSDeconv(). Are you able to share this with me as well?
@mattobny Apologies! I have uploaded the Neuroblastoma_sc_Jansky.rds file to the same Google Drive link that I shared before.
@corecontingency Thanks for sharing the .rds file. Unfortunately, UpdateSeuratObject() has continually thrown errors, so I have not been able to replicate this using your data. I've added these errors to the bottom of this response so that we may address the issue at hand first.
Nonetheless, I can begin to provide some insight. runDWLSDeconv() has an important argument, cluster_column, which takes "leiden_clus" by default. Essentially, this clustering information is used to find maxima within the expression data, which contributes to the deconvolution. Changing dimensions_to_use for runUMAP will change how doLeidenCluster() clusters your data, which will thus affect the deconvolution by default.
For instance, I created two Giotto objects from the Visium Integration tutorial. The first, N_pros, was processed the same way you processed your Giotto object, F4_normalized_standard_6000. The other, N_pros_more_dim, used an extra three dimensions when running the UMAP and NN, i.e. dimensions_to_use = 1:20. Below are the numbers of cells belonging to each cluster:
N_pros:
[1] 482 - [2] 414 - [3] 384 - [4] 348 - [5] 306 - [6] 237 - [7] 223 - [8] 184 - [9] 135 - [10] 121 - [11] 119
N_pros_more_dim:
[1] 504 - [2] 399 - [3] 379 - [4] 302 - [5] 294 - [6] 240 - [7] 234 - [8] 204 - [9] 136 - [10] 132 - [11] 129
Altering the dimensions_to_use in runUMAP will not only change the number of instances per cluster, but could also change the number of clusters detected by doLeidenCluster, as was the case with your data when I ran the code you provided.
Therefore, although you have provided an annotation matrix for cell typing, Giotto is still employing Leiden cluster information for the DWLS enrichment under the hood, which is likely why your plots appear so different from each other: the annotation and the cluster info are not directly related.
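To make the dependency chain concrete, here is a minimal sketch of the relevant Giotto calls (object names are placeholders rather than the ones from your script, and exact argument defaults may differ between Giotto versions):

```r
library(Giotto)

# visium_obj is a placeholder for an already-normalized spatial Giotto object
# (analogous to F4_normalized_standard_6000 in your script).
visium_obj <- runPCA(gobject = visium_obj)

# dimensions_to_use feeds both the UMAP and the nearest-neighbor network ...
visium_obj <- runUMAP(visium_obj, dimensions_to_use = 1:17)
visium_obj <- createNearestNetwork(visium_obj, dimensions_to_use = 1:17, k = 15)

# ... and the network determines the Leiden partition, stored in the cell
# metadata as "leiden_clus".
visium_obj <- doLeidenCluster(visium_obj, resolution = 0.4)

# runDWLSDeconv() reads that same metadata column by default, so changing
# dimensions_to_use changes the Leiden clusters, which changes the
# deconvolution, even though the normalized expression itself is untouched.
visium_obj <- runDWLSDeconv(
  gobject        = visium_obj,
  sign_matrix    = sign_matrix,   # built from the annotated single-cell reference
  cluster_column = "leiden_clus"  # the default
)
```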
I'm tagging @josschavezf and @gcyuan for alternative takes on this issue/explanation of function behavior.
Error upon jansky = UpdateSeuratObject(object = jansky)
Warning message:
Not a validObject(): no slot of name "images" for this object of class "Seurat"
Error in validityMethod(as(object, superClass)) :
object 'CsparseMatrix_validate' not found
@mattobny Thank you for the help.
Do you know if there are any resources or documentation that can explain in more depth what you mean by this sentence:
runDWLSDeconv() has an important argument, cluster_column, which takes "leiden_clus" by default. Essentially, this clustering information is used to find maxima within expression data, which contributes to the deconvolution.
I am unsure why leiden clustering is needed to find maxima within expression data, or how that contributes to the deconvolution. Copying from my first post with my previous understanding of how DWLS deconvolution works:
runDWLSDeconv takes as input a DWLS sign_matrix, imported from an annotated single-cell data experiment. According to my understanding, it then compares that to the normalized expression data from a spatial data experiment, and for every spatial spot, gives a likelihood score that elements of the annotated groups from the single-cell experiment are present within that spot.
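For completeness, here is a minimal sketch of how such a sign matrix is typically assembled in Giotto from an annotated single-cell object; the object, column, and variable names are placeholders, and column names in the marker output can differ between Giotto versions:

```r
library(Giotto)

# sc_obj: a normalized, annotated single-cell Giotto object (placeholder, e.g.
# built from Neuroblastoma_sc_Jansky.rds), with labels in a "cell_type" column.
markers <- findMarkers_one_vs_all(
  gobject           = sc_obj,
  method            = "scran",        # scran needs at least two distinct groups
  expression_values = "normalized",
  cluster_column    = "cell_type"
)
top_genes <- unique(markers$feats)    # assumes the marker table has a 'feats' column

# sc_expr: normalized single-cell expression matrix (genes x cells);
# cell_types: one label per cell -- both placeholders here.
sign_matrix <- makeSignMatrixDWLSfromMatrix(
  matrix           = sc_expr,
  cell_type_vector = cell_types,
  sign_gene        = top_genes
)

# This sign_matrix is what runDWLSDeconv() compares against the normalized
# spatial expression for every spot.
```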
Regarding running the example code (not sure if that is needed or important anymore, as it was just an example to illustrate my question): I think that as long as you are running the newest version of Seurat (which is 4.2.0), you should be okay to just delete or comment out the offending line:
jansky = UpdateSeuratObject(object = jansky)
@corecontingency The source code for runDWLSDeconv(), the internal function enrich_deconvolution(), and the internal function spot_deconvolution() will further explain my point. You are on the right track; I was just describing a small part of what goes on under the hood when generating this output.
cluster_column is first used within runDWLSDeconv() to select a column from the cell metadata (line 1898), which will selectively index into an enrichment matrix in enrich_deconvolution() (line 1909). This enrichment matrix is created using the sign matrix and the normalized expression data; we determine values for this matrix iteratively, using the cluster_column. Below is the source code for enrich_deconvolution for intuition; I was referring to line 1462 in my initial response concerning maxima.
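As a toy sketch of the idea (made-up data and a simplified score, not the actual enrich_deconvolution code): per-cell-type enrichment scores are computed for every spot, and the cluster labels are then used to take cluster-wise maxima.

```r
# Toy illustration only -- not the Giotto source.
set.seed(1)
expr        <- matrix(rpois(200, 5), nrow = 20,
                      dimnames = list(paste0("gene", 1:20), paste0("spot", 1:10)))
sign_matrix <- matrix(runif(20 * 3), nrow = 20,
                      dimnames = list(rownames(expr), c("typeA", "typeB", "typeC")))
clusters    <- rep(c("1", "2"), each = 5)   # stand-in for the leiden_clus column

# Enrichment: for each cell type, a weighted mean of its signature genes per spot.
enrich <- t(sign_matrix) %*% expr / colSums(sign_matrix)   # cell types x spots

# For each cluster, the maximum enrichment each cell type reaches within that
# cluster -- the "maxima" referred to above.
cluster_max <- sapply(unique(clusters), function(cl) {
  apply(enrich[, clusters == cl, drop = FALSE], 1, max)
})
cluster_max   # cell types x clusters
```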
One might think of this step as running an enrichment analysis on the spots using the clustering information. The output matrix from enrich_deconvolution() is then binarized based on the spatial resolution (lines 1916-1917) and utilized in spot_deconvolution(). Again, the cluster_column is crucial because it allows the final matrix to be filled out iteratively and keeps comparisons consistent. Using the cluster_column within spot_deconvolution(), maxima and their respective row names (i.e., which features are enriched enough to pass the spatial-resolution "cutoff") are found within the binarized matrix iteratively. These row names are then further processed to determine a proportion of genes which encompass a spot, thus completing the deconvolution.
Below is the source code for spot_deconvolution. You will notice that it is rather similar to enrich_deconvolution, but uses the binarized matrix instead.
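Continuing the toy example above (again a simplified stand-in, not the actual spot_deconvolution code): the enrichment matrix is binarized with a cutoff, and for each spot only the cell types that pass the cutoff within that spot's cluster are kept when estimating proportions.

```r
# Toy illustration only -- not the Giotto source.
cutoff    <- median(enrich)          # stand-in for the spatial-resolution cutoff
binarized <- (enrich > cutoff) * 1   # cell types x spots, 0/1

proportions <- sapply(colnames(expr), function(spot) {
  cl   <- clusters[colnames(expr) == spot]
  # cell types whose enrichment passes the cutoff somewhere in this spot's cluster
  keep <- rownames(binarized)[rowSums(binarized[, clusters == cl, drop = FALSE]) > 0]
  out  <- setNames(numeric(nrow(enrich)), rownames(enrich))
  if (length(keep) > 0) {
    # crude stand-in for the actual DWLS fit (a constrained least-squares solve)
    fit <- pmax(coef(lm(expr[, spot] ~ sign_matrix[, keep, drop = FALSE] + 0)), 0)
    if (sum(fit) > 0) out[keep] <- fit / sum(fit)   # normalize to proportions
  }
  out
})
round(t(proportions), 2)   # spots x cell types
```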
In short, the cluster_column is necessary because we must assume the cells are related in some manner prior to running the deconvolution, so that we have a consistent basis for enrichment comparisons. This is necessary for the identification of markers as well (scran requires at least two distinct groups). Returning to the basis of this issue, it makes sense that your results appear different from each other, as the different number of Leiden clusters (due to varying dimensions_to_use) has changed the comparisons happening in these functions, as previously described. So, editing your restatement:
runDWLSDeconv takes as input a DWLS sign_matrix, created from an annotated single-cell data experiment via findMarkers_one_vs_all() and makeSignMatrixDWLSfromMatrix(). It then compares that to the normalized expression data from a spatial data experiment and, for every spatial spot, gives a likelihood score that elements of the annotated groups from the single-cell experiment are present within that spot. That score is cluster_column dependent, since cell relationships can only be inferred prior to the deconvolution based on the cluster_column.
Let me know if there is anything else I can clear up!
@corecontingency Please let me know if this issue is still developing, or if the behavior of this function now makes sense. In the case of the latter, I'll close this issue.
runDWLSDeconv takes as input a DWLS sign_matrix, imported from an annotated single-cell data experiment. According to my understanding, it then compares that to the normalized expression data from a spatial data experiment, and for every spatial spot, gives a likelihood score that elements of the annotated groups from the single-cell experiment are present within that spot.
Since the runUMAP and createNearestNetwork commands (run on the spatial dataset) do not affect the normalized data stored in [["cell"]][["rna"]][["normalized"]] (and I have verified this is true), how can changing the dimensions_to_use alter the output of runDWLSDeconv?
At the end of this post is a minimal reproducible code example to replicate this problem. Post a reply and let me know, and I will share the sample data to run the code. Change the data directories in the code below to point to the downloaded data, update the python path, and the bug should be reproducible.
Compare the outputs of the code example with these two lines:
changed to these two lines:
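(Presumably the change was along these lines; the 1:17 vs 1:20 ranges are my assumption based on the "17 to 20" mentioned below, and the object name is the one used elsewhere in this thread.)

```r
# Presumed original lines:
F4_normalized_standard_6000 <- runUMAP(F4_normalized_standard_6000, dimensions_to_use = 1:17)
F4_normalized_standard_6000 <- createNearestNetwork(F4_normalized_standard_6000, dimensions_to_use = 1:17)

# Presumed modified lines:
F4_normalized_standard_6000 <- runUMAP(F4_normalized_standard_6000, dimensions_to_use = 1:20)
F4_normalized_standard_6000 <- createNearestNetwork(F4_normalized_standard_6000, dimensions_to_use = 1:20)
```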
The results of both code runs are below. As you can see, the results are significantly different, which is very unexpected to me from only changing the dimensions_to_use from 17 to 20. There should be no difference at all. What is causing this difference? Is there a bug in the runDWLSDeconv function?
Full code example:
Thank you for the help.