drieslab / Giotto

Spatial omics analysis toolbox
https://drieslab.github.io/Giotto_website/
Other
267 stars 99 forks source link

Changing dimensions_to_use in runUMAP and createNearestNetwork alters output of runDWLSDeconv #483

Closed corecontingency closed 1 year ago

corecontingency commented 1 year ago

runDWLSDeconv takes as input a DWLS sign_matrix, imported from an annotated single-cell data experiment. According to my understanding, it then compares that to the normalized expression data from a spatial data experiment, and for every spatial spot, gives a likelihood score that elements of the annotated groups from the single-cell experiment are present within that spot.

Since the runUMAP and createNearestNetwork commands (run on the spatial dataset) do not affect the normalized data stored in [["cell"]][["rna"]][["normalized"]] (and I have verified this is true), how can changing the dimensions_to_use alter the output of runDWLSDeconv?

At the end of this post is a minimal reproducible code example to replicate this problem. Post a reply and let me know and I will share the sample data to run the code. Change the data directories in the code below to point to the downloaded data and update the python path and the bug should be reproducible.

Compare the outputs of the code example with these two lines:

F4_normalized_standard_6000 <- runUMAP(F4_normalized_standard_6000, dimensions_to_use = 1:17)
...
F4_normalized_standard_6000 <- createNearestNetwork(gobject = F4_normalized_standard_6000, k = 30, dimensions_to_use = 1:17)

changed to these two lines:

F4_normalized_standard_6000 <- runUMAP(F4_normalized_standard_6000, dimensions_to_use = 1:20)
...
F4_normalized_standard_6000 <- createNearestNetwork(gobject = F4_normalized_standard_6000, k = 30, dimensions_to_use = 1:20)

The results of both code runs is below. As you can see, the results are significantly different, which is very unexpected to me from only changing the dimensions_to_use from 17 to 20. There should be no difference at all. What is causing this difference? Is there a bug in the runDWLSDeconv function?

F4_jansky_spatPlot_DWLS_dim_1_17

F4_jansky_spatPlot_DWLS_dim_1_20

Full code example:

library(Giotto)
library(data.table)
library(magrittr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(tidyr)
library(copykat)
library(Seurat)

# set results directory
results_folder <- '/home/user/src/results_alpha'

# create instructions
instrs <- createGiottoInstructions(save_dir = results_folder,
                                   python_path = "/home/user/.local/share/r-miniconda/bin/python",
                                   show_plot = FALSE,
                                   return_plot = FALSE,
                                   save_plot = FALSE,
                                   plot_format = "png",
                                   dpi = 300)

# set data directories
data_path_F4_h5 = "/path/to/dir/F4/outs/raw_feature_bc_matrix.h5"

data_path_F4_tissue_positions = "/path/to/dir/F4/outs/spatial/tissue_positions_list.csv"

data_path_F4_image = "/path/to/dir/F4/outs/spatial/tissue_hires_image.png"

data_path_F4_scalefactors = "/path/to/dir/F4/outs/spatial/scalefactors_json.json"

# create Giotto objects
F4 <- createGiottoVisiumObject(h5_visium_path = data_path_F4_h5,
                h5_tissue_positions_path = data_path_F4_tissue_positions,
                h5_image_png_path = data_path_F4_image,
                h5_json_scalefactors_path = data_path_F4_scalefactors,
                               instructions = instrs)

# subset on spots that were covered by tissue
metadata <- pDataDT(F4)
in_tissue_barcodes <- metadata[in_tissue == 1]$cell_ID
F4 <- subsetGiotto(F4, cell_ids = in_tissue_barcodes)

# clear unused variables
rm(metadata, in_tissue_barcodes)

# filter
F4 <- filterGiotto(gobject = F4,
                  expression_threshold = 1,
                  feat_det_in_min_cells = 5,
                  min_det_feats_per_cell = 1000,
                  expression_values = c('raw'),
                  verbose = T)

F4 <- addStatistics(F4, expression_values = "raw")

## normalize with standard log protocol - scalefactor is set to 6000
# normalize
F4_normalized_standard_6000 <- normalizeGiotto(gobject = F4, norm_methods = "standard", scalefactor = 6000, verbose = T)

# add gene & cell statistics
F4_normalized_standard_6000 <- addStatistics(gobject = F4_normalized_standard_6000)

# calculate HVF for the standard_6000 normalized data
F4_normalized_standard_6000 <- calculateHVF(gobject = F4_normalized_standard_6000)

# run PCA on expression values for the standard_6000 normalized data
gene_metadata = fDataDT(F4_normalized_standard_6000)
featgenes = gene_metadata[hvf == 'yes']$feat_ID
F4_normalized_standard_6000 <- runPCA(gobject = F4_normalized_standard_6000, feats_to_use = featgenes)

# clear unused variables
rm(gene_metadata, featgenes)

## run UMAP
# run UMAP on expression values for the standard_6000 normalized data
F4_normalized_standard_6000 <- runUMAP(F4_normalized_standard_6000, dimensions_to_use = 1:17)

## make sNN networks
# sNN network for the standard_6000 normalized data
F4_normalized_standard_6000 <- createNearestNetwork(gobject = F4_normalized_standard_6000, k = 30, dimensions_to_use = 1:17)

## run leiden clustering
# run leiden clustering for the standard_6000 normalized data
F4_normalized_standard_6000 <- doLeidenCluster(gobject = F4_normalized_standard_6000, resolution = 1, n_iterations = 1000)

rds_file = "/path/to/dir/Neuroblastoma_sc_Jansky.rds"
jansky <- readRDS(rds_file)
rm(rds_file)

jansky = UpdateSeuratObject(object = jansky)

jansky <- subset(jansky, subset = nFeature_RNA > 500 & nCount_RNA > 1000 & percent.mt < 2.5)

jansky <- SCTransform(jansky, vars.to.regress = "percent.mt", verbose = TRUE)

jansky_sc_matrix <- jansky@assays[["SCT"]]@counts

featgenes = jansky@assays[["RNA"]]@counts@Dimnames[[1]]
jansky <- RunPCA(jansky, features = featgenes, npcs = 15)
jansky <- FindNeighbors(jansky, dims = 1:15)
jansky <- FindClusters(jansky, resolution = 0.5)

jansky_markers <- FindAllMarkers(
  object = jansky,
  assay = "SCT",
  test.use = "wilcox")

dt_jansky_markers <- data.table(jansky_markers)
top_markers <- dt_jansky_markers[, head(.SD, 10), by="cluster"]
jansky_sc_sign_gene <- top_markers$gene

jansky_sc_cell_type <- jansky$anno_new

DWLS_matrix<-makeSignMatrixDWLSfromMatrix(matrix = jansky_sc_matrix,
                                          cell_type = jansky_sc_cell_type,
                                          sign_gene = jansky_sc_sign_gene)

F4_normalized_standard_6000 = runDWLSDeconv(gobject = F4_normalized_standard_6000, sign_matrix = DWLS_matrix)

jansky_anno_options <- unique(as.vector(as.data.frame(jansky_sc_cell_type)[, ]))

spatCellPlot2D(gobject = F4_normalized_standard_6000,
               spat_enr_names = 'DWLS',
               cell_annotation_values = jansky_anno_options,
               cow_n_col = 2, coord_fix_ratio = 1, point_size = 1,
               save_plot = TRUE,
               save_param = list(save_folder = "DWLS_spatPlot", save_name = "F4_jansky_spatPlot_DWLS"))

Thank you for the help.

mattobny commented 1 year ago

Hello @corecontingency, I would like to help you in getting to the bottom of this. Thank you in advance for your detailed explanation and provided code. I would like to use your sample data so that I may try to reproduce your issue and investigate solutions and/or explanations; do you have a preferred sharing method?

corecontingency commented 1 year ago

Hi @mattobny . Thank you very much for your help! Do you have an email address I can share a link with? I would send the link in a private message, but I don't think github has that functionality.

mattobny commented 1 year ago

@corecontingency surely. My email is matthew.o'brien2@bmc.org

mattobny commented 1 year ago

@corecontingency I received your email with the Giotto-related data, but I am missing Neuroblastoma_sc_Jansky.rds, which you use to create your sign matrix for runDWLSDeconv(). Are you able to share this with me as well?

corecontingency commented 1 year ago

@mattobny Apologies! I have uploaded the Neuroblastoma_sc_Jansky.rds file to the same Google Drive link that I shared before.

mattobny commented 1 year ago

@corecontingency Thanks for sharing the .rds file. Unfortunately, UpdateSeuratObject() has continually thrown errors, so I have not been able to replicate this using your data. I've added these errors to the bottom of this response so that we may address the issue at hand first.

Nonetheless, I can begin to provide some insight. runDWLSDeconv() has an important argument, cluster_column, which takes "leiden_clus" by default. Essentially, this clustering information is used to find maxima within expression data, which contributes to the deconvolution. Changing dimensions_to_use for runUMAP will change how doLeidenCluster() clusters your data, which will thus affect the deconvolution by default.

For instance, I created two Giotto objects from the Visium Integration tutorial. The first, N_pros, was processed the same way you processed your Giotto Object, F4_normalized_standard_6000. The other, N_pros_more_dim, used an extra three dimensions when running the UMAP and NN, i.e. dimensions_to_use = 1:20. Below are the results of how many cells belong to each cluster:

N_pros: 
[1] 482 - [2] 414 - [3] 384 - [4] 348 - [5] 306 - [6] 237 - [7] 223 - [8] 184 - [9] 135 - [10] 121 - [11] 119

N_pros_more_dim: 
[1] 504 - [2] 399 - [3] 379 - [4] 302 - [5] 294 - [6] 240 - [7] 234 - [8] 204 - [9] 136 - [10] 132 - [11] 129

Altering the dimensions_to_use in runUMAP will not only change the amount of instances per cluster, but could also change the amount of clusters detected by doLeidenCluster, as was the case with your data when I ran the code you provided.

Therefore, although you have provided an annotation matrix for cell typing, Giotto is still employing Leiden cluster information for the enrichment DWLS under the hood, which is likely why your plots appear so different from each other as the annotation and cluster info are not directly related.

I'm tagging @josschavezf and @gcyuan for alternative takes on this issue/explanation of function behavior.


Error upon jansky = UpdateSeuratObject(object = jansky)

Warning message:
Not a validObject(): no slot of name "images" for this object of class "Seurat"

Error in validityMethod(as(object, superClass)) : 
  object 'CsparseMatrix_validate' not found
corecontingency commented 1 year ago

@mattobny

Thank you for the help.

Do you know if there are any resources or documentation that can explain in more depth what you mean by this sentence:

runDWLSDeconv() has an important argument, cluster_column, which takes "leiden_clus" by default. Essentially, this clustering information is used to find maxima within expression data, which contributes to the deconvolution.

I am unsure why leiden clustering is needed to find maxima within expression data, or how that contributes to the deconvolution. Copying from my first post with my previous understanding of how DWLS deconvolution works:

runDWLSDeconv takes as input a DWLS sign_matrix, imported from an annotated single-cell data experiment. According to my understanding, it then compares that to the normalized expression data from a spatial data experiment, and for every spatial spot, gives a likelihood score that elements of the annotated groups from the single-cell experiment are present within that spot.

Regarding running the example code, (not sure if that is needed or important anymore, as it was just an example to illustrate my question) I think that as long as you are running the newest version of Seurat (which is 4.2.0), you should be okay to just delete or comment out the offending line:

jansky = UpdateSeuratObject(object = jansky)

mattobny commented 1 year ago

@corecontingency

The source code for runDWLSDeconv(), the internal function enrich_deconvolution(), and the internal function spot_deconvolution will further explain my point. You are on the right track, I was just describing a small part of what goes on under-the-hood for generating this output.

https://github.com/drieslab/Giotto/blob/370e3663f3d2162d9ce2a3396d72ed534c11b842/R/spatial_enrichment.R#L1888-L1923

cluster_column is first used within runDWLSDeconv() to select a column from cell metadata (1898) which will selectively index into an enrichment matrix in enrich_deconvolution()(1909). This enrichment matrix is created using the sign matrix and the normalized expression data; we determine values for this matrix iteratively, using the cluster_column. Below is the source code for enrich_deconvolution for intuition. I was referring to line 1462 in my initial response concerning maxima.

https://github.com/drieslab/Giotto/blob/370e3663f3d2162d9ce2a3396d72ed534c11b842/R/spatial_enrichment.R#L1429-L1487

One might think of this step as running an enrichment analysis on spots using the clustering information. That output matrix from enrich_deconvolution() is then binarized based on spatial resolution (L1916-L1917), and utilized in spot_deconvolution(). Again, the cluster_column is crucial because it allows the final matrix to be filled out iteratively, and keeps comparisons consistent. Using the cluster_column within spot_deconvolution(), maxima and their respective row names (i.e., which features are enriched to pass the spatial resolution "cutoff") are found within the binarized matrix iteratively. These row names are then further processed to determine a proportion of genes which encompass a spot, thus completing the deconvolution.

Below is the source code for spot_deconvolution. You will notice that it is rather similar to enrich_deconvolution, but uses the binarized matrix instead.

https://github.com/drieslab/Giotto/blob/370e3663f3d2162d9ce2a3396d72ed534c11b842/R/spatial_enrichment.R#L1493-L1560

In short, the cluster_column is necessary because we must assume the cells are related in some manner prior to running the deconvolution so that we may have a consistent basis for enrichment comparisons. This is necessary for identification of markers as well (scran requires at least two distinct groups). Returning to the basis of this issue, it makes sense that your results appear different from each other, as the different amount of leiden clusters (due to variating dimensions_to_use) has changed the comparisons happening in these functions, as previously described. So, editing your restatement:

Let me know if there is anything else I can clear up!

mattobny commented 1 year ago

@corecontingency Please let me know if this issue is still developing, or if the behavior of this function now makes sense. In the case of the latter, I'll close this issue.