drieslab / Giotto

Spatial omics analysis toolbox
https://drieslab.github.io/Giotto_website/
Other
257 stars 98 forks source link

PAGEEnrich_error #447

Open krigia opened 1 year ago

krigia commented 1 year ago

Hello Giotto team,

We try to run "runPAGEEnrich": F1_normalized_standard_6000 = runPAGEEnrich(gobject = F1_normalized_standard_6000, min_overlap_genes = 0, p_value = TRUE, sign_matrix = PAGE_matrix_marker) And we get the following error message:

Error in cut.default(all_perms_num, breaks = nr_groups, labels = group_labels) : invalid number of intervals

Thank you

RubD commented 1 year ago

Hi @krigia

Can you provide a minimal example? Otherwise it's almost impossible for us to locate the potential issue. See also https://giottosuite.readthedocs.io/en/latest/github_issues.html for guidelines on posting issues.

Thanks, Ruben

krigia commented 1 year ago

Hi @RubD I provided the example. Will explain again: Running runRAGEEnrich for a Giotto object : F1_normalized_standard_6000

_F1_normalized_standard_6000 = runPAGEEnrich(gobject = F1_normalized_standard_6000, min_overlap_genes = 0, p_value = TRUE, sign_matrix = PAGE_matrixmarker)

We get the error: _Error in cut.default(all_perms_num, breaks = nr_groups, labels = grouplabels) : invalid number of intervals

Does it help?

Thank you!

krigia commented 1 year ago

Should I add a certain argument in function runPAGEEnrich or modify sign_matrix = PAGE_matrix_marker ??

corecontingency commented 1 year ago

@RubD

Hello Ruben,

I am working with @krigia on this project. A minimal code sample to reproduce the bug is at the bottom of this post. I have also shared the data to use with the attached code to reproduce the bug with your rubendries@gmail.com email address. Just change the data directories in the code below to point to the downloaded data and update the python path and the bug should be reproducible.

Thank you for your help, I greatly appreciate it.


library(Giotto)
library(data.table)
library(magrittr)
library(ggplot2)
library(patchwork)
library(dplyr)
library(tidyr)

# set results directory
results_folder <- '/results_alpha'

# create instructions
instrs <- createGiottoInstructions(save_dir = results_folder,
                                   python_path = "/Users/user/opt/anaconda3/envs/r-reticulate/envs/giotto_env/bin/python",
                                   show_plot = FALSE,
                                   return_plot = FALSE,
                                   save_plot = FALSE,
                                   plot_format = "png",
                                   dpi = 300)

# set data directories
data_path_F1_h5 = "/path/to/dir/F1/outs/raw_feature_bc_matrix.h5"

data_path_F1_tissue_positions = "/path/to/dir/F1/outs/spatial/tissue_positions_list.csv"

data_path_F1_image = "/path/to/dir/F1/outs/spatial/tissue_hires_image.png"

data_path_F1_scalefactors = "/path/to/dir/F1/outs/spatial/scalefactors_json.json"

# create Giotto objects
F1 <- createGiottoVisiumObject(h5_visium_path = data_path_F1_h5,
                h5_tissue_positions_path = data_path_F1_tissue_positions,
                h5_image_png_path = data_path_F1_image,
                h5_json_scalefactors_path = data_path_F1_scalefactors,
                               instructions = instrs)

# subset on spots that were covered by tissue
metadata <- pDataDT(F1)
in_tissue_barcodes <- metadata[in_tissue == 1]$cell_ID
F1 <- subsetGiotto(F1, cell_ids = in_tissue_barcodes)

# clear unused variables
rm(metadata, in_tissue_barcodes)

# filter
F1 <- filterGiotto(gobject = F1,
                  expression_threshold = 1,
                  feat_det_in_min_cells = 5,
                  min_det_feats_per_cell = 1000,
                  expression_values = c('raw'),
                  verbose = T)

F1 <- addStatistics(F1, expression_values = "raw")

## normalize with standard log protocol - scalefactor is set to 6000
# normalize
F1_normalized_standard_6000 <- normalizeGiotto(gobject = F1, norm_methods = "standard", scalefactor = 6000, verbose = T)

# add gene & cell statistics
F1_normalized_standard_6000 <- addStatistics(gobject = F1_normalized_standard_6000)

# calculate HVF for the standard_6000 normalized data
F1_normalized_standard_6000 <- calculateHVF(gobject = F1_normalized_standard_6000)

# run PCA on expression values for the standard_6000 normalized data
gene_metadata = fDataDT(F1_normalized_standard_6000)
featgenes = gene_metadata$feat_ID
F1_normalized_standard_6000 <- runPCA(gobject = F1_normalized_standard_6000, feats_to_use = featgenes)

# clear unused variables
rm(gene_metadata, featgenes)

## run UMAP
# run UMAP on expression values for the standard_6000 normalized data
F1_normalized_standard_6000 <- runUMAP(F1_normalized_standard_6000, dimensions_to_use = 1:15)

## make sNN networks
# sNN network for the standard_6000 normalized data
F1_normalized_standard_6000 <- createNearestNetwork(gobject = F1_normalized_standard_6000, k = 30, dimensions_to_use = 1:15)

## run leiden clustering
# run leiden clustering for the standard_6000 normalized data
F1_normalized_standard_6000 <- doLeidenCluster(gobject = F1_normalized_standard_6000, resolution = 1, n_iterations = 1000)

MYCN = "MYCN"
PHOX2B = "PHOX2B"
Neuroblastoma = c("PHOX2A", "PHOX2B", "ISL1", "STMN2", "NEFM", "PRPH", "NCAM1", "B4GALNT1", "TH", "DBH", "CHGA", "CHGB", "GATA2", "SYP")
PAGE_matrix_marker = makeSignMatrixPAGE(sign_names = c("MYCN", "PHOX2B", "Neuroblastoma"), sign_list = list(MYCN, PHOX2B, Neuroblastoma))

F1_normalized_standard_6000 = runPAGEEnrich(gobject = F1_normalized_standard_6000, min_overlap_genes = 0, p_value = TRUE, sign_matrix = PAGE_matrix_marker)```
RubD commented 1 year ago

Thanks @corecontingency , this is very helpful. We will take a look and try to find a solution. One last question, are you using the latest Giotto Suite version 3.0.1?

packageVersion('Giotto')

RubD commented 1 year ago

Here is my first quick response:

  1. You seemed to call highly variable features/genes, but you don't use it? Not sure if that's intentional or not. This is how you only use hvf in the PCA.

    gene_metadata = fDataDT(F1_normalized_standard_6000)  
    featgenes = gene_metadata[hvf == 'yes']$feat_ID  
    F1_normalized_standard_6000 <- runPCA(gobject = F1_normalized_standard_6000, feats_to_use = featgenes)
  2. The main issue is that you're trying to run an enrichment analysis based on only 1 gene. If you add 2 more genes to the MYCN and PHOX2 gene lists, then it runs fine.

    
    MYCN = c("MYCN")
    PHOX2B = c("PHOX2B")
    Neuroblastoma = c("PHOX2A", "PHOX2B", "ISL1", "STMN2", "NEFM", "PRPH", "NCAM1", "B4GALNT1", "TH", "DBH", "CHGA", "CHGB", "GATA2", "SYP")
    PAGE_matrix_marker = makeSignMatrixPAGE(sign_names = c("MYCN", "PHOX2B", "Neuroblastoma"), sign_list = list(MYCN, PHOX2B, Neuroblastoma))


3. I'm not sure if it would make sense to run an enrichment analysis for a single gene, but it would definitely take more time to implement. One way to see where a gene is 'enriched' at the individual spot level would be to simply visualize the scaled (z-scores) expression values, e.g.
`spatFeatPlot2D(F1_normalized_standard_6000, feats = 'MYCN', expression_values = 'scaled', show_plot = T, cow_n_col = 1, point_size = 2)`

Hope this helps.
krigia commented 1 year ago

@RubD Thank you! This is helpful. Another question: What statistical tool do you apply to run "findMarkers_one_vs_all" function?

https://giottosuite.readthedocs.io/en/latest/subsections/md_rst/findMarkers_one_vs_all.html

Can you provide more information regarding the method = c("scran", "gini", "mast")?

Thank you

krigia commented 1 year ago

@RubD Seurat provides various statistical tools: https://satijalab.org/seurat/reference/findallmarkers

Do any of these tools fall into scran", and "gini" method?

I would greatly appreciate your feedback!

corecontingency commented 1 year ago

@RubD Thank you for your help!

Yes, we were running an old version of Giotto Suite. I have now updated the package. Responses below:

You seemed to call highly variable features/genes, but you don't use it? Not sure if that's intentional or not. This is how you only use hvf in the PCA.

Yes, that was intentional. We are experimenting with hvf on vs. off for our PCAs.

One way to see where a gene is 'enriched' at the individual spot level would be to simply visualize the scaled (z-scores) expression values, e.g.

Thank you! I did not realize this was possible, that is why I was trying to use the PAGE matrix method, instead.

mattobny commented 1 year ago

@krigia Giotto's findMarkers_one_vs_all utilizes a wrapper to the MAST package in one case, with method = "mast". Seurat also has an option to wrap to the MAST package.

Providing the argument method = "scran" to findMarkers_one_vs_all elicits the package scran and wraps to its findMarkers function. Additional parameters to the function in scran may be indicated within findMarkers_one_vs_all thanks to dotparams. Scran's findMarkers has different statistical test capabilities determined by the argument test.type. It utilizes a t-test when test.type is set to "t", a Wilcoxon rank sum test when set to "wilcox", or a binomial test when set to "binom".

More details on the Gini method may be found within our documentation. I hope this answers your questions!

krigia commented 1 year ago

@mattobny yes, thank you. This is helpful. Can you provide more information for the 3 different methods ("cov_groups", "cov_loess", "var_p_resid") in function calculateHVF ? https://giottosuite.readthedocs.io/en/latest/subsections/md_rst/calculateHVF.html

mattobny commented 1 year ago

Gladly. @krigia

When method = "cov_groups", features are binned into average expression groups. The number of bins depends on argument nr_expression_groups. Then, the coefficient of variance (COV) for each feature is converted into a z-score within each bin. Features with a z-score higher than the threshold, determined by the argument zscore_threshold, are considered highly variable.

When method = "cov_loess", a predicted COV is calculated for each feature using loess regression (i.e., COV~log(mean expression)). Features that show a COV higher than that which was predicted are considered highly variable. The threshold difference between predicted and actual to determine if a feature is highly variable may be specified by the argument difference_in_cov.

Specifying method = "var_p_resid" should only be done when the expression data has been normalized by invoking the pearson residual method parameter (i.e, gobject <- normalizeGiotto(gobject, norm_methods = "pearson_resid", ...)). It should be noted that when the expression data is normalized in this manner, the normalized data is returned to the scaled slot within expression by default. This method is described in detail in Lause/Kobak et al. 2021. It produces a set of values that are most similar in utility to a scaled matrix (thus the scaled slot placement) and offer improvements to both HVF detection and PCA generation. Note that these values should not be used for statistics, plotting of expression values, or differential expression analysis.

krigia commented 1 year ago

@mattobny This is very helpful. Thank you very much. We run "cov_groups" and "cov_loess" for default criteria. The downstream UMAP analysis for various PCs (1:10, 1:12, 1:15, 1:17, 1:20, 1:25, 1:30) showed that "cov_groups" is likely better. We would like to try different settings considering " expression_threshold", "nr_expression_groups", " "zscore_threshold". What would be your suggestion? Any further input is highly appreciated.

krigia commented 1 year ago

@RubD @mattobny We have run Giotto for "cov_groups" and "cov_loess" for default criteria with HVF "on"and "off" and we have observed the following: 1. The same % of variance is explained by both methods (Cov_group vs Cov_Loess), 2. It appears that "HVF yes" explains more variance compared to "HVF no" for the majority of the samples. Not clear, but likely "HVF no" vs "HVF yes" slightly separates clusters more distinctly at least for Cov_group 4. "HVF no" compared to "HVF yes" slightly shows higher number of clusters for both stats methods. Have you observed any similar results? Any explanation why "HVF no" likely behaves better compared to "HVF yes" ?