LieberInstitute / spatialLIBD

Code for the spatialLIBD R/Bioconductor package and shiny app
http://LieberInstitute.github.io/spatialLIBD/

Add functions for running statistical models #33

Closed lcolladotor closed 2 years ago

lcolladotor commented 2 years ago

It'll be easier to run the spatial registration process on your own data if we have functions for doing something like https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/ad_snRNAseq_recast.R, where we pseudo-bulk the data and then compute t-statistics for one cluster vs all others at a time, aka the enrichment statistics. https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/layer_specificity.R has the code for the whole set: anova and pairwise. Then https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/misc_numbers.R reshapes the results into what we can load with fetch_data("modeling_results").

This would help @abspangler13 @Nick-Eagles @sparthib and others.

lcolladotor commented 2 years ago

There are a few challenges / questions to resolve that I do not know the answer to.

aggregateAcrossCells vs manual pseudo-bulking

design variables

Assuming we use aggregateAcrossCells(), we would need to make sure that any variables we use later in the model, like at https://github.com/LieberInstitute/spatialDLPFC/blob/ed33f1fc1124b754028f47bd655ece628c70405c/code/analysis/07_spatial_registration/07_spatial_registration.R#L72, are part of the DataFrame we pass to the function: https://github.com/LieberInstitute/Visium_IF_AD/blob/17943ea9a1659eac7992e27ab42dc79fb69dbb81/code/08_harmony_BayesSpace/04_heatmaps_dlpfc_markers.R#L89-L92.
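
For example, a minimal sketch of keeping the model variables in the ids DataFrame (the spe object and the column names bayesSpace_harmony_9, region, age and sex are assumptions borrowed from the spatialDLPFC script linked above):

library("SpatialExperiment")
library("scuttle")

## Pseudo-bulk while keeping the variables we will model later
ids <- DataFrame(
    BayesSpace = spe$bayesSpace_harmony_9,
    sample_id = spe$sample_id,
    region = spe$region,
    age = spe$age,
    sex = spe$sex
)
spe_pseudo <- aggregateAcrossCells(spe, ids)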

For that, I thought that it could be useful to look at other packages like ExploreModelMatrix which deal with design formulas. I see that they have this nice check to verify that the object passed is actually a formula: https://github.com/csoneson/ExploreModelMatrix/blob/master/R/ExploreModelMatrix.R#L67-L69.
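
As a rough sketch (not the ExploreModelMatrix code itself), the check could be as simple as:

## Hypothetical check, in the spirit of the ExploreModelMatrix one linked above
if (!inherits(designFormula, "formula")) {
    stop("'designFormula' should be a formula, e.g. ~ region + age + sex", call. = FALSE)
}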

Once we have a formula object, we can extract the terms with:

fo <- y ~ x1*x2
attributes(terms(fo))$term.labels
# [1] "x1"    "x2"    "x1:x2"

and then we can get the column names with:

terms <- attributes(terms(fo))$term.labels
terms[!grepl(":", terms)]
# [1] "x1" "x2"

Function initial structure

foo <- function(sce, designFormula, block_var = NULL) {

    ## Extract the main-effect variables from the design formula
    terms <- attributes(terms(designFormula))$term.labels
    pd <- colData(sce)[, c(terms[!grepl(":", terms)], block_var)]

    ## Pseudo-bulk with scuttle::aggregateAcrossCells()
    sce_pseudo <- aggregateAcrossCells(sce, pd)

    ## then model

    ## then re-arrange results
}

Re-arrange results

We would like to re-arrange the results so they look like spatialLIBD::fetch_data("modeling_results"). That can be done with code from https://github.com/LieberInstitute/HumanPilot/blob/879f11c7c57efd72334332b40feb3ad623e067c8/Analysis/Layer_Guesses/misc_numbers.R#L302-L318.
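
A hedged sketch of that reshaping step, assuming eb0_list is the list of eBayes() fits (one per cluster, with the cluster indicator as the second coefficient) and assuming the target format uses t_stat_/p_value_/fdr_ prefixed columns plus an ensembl column (an assumption about the fetch_data("modeling_results") layout, not the misc_numbers.R code itself):

library("limma")

## Hypothetical helper: turn the list of per-cluster eBayes() fits into one wide data.frame
enrichment_to_wide <- function(eb0_list, coef = 2) {
    tstats <- sapply(eb0_list, function(fit) fit$t[, coef])
    pvals <- sapply(eb0_list, function(fit) fit$p.value[, coef])
    fdrs <- apply(pvals, 2, p.adjust, method = "fdr")
    colnames(tstats) <- paste0("t_stat_", colnames(tstats))
    colnames(pvals) <- paste0("p_value_", colnames(pvals))
    colnames(fdrs) <- paste0("fdr_", colnames(fdrs))
    data.frame(tstats, pvals, fdrs, ensembl = rownames(eb0_list[[1]]$t))
}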

Check that it's compatible with downstream functions

We would then need to check that it works with http://research.libd.org/spatialLIBD/reference/layer_stat_cor.html and http://research.libd.org/spatialLIBD/reference/layer_stat_cor_plot.html. This new function would basically provide the new stats argument for layer_stat_cor(), though it could also replace the modeling_results argument if we want to do spatial registration against a new reference. Note that modeling_results is a list() object.
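
For instance, the intended usage would be roughly the following, where new_stats stands for the (hypothetical) output of the new function:

library("spatialLIBD")

## Compare the new enrichment statistics against the Human DLPFC reference
modeling_results <- fetch_data(type = "modeling_results")
cor_layer <- layer_stat_cor(
    stats = new_stats,
    modeling_results = modeling_results,
    model_type = "enrichment"
)
layer_stat_cor_plot(cor_layer)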

Add support for anova and pairwise

We could expand this new function to also run the anova and/or pairwise models. Right now this code only does the enrichment model https://github.com/LieberInstitute/spatialDLPFC/blob/ed33f1fc1124b754028f47bd655ece628c70405c/code/analysis/07_spatial_registration/07_spatial_registration.R#L84-L98.
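
As a rough illustration of the pairwise piece (not the HumanPilot code itself), with a ~ 0 + cluster design one could build all pairwise contrasts with limma. Here mod0 and mat_filter are hypothetical names for the design matrix and the log-normalized pseudo-bulk matrix, and the cluster columns are assumed to have syntactically valid names:

library("limma")

## All pairwise cluster contrasts from a ~ 0 + BayesSpace + covariates design
cluster_cols <- grep("^BayesSpace", colnames(mod0), value = TRUE)
pairs <- combn(cluster_cols, 2, FUN = paste, collapse = "-")
fit0 <- lmFit(mat_filter, mod0)
fit_pairwise <- eBayes(contrasts.fit(fit0, makeContrasts(contrasts = pairs, levels = mod0)))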

abspangler13 commented 2 years ago

Question for @lcolladotor: If any terms of the model are not in colData(spe), do we want to stop running the function or just ignore the terms?

We're getting an error at https://github.com/LieberInstitute/spatialDLPFC/blob/16f5dd6b903976ab2954ca7068a43a088a54b214/pseudo_bulk_enrichment/pseudo_bulk_enrichment.R#L147 saying that the rows of res do not match the rows of spe_pseudo.

@sparthib

abspangler13 commented 2 years ago

Yes, we want to get an error if not all of the model terms are in colData(spe). The error should be informative and say specifically which terms are missing. Use stopifnot(). See the end of the video recording with Leo on 03/23/22.

Use this type of stop syntax https://github.com/LieberInstitute/spatialLIBD/blob/c6a0e180f770fcbeaccced244a846d7e2a01884a/R/check_spe.R#L70-L75
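
A minimal sketch of that check, using a hypothetical check_model_terms() helper:

library("SummarizedExperiment")

## Hypothetical helper: stop with the specific missing terms, in the style of
## the check_spe() error linked above
check_model_terms <- function(formula, spe) {
    terms <- attr(terms(formula), "term.labels")
    terms <- terms[!grepl(":", terms)]
    missing_terms <- terms[!terms %in% colnames(colData(spe))]
    if (length(missing_terms) > 0) {
        stop(
            "The following formula terms are missing from colData(spe): ",
            paste(missing_terms, collapse = ", "),
            call. = FALSE
        )
    }
    invisible(TRUE)
}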

@sparthib

abspangler13 commented 2 years ago

The table of results from this function should only contain t-values from the cluster dataset, not the manual annotations. Then you could provide layer_stat_cor() with any two objects created by our function and compare any two datasets. Or, if you want to compare against the manual annotations, you can use fetch_data(). The output format has to be the same either way, though.

abspangler13 commented 2 years ago

@lcolladotor Sowmya and I are stuck when we're testing whether the new stats data frame is compatible with layer_stat_cor(). https://github.com/LieberInstitute/spatialDLPFC/blob/727edd1d246aa61a5368bfa41706ba544eac4339/pseudo_bulk_enrichment/pseudo_bulk_enrichment.R#L160-L166

We're getting this error:

Error in cor(external_stats, tstats) : 'x' must be numeric


3: stop("'x' must be numeric")
2: cor(external_stats, tstats)
1: layer_stat_cor(results_specificity, modeling_results, model_type = names(modeling_results)[2], 
       reverse = FALSE, top_n = NULL)
abspangler13 commented 2 years ago

Notes from the code meeting on 03/30/22:

- Use topTable() in the f_merge() function (https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/toptable). Which part of the code does this replace?
- Look into the purrr package to replace sapply() and lapply() (see the toy example below).
- Make it so we can compare any two datasets.
- Add options for the pairwise and anova analyses instead of just enrichment.
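
For the purrr point, the idea is just swapping the base apply functions; a toy illustration (not the project code):

library("purrr")

## purrr equivalents of lapply()/sapply()
idx <- list(a = 1:3, b = 4:6)
lapply(idx, length)
map(idx, length)     # list output, like lapply()
map_int(idx, length) # typed integer output, safer than sapply()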

abspangler13 commented 2 years ago

To Do:

- Replace the code to load data with user input.
- Make sure line 46 of the pseudo-bulk code works.
- Generalize mat_formula so that we can use it in line 103 instead of hard-coding the formula.
- Add topTable() as a wrapper around the eBayes() output (see the sketch below).
- Adjust the way we extract p-values since we used topTable() (starting at line 120).
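
For the topTable() item, a hedged sketch of extracting the statistics from one fit. Here fit_enrichment is a hypothetical name for one element of the list of eBayes() fits, and 'res' is assumed to be the 0/1 cluster-indicator coefficient in its design:

library("limma")

## Moderated t-statistics, p-values and BH-adjusted p-values for one cluster,
## via topTable() instead of indexing fit$t and fit$p.value manually
top <- topTable(fit_enrichment, coef = "res", number = Inf, sort.by = "none")
head(top[, c("logFC", "t", "P.Value", "adj.P.Val")])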

abspangler13 commented 2 years ago

Error when creating eb0_list_cluster. Will set up a DSgs session.

[Screenshot of the error attached, 2022-04-12 1:32 PM]
lcolladotor commented 2 years ago

Can you include a GitHub link to the code you were trying to run? Or a reprex? Thanks!

abspangler13 commented 2 years ago

Sure! The version of the script on GitHub isn't fully up to date because of JHPCE being down. I have the new script saved locally. However, the code block here that we're trying to run has stayed the same. https://github.com/LieberInstitute/spatialDLPFC/blob/b69980f6fb698394abe47110753d21e0eaa68d41/pseudo_bulk_enrichment/pseudo_bulk_enrichment.R#L97-L115

And here is the whole up-to-date script from my computer.

library(SpatialExperiment)
library(here)
library(spatialLIBD)
library(rafalib)
library(scuttle)
library(limma)
library(RColorBrewer)
library(lattice)
library(edgeR)

# #load spe object
# load(file = here::here("processed-data","rdata","spe","01_build_spe","spe_filtered_final.Rdata"),verbose = TRUE)
# 
# #load clusters
# spe <- cluster_import(
#   spe,
#   cluster_dir = here::here("processed-data", "rdata", "spe", "clustering_results"), 
#   prefix = ""
# )

# mat_formula = ~ 0 + bayesSpace_harmony_9 + region + age + sex 
# foo <- function(sce, mat_formula, block_var = NULL) { #must specify in documentation that the second element of the formula is the cluster label, and the first element is zero
#   
#   terms <- attributes(terms(mat_formula))$term.labels
#   pd <- colData(sce)[ , c(terms[!grepl(":", terms)], block_var) ]
#   
#   sce_pseudo <- aggregateAcrossCells(sce, pd)
#   
#   ## then model
#   
#   ## then re-arrange results
# }

cluster <-  attributes(terms(mat_formula))$term.labels[1]

## Pseudo-bulk for our current BayesSpace cluster results
spe_pseudo <- aggregateAcrossCells(
  spe,
  DataFrame(
    BayesSpace = colData(spe)[[cluster]],
    sample_id = spe$sample_id
  )
)

#spe_pseudo <- logNormCounts(spe_pseudo,size.factors = NULL) #change this to use calcNormFactors from edgeR
x <- edgeR::cpm(edgeR::calcNormFactors(spe_pseudo), log = TRUE, prior.count = 1)
## Verify that the gene order hasn't changed
stopifnot(identical(rownames(x), rownames(spe_pseudo)))
## Fix the column names. DGEList will have samples names as Sample1 Sample2 etc
dimnames(x) <- dimnames(spe_pseudo)
## Store the log normalized counts on the SingleCellExperiment object
logcounts(spe_pseudo) <- x
## We don't need this 'x' object anymore
rm(x)

###############################

#find a good expression cutoff using edgeR::filterByExpr https://rdrr.io/bioc/edgeR/man/filterByExpr.html
rowData(spe_pseudo)$low_expr <- filterByExpr(spe_pseudo)
summary(rowData(spe_pseudo)$low_expr)
# Mode   FALSE    TRUE 
# logical   21059    7857 

spe_pseudo <- spe_pseudo[which(!rowData(spe_pseudo)$low_expr),]

##### get mean expression  ####
mat_filter <- assays(spe_pseudo)$logcounts # matrix of just the log-normalized counts for the filtered genes

#####################
## Build a group model

#tell user in documentation to make sure their columns are converted to either factors or numerics as appropriate
#convert variables to factors 
# colData(spe_pseudo)[[cluster]] <- as.factor(colData(spe_pseudo)[[cluster]])
# colData(spe_pseudo)$region <- as.factor(colData(spe_pseudo)$region)
# colData(spe_pseudo)$age <- as.numeric(colData(spe_pseudo)$age)
# colData(spe_pseudo)$sex <- as.factor(colData(spe_pseudo)$sex)
# colData(spe_pseudo)$diagnosis <- as.factor(colData(spe_pseudo)$diagnosis)
# colData(spe_pseudo)$subject <- as.factor(colData(spe_pseudo)$subject)

### access different elements of formula and check to see if they're in colData(spe_pseudo)
terms <- attributes(terms(mat_formula))$term.labels
terms <- terms[!grepl(":", terms)]
for(i in seq_along(terms)){
  if(!terms[i] %in% colnames(colData(spe_pseudo))){
    stop("Error: formula term ",terms[i], " is not contained in colData()")
  }
}

#create matrix where the rownames are the sample:clusters and the columns are the other variables (spatial.cluster + region + age + sex)

mod<- model.matrix(mat_formula,
                   data = colData(spe_pseudo)) #binarizes factors 

## get duplicate correlation #http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/dupcor.html
corfit <- duplicateCorrelation(mat_filter, mod,
                               block = spe_pseudo$sample_id)

## Next for each layer test that layer vs the rest
cluster_idx <- splitit(colData(spe)[,cluster]) 

eb0_list_cluster <- lapply(cluster_idx, function(x) {
  res <- rep(0, ncol(spe_pseudo))
  res[x] <- 1
  #new_formula <- substitute(mat_formula, x = quote(x_part1 + x_part2))
  #attributes(terms(mat_formula))$term.labels[1] <- res #use the original mat_formula provided by the user and replace the first term with res: ~ res + region + age + sex
  m <- with(colData(spe_pseudo),
            model.matrix(~ res +region + age + sex)) #will have to change so the formula isn't hard coded 

  #josh suggested use top table as a wrapper because it makes the output of eBayes nicer

    eBayes(
    lmFit(
      mat_filter,
      design = m,
      block = spe_pseudo$sample_id,
      correlation = corfit$consensus.correlation
    )
  )
})
lcolladotor commented 2 years ago

Hm... maybe we need an as.data.frame(colData(spe_pseudo)) when creating the m object. I guess I don't know the exact value of cluster <- attributes(terms(mat_formula))$term.labels[1]. Maybe that could be playing a part in how the formula for creating m is hardcoded right now.
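
Concretely, the suggestion would be something along these lines (a sketch that keeps the hard-coded covariates for now and assumes the packages loaded in the script above):

## Inside the per-cluster loop: build 'm' from a plain data.frame instead of with()
pd <- as.data.frame(colData(spe_pseudo))
pd$res <- res
m <- model.matrix(~ res + region + age + sex, data = pd)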

lcolladotor commented 2 years ago

In our DSgs session today we noticed that:

~ 0 + var + others
~ var + others
~ res + others

and we need to think about whether we'll ask users to provide ~ var + others as an input formula, or whether we'll use eval() and related functions to execute a character vector as R code.
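
One way to avoid eval() on a character string would be to rebuild the per-cluster formula with reformulate(). A sketch, assuming the first term of the user's formula is the cluster variable that gets swapped for the 0/1 indicator 'res':

## Example user input
mat_formula <- ~ bayesSpace_harmony_9 + region + age + sex

## Drop the cluster term and swap in the 0/1 indicator 'res'
covariates <- attr(terms(mat_formula), "term.labels")[-1]
per_cluster_formula <- reformulate(c("res", covariates))
per_cluster_formula
#> ~res + region + age + sex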

abspangler13 commented 2 years ago

When running these lines: https://github.com/LieberInstitute/spatialDLPFC/blob/900e99e2cbfb69cfd99df984f5f640aa8f0f3415/pseudo_bulk_enrichment/pseudo_bulk_enrichment.R#L110-L111

[Screenshot of the error attached, 2022-04-22 3:12 PM]
abspangler13 commented 2 years ago
[Screenshot of the error attached, 2022-04-27 12:29 PM]
abspangler13 commented 2 years ago

Hi @lcolladotor, @sparthib and I are testing the output of our cluster registration function by trying to run it through layer_stat_cor(). We're getting the error you see posted above. From what we can tell, the results_specificity object is in the correct format to be used by layer_stat_cor(). We tried making a reprex for this problem (pasted below), but it isn't that helpful because we believe the problem is with our specific data object and not the layer_stat_cor() function. Will schedule a DSgs session to troubleshoot.

library(SpatialExperiment)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#> 
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#>     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#>     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#>     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#>     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#>     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#>     colWeightedMeans, colWeightedMedians, colWeightedSds,
#>     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#>     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#>     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#>     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#>     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#>     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#>     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#>     rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#>     dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#>     grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#>     order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#>     rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#>     union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#> 
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#> 
#>     expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> 
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#> 
#>     rowMedians
#> The following objects are masked from 'package:matrixStats':
#> 
#>     anyMissing, rowMedians
library(here)
#> here() starts at /private/var/folders/5f/b47_7fd14cl_k8lznwjbf1nw0000gn/T/Rtmpi8lHL1/reprex-42df657ff646-awake-topi
library(spatialLIBD)
library(rafalib)
library(scuttle)
library(limma)
#> 
#> Attaching package: 'limma'
#> The following object is masked from 'package:BiocGenerics':
#> 
#>     plotMA
library(RColorBrewer)
library(lattice)
library(edgeR)
#> 
#> Attaching package: 'edgeR'
#> The following object is masked from 'package:SingleCellExperiment':
#> 
#>     cpm

results_specificity <- readRDS(file = "Downloads/results_specificity.RDS")
#> Warning in gzfile(file, "rb"): cannot open compressed file 'Downloads/
#> results_specificity.RDS', probable reason 'No such file or directory'
#> Error in gzfile(file, "rb"): cannot open the connection

modeling_results = fetch_data(type = "modeling_results")
#> snapshotDate(): 2022-04-26
#> adding rname 'https://www.dropbox.com/s/se6rrgb9yhm5gfh/Human_DLPFC_Visium_modeling_results.Rdata?dl=1'
#> 2022-05-25 12:48:26 loading file /Users/abbyspangler/Library/Caches/org.R-project.R/R/BiocFileCache/b0511a989378_Human_DLPFC_Visium_modeling_results.Rdata%3Fdl%3D1

cor <- layer_stat_cor(
  results_specificity,
  modeling_results,
  model_type = names(modeling_results)[2],
  reverse = FALSE,
  top_n = NULL
)
#> Error in rownames(stats): object 'results_specificity' not found

Created on 2022-05-25 by the reprex package (v2.0.1)

lcolladotor commented 2 years ago

You have an error here:

results_specificity <- readRDS(file = "Downloads/results_specificity.RDS")
#> Warning in gzfile(file, "rb"): cannot open compressed file 'Downloads/
#> results_specificity.RDS', probable reason 'No such file or directory'
#> Error in gzfile(file, "rb"): cannot open the connection

Maybe you meant to use "~/Downloads/results_specificity.RDS" for the path?

abspangler13 commented 2 years ago

Next Steps:

- Make sure the pseudo-bulking is being done correctly.
- Use a group variable for filterByExpr() (see the sketch below).
- Add in the anova and pairwise stats. Function within a function.
- Can f_merge be added to spatialLIBD?
- Confirm that the output that we want to return should include p-values and FDRs.
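
For the filterByExpr() group point, a hedged sketch, assuming the BayesSpace column created during pseudo-bulking is the grouping variable and that the summed assay is named "counts":

library("edgeR")
library("SingleCellExperiment")

## Filter low-expression genes using the cluster labels as the group
keep <- filterByExpr(counts(spe_pseudo), group = spe_pseudo$BayesSpace)
summary(keep)
spe_pseudo <- spe_pseudo[keep, ]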

lcolladotor commented 2 years ago

New functions as part of version 1.9.12 resolve this issue. We'll test them on a few datasets, which could lead to some changes.