Closed lcolladotor closed 2 years ago
There are a few challenges / questions to resolve that I do not know the answer to.
aggregateAcrossCells
vs manual pseudo-bulkingaggregateAcrossCells
code: https://github.com/LieberInstitute/Visium_IF_AD/blob/17943ea9a1659eac7992e27ab42dc79fb69dbb81/code/08_harmony_BayesSpace/04_heatmaps_dlpfc_markers.R#L86-L94sizeFactors
so they have to be calculated by logNormCounts()
. Maybe aggregateAcrossCells()
does keep some of the size factors info around. I don't know! Basically, I don't know if we get exactly the same normalized counts with both methods.Assuming we use aggregateAcrossCells()
, we would need to make sure that any variables we use later in the model like at https://github.com/LieberInstitute/spatialDLPFC/blob/ed33f1fc1124b754028f47bd655ece628c70405c/code/analysis/07_spatial_registration/07_spatial_registration.R#L72 are part of the DataFrame
we pass to the function https://github.com/LieberInstitute/Visium_IF_AD/blob/17943ea9a1659eac7992e27ab42dc79fb69dbb81/code/08_harmony_BayesSpace/04_heatmaps_dlpfc_markers.R#L89-L92.
For that, I thought that it could be useful to look at other packages like ExploreModelMatrix
which deal with design formulas. I see that they have this nice check to verify that the object passed is actually a formula: https://github.com/csoneson/ExploreModelMatrix/blob/master/R/ExploreModelMatrix.R#L67-L69.
Once we have a formula object, we can extract the terms like with:
fo <- y ~ x1*x2
attributes(terms(fo))$term.labels
# [1] "x1" "x2" "x1:x2"
and then we can get the column names with:
terms <- attributes(terms(fo))$term.labels
terms[!grepl(":", terms)]
# [1] "x1" "x2"
foo <- function(sce, designFormula, block_var = NULL) {
terms <- attributes(terms(fo))$term.labels
pd <- colData(sce)[ , c(terms[!grepl(":", terms)], block_var) ]
sce_pseudo <- aggregateAcrossCells(sce, pd)
## then model
## then re-arrange results
}
We would like to re-arrange the results so they look like spatialLIBD::fetch_data("modeling_results")
. That can be done with code from https://github.com/LieberInstitute/HumanPilot/blob/879f11c7c57efd72334332b40feb3ad623e067c8/Analysis/Layer_Guesses/misc_numbers.R#L302-L318.
We would then need to check that it works with http://research.libd.org/spatialLIBD/reference/layer_stat_cor.html and http://research.libd.org/spatialLIBD/reference/layer_stat_cor_plot.html. This new function would basically provide the new stats
argument for layer_stat_cor()
. Though it could also replace the modeling_results
argument, if we want to do spatial registration against a new reference. Note that modeling_results
is a list()
object.
anova
and pairwise
We could expand this new function to also run the anova
and/or pairwise
models. Right now this code only does the enrichment
model https://github.com/LieberInstitute/spatialDLPFC/blob/ed33f1fc1124b754028f47bd655ece628c70405c/code/analysis/07_spatial_registration/07_spatial_registration.R#L84-L98.
Question for @lcolladotor:
If any terms of the model are not in colData(spe)
, do we want to stop running the function or just ignore the terms?
We're getting an error at https://github.com/LieberInstitute/spatialDLPFC/blob/16f5dd6b903976ab2954ca7068a43a088a54b214/pseudo_bulk_enrichment/pseudo_bulk_enrichment.R#L147 saying that the rows of res do not match the rows of spe_pseudo
.
@sparthib
Yes, we want to get an error if not all of the model terms are in colData(spe). Should be an informative and tell specifically which terms are missing. Use stopifnot(). See end of video recording with Leo on 03/23/22.
Use this type of stop syntax https://github.com/LieberInstitute/spatialLIBD/blob/c6a0e180f770fcbeaccced244a846d7e2a01884a/R/check_spe.R#L70-L75
@sparthib
table of results from this function should only contain t-values from the cluster dataset, not the manual annotations. Then to layer_stat_cor you could provide any two objects created by our function and you would be able to compare any two datasets. Or if you want to compare to the manual annotations you can use fetch data. Result output format has to be the same though.
@lcolladotor
Sowmya and I are stuck when we're testing if the new stats
dataframe is compatible with layer_stats_cor()
. https://github.com/LieberInstitute/spatialDLPFC/blob/727edd1d246aa61a5368bfa41706ba544eac4339/pseudo_bulk_enrichment/pseudo_bulk_enrichment.R#L160-L166
We're getting this error:
Error in cor(external_stats, tstats) : 'x' must be numeric
3: stop("'x' must be numeric")
2: cor(external_stats, tstats)
1: layer_stat_cor(results_specificity, modeling_results, model_type = names(modeling_results)[2],
reverse = FALSE, top_n = NULL)```
Notes from code meeting 033022: -use top.table() in f_merge() function https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/toptable. Which part of the code does this replace? -look into purr package to replace sapply and lapply -make it so we can compare any two datasets -add options for pairwise and anova analysis instead of just enrichment
To Do: replace code to load data with user input make sure line 46 of pseudobulk code works generalize mat_formula so that we can ues it in line 103 instead of hard coding the formula add toptable as a wrapper to eBayes function adjust the way we extract p-values since we used topTable (staring line 120)
Error when creating eb0_list_cluster
Will set up dsgs
Can you include a GitHub link to the code you were trying to run? Or a reprex
? Thanks!
Sure! The version of the script on GitHub isn't fully up to date because of JHPCE being down. I have the new script saved locally. However, the code block here that we're trying to run has stayed the same. https://github.com/LieberInstitute/spatialDLPFC/blob/b69980f6fb698394abe47110753d21e0eaa68d41/pseudo_bulk_enrichment/pseudo_bulk_enrichment.R#L97-L115
And here is the whole up to date script from my computer.
library(SpatialExperiment)
library(here)
library(spatialLIBD)
library(rafalib)
library(scuttle)
library(limma)
library(RColorBrewer)
library(lattice)
library(edgeR)
# #load spe object
# load(file = here::here("processed-data","rdata","spe","01_build_spe","spe_filtered_final.Rdata"),verbose = TRUE)
#
# #load clusters
# spe <- cluster_import(
# spe,
# cluster_dir = here::here("processed-data", "rdata", "spe", "clustering_results"),
# prefix = ""
# )
# mat_formula = ~ 0 + bayesSpace_harmony_9 + region + age + sex
# foo <- function(sce, mat_formula, block_var = NULL) { #must specify in documentation that the second element of the formula is the cluster label, and the first element is zero
#
# terms <- attributes(terms(fo))$term.labels
# pd <- colData(sce)[ , c(terms[!grepl(":", terms)], block_var) ]
#
# sce_pseudo <- aggregateAcrossCells(sce, pd)
#
# ## then model
#
# ## then re-arrange results
# }
cluster <- attributes(terms(mat_formula))$term.labels[1]
## Pseudo-bulk for our current BayesSpace cluster results
spe_pseudo <- aggregateAcrossCells(
spe,
DataFrame(
BayesSpace = colData(spe)[[cluster]],
sample_id = spe$sample_id
)
)
#spe_pseudo <- logNormCounts(spe_pseudo,size.factors = NULL) #change this to use calcNormFactors from edgeR
x <- edgeR::cpm(edgeR::calcNormFactors(spe_pseudo), log = TRUE, prior.count = 1)
## Verify that the gene order hasn't changed
stopifnot(identical(rownames(x), rownames(spe_pseudo)))
## Fix the column names. DGEList will have samples names as Sample1 Sample2 etc
dimnames(x) <- dimnames(spe_pseudo)
## Store the log normalized counts on the SingleCellExperiment object
logcounts(spe_pseudo) <- x
## We don't need this 'x' object anymore
rm(x)
###############################
#find a good expression cutoff using edgeR::filterByExpr https://rdrr.io/bioc/edgeR/man/filterByExpr.html
rowData(spe_pseudo)$low_expr <- filterByExpr(spe_pseudo)
summary(rowData(spe_pseudo)$low_expr)
# Mode FALSE TRUE
# logical 21059 7857
spe_pseudo <- spe_pseudo[which(!rowData(spe_pseudo)$low_expr),]
##### get mean expression ####
mat_filter <- assays(spe_pseudo)$logcounts #make matrix of filtered just the log normalized counts
#####################
## Build a group model
#tell user in documentation to make sure their columns are converted to either factors or numerics as appropriate
#convert variables to factors
# colData(spe_pseudo)[[cluster]] <- as.factor(colData(spe_pseudo)[[cluster]])
# colData(spe_pseudo)$region <- as.factor(colData(spe_pseudo)$region)
# colData(spe_pseudo)$age <- as.numeric(colData(spe_pseudo)$age)
# colData(spe_pseudo)$sex <- as.factor(colData(spe_pseudo)$sex)
# colData(spe_pseudo)$diagnosis <- as.factor(colData(spe_pseudo)$diagnosis)
# colData(spe_pseudo)$subject <- as.factor(colData(spe_pseudo)$subject)
### access different elements of formula and check to see if they're in colData(spe_pseudo)
terms <- attributes(terms(mat_formula))$term.labels
terms <- terms[!grepl(":", terms)]
for(i in seq_along(terms)){
if(!terms[i] %in% colnames(colData(spe_pseudo))){
stop("Error: formula term ",terms[i], " is not contained in colData()")
}
}
#create matrix where the rownames are the sample:clusters and the columns are the other variables (spatial.cluster + region + age + sex)
mod<- model.matrix(mat_formula,
data = colData(spe_pseudo)) #binarizes factors
## get duplicate correlation #http://web.mit.edu/~r/current/arch/i386_linux26/lib/R/library/limma/html/dupcor.html
corfit <- duplicateCorrelation(mat_filter, mod,
block = spe_pseudo$sample_id)
## Next for each layer test that layer vs the rest
cluster_idx <- splitit(colData(spe)[,cluster])
eb0_list_cluster <- lapply(cluster_idx, function(x) {
res <- rep(0, ncol(spe_pseudo))
res[x] <- 1
#new_formula <-substitute(mat_forumula, x=quote(x_part1 + x_part2))
#attributes(terms(mat_formula))$term.labels[1]<- res #use the original mat_forumala provided by user and replace the first term with res ~ res +region + age + sex
m <- with(colData(spe_pseudo),
model.matrix(~ res +region + age + sex)) #will have to change so the formula isn't hard coded
#josh suggested use top table as a wrapper because it makes the output of eBayes nicer
eBayes(
lmFit(
mat_filter,
design = m,
block = spe_pseudo$sample_id,
correlation = corfit$consensus.correlation
)
)
})
Hm... maybe we need a as.data.frame(colData(spe_pseudo))
when creating the m
object. I guess that I don't know what is the exact value of cluster <- attributes(terms(mat_formula))$term.labels[1]
. Maybe that could be playing a part with how the formula for creating m
is hardcoded right now.
In our DSgs session today we noticed that:
reprex
could have resolved the problem since reprex
re-executes the code from scratch (though it'll take time).cluster_idx <- splitit(colData(spe)[,cluster])
should have been cluster_idx <- splitit(colData(spe_pseudo)[,cluster])
. It was hard for us to notice this error in the input SPE object since the object names are similar. We used lengths(cluster_idx)
and were running some code manually (aka x <- cluster_idx[[1]]
) when we noticed this issue.~ 0 + var + others
~ var + others
~ res + others
and need to think if we'll ask users to provide ~ var + others
as an input formula, or if we'll use eval()
and related functions to execute a character vector as R code.
Hi @lcolladotor, @sparthib and I are testing the output of our cluster registration function by trying to run it through layer_stat_cor
We're getting the error you see posted above. From what we can tell the results_specificity
object is in the correct format to be used by layer_stat_cor
. We tried making a reprex for this problem (pasted below), but it isn't that helpful because we believe problem is with our specific data object and not the layer_stat_cor
function. Will schedule an DSGS to troubleshoot.
library(SpatialExperiment)
#> Loading required package: SingleCellExperiment
#> Loading required package: SummarizedExperiment
#> Loading required package: MatrixGenerics
#> Loading required package: matrixStats
#>
#> Attaching package: 'MatrixGenerics'
#> The following objects are masked from 'package:matrixStats':
#>
#> colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
#> colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
#> colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
#> colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
#> colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
#> colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
#> colWeightedMeans, colWeightedMedians, colWeightedSds,
#> colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
#> rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
#> rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
#> rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
#> rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
#> rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
#> rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
#> rowWeightedSds, rowWeightedVars
#> Loading required package: GenomicRanges
#> Loading required package: stats4
#> Loading required package: BiocGenerics
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, append, as.data.frame, basename, cbind, colnames,
#> dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
#> grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
#> order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
#> rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
#> union, unique, unsplit, which.max, which.min
#> Loading required package: S4Vectors
#>
#> Attaching package: 'S4Vectors'
#> The following objects are masked from 'package:base':
#>
#> expand.grid, I, unname
#> Loading required package: IRanges
#> Loading required package: GenomeInfoDb
#> Loading required package: Biobase
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#>
#> Attaching package: 'Biobase'
#> The following object is masked from 'package:MatrixGenerics':
#>
#> rowMedians
#> The following objects are masked from 'package:matrixStats':
#>
#> anyMissing, rowMedians
library(here)
#> here() starts at /private/var/folders/5f/b47_7fd14cl_k8lznwjbf1nw0000gn/T/Rtmpi8lHL1/reprex-42df657ff646-awake-topi
library(spatialLIBD)
library(rafalib)
library(scuttle)
library(limma)
#>
#> Attaching package: 'limma'
#> The following object is masked from 'package:BiocGenerics':
#>
#> plotMA
library(RColorBrewer)
library(lattice)
library(edgeR)
#>
#> Attaching package: 'edgeR'
#> The following object is masked from 'package:SingleCellExperiment':
#>
#> cpm
results_specificity <- readRDS(file = "Downloads/results_specificity.RDS")
#> Warning in gzfile(file, "rb"): cannot open compressed file 'Downloads/
#> results_specificity.RDS', probable reason 'No such file or directory'
#> Error in gzfile(file, "rb"): cannot open the connection
modeling_results = fetch_data(type = "modeling_results")
#> snapshotDate(): 2022-04-26
#> adding rname 'https://www.dropbox.com/s/se6rrgb9yhm5gfh/Human_DLPFC_Visium_modeling_results.Rdata?dl=1'
#> 2022-05-25 12:48:26 loading file /Users/abbyspangler/Library/Caches/org.R-project.R/R/BiocFileCache/b0511a989378_Human_DLPFC_Visium_modeling_results.Rdata%3Fdl%3D1
cor <- layer_stat_cor(
results_specificity,
modeling_results,
model_type = names(modeling_results)[2],
reverse = FALSE,
top_n = NULL
)
#> Error in rownames(stats): object 'results_specificity' not found
Created on 2022-05-25 by the reprex package (v2.0.1)
You have an error here:
results_specificity <- readRDS(file = "Downloads/results_specificity.RDS")
#> Warning in gzfile(file, "rb"): cannot open compressed file 'Downloads/
#> results_specificity.RDS', probable reason 'No such file or directory'
#> Error in gzfile(file, "rb"): cannot open the connection
Maybe you meant to use "~/Downloads/results_specificity.RDS"
for the path?
Next Steps: Make sure pseudobulking is being done correctly. Group variable for filterByExpr() Add in anova and pairwise stats. Function within a function. Can f_merge be added to spatialLIBD? Confirm that the output that we want to return should include p-values and fdr's.
New functions as part of version 1.9.12 resolve this issue. We'll test them on a few datasets, which could lead to some changes.
It'll be easier to run the spatial registration process in your own data if we have functions for doing something like https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/ad_snRNAseq_recast.R where we pseudo-bulk the data then compute t-stats for one cluster vs all others at a time, aka, the enrichment statistics. https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/layer_specificity.R has the code for the whole set: anova and pairwise. Then https://github.com/LieberInstitute/HumanPilot/blob/master/Analysis/Layer_Guesses/misc_numbers.R re-shapes the results into what we can load with
fetch_data("modeling_results")
.This would help @abspangler13 @Nick-Eagles @sparthib and others.