MarioniLab / miloR

R package implementation of Milo for testing for differential abundance in KNN graphs
https://bioconductor.org/packages/release/bioc/html/miloR.html
GNU General Public License v3.0
342 stars 22 forks source link

Errors when using custom adjacency matrix for makeNhoods or DA testing #172

Closed andrewjkwok closed 3 years ago

andrewjkwok commented 3 years ago

Hi,

Following up from #167, I manage to successfully build a graph for my Milo object using a custom adjacency matrix. However, I continue to run into a couple of problems downstream.

First, when running from Rstudio server, there is an error Error in replCmat4(x, i1 = if (iMi) 0:(di[1] - 1L) else .ind.prep2(i, : too many replacement values when trying to run makeNhoods as in #107 (when running as a job this error does not seem to appear). The traceback is as follows:

5: stop("too many replacement values")
4: replCmat4(x, i1 = if (iMi) 0:(di[1] - 1L) else .ind.prep2(i, 
       1, di, dn), i2 = if (jMi) 0:(di[2] - 1L) else .ind.prep2(j, 
       2, di, dn), iMi = iMi, jMi = jMi, value = value)
3: `[<-`(`*tmp*`, as_ids(neighbors(graph, v = sampled_vertices[X])), 
       X, value = 1)
2: `[<-`(`*tmp*`, as_ids(neighbors(graph, v = sampled_vertices[X])), 
       X, value = 1)
1: makeNhoods(sce_milo, prop = 0.1, k = 20, d = 30, refined = TRUE, 
       reduced_dims = "TOTALVI")

As mentioned in #107 , it seems the error is coming from Matrix (https://github.com/cran/Matrix/blob/master/R/Csparse.R).

There is a further issue downstream even if I circumvent the Rstudio problem where when I use a custom adjacency matrix, after DA testing, the spatialFDRs are all NAs. This is more puzzling to me and would be great to understand what's going on!

Many thanks in advance.

MikeDMorgan commented 3 years ago

Hi @andrewjkwok Could you also provide the exact code that generates the error, including the steps involving graph building. The error suggests that none of the IDs that are neighbours of the index cell are in the nh_mat. Could you also check that the rownames of the reduced dimension are set with rownames(reducedDim(sce_milo, "TOTAVI") and the cell IDs are in the colnames(sce_milo).

andrewjkwok commented 3 years ago

Hi - thanks for the quick reply. This is the exact code that generates the error:

sce <- as.SingleCellExperiment(seurat)

# convert singlecellexperiment object into milo object
sce_milo <- Milo(sce)

miloR::graph(sce_milo) <- miloR::graph(buildFromAdjacency(seurat@graphs$RNA_snn, k=20, is.binary = FALSE))

sce_milo@.k <- 20

sce_milo <- makeNhoods(sce_milo, prop = 0.1, k = 20, d=30, refined = TRUE, reduced_dims = "TOTALVI")

At this point in Rstudio server I get the error Error in replCmat4(x, i1 = if (iMi) 0:(di[1] - 1L) else .ind.prep2(i, : too many replacement values.

In terms of the rownames of the reduced dimension, they are all the cellIDs:

> all.equal(rownames(reducedDim(sce_milo, "TOTALVI")), colnames(sce_milo))
[1] TRUE
MikeDMorgan commented 3 years ago

Can you double-check that you have consistent versions between the Rstudio server and the R script job, i.e. the same versions of all packages, and that you are using the same miloR install for both.

I should say that I am not able to reproduce this error with the sub-sample data that you provided, which suggests a discordance between installations.

andrewjkwok commented 3 years ago

Hi @MikeDMorgan - I've checked and the only difference I can find between Rstudio server and the R script job is the locale:

For rstudioserver:

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

And for the script job:

locale:
[1] C

However, one thing I did notice was that my version of miloR in both was miloR_0.99.19 and not 1.x, which is possibly to do with the fact that I'm using R4.0 rather than 4.1? I'm not sure if this is relevant as my most relevant installation was with devtools::install_github("MarioniLab/miloR", ref="binary_fix") (I have double checked this - if I try running this again it says Skipping install of 'miloR' from a github remote, the SHA1 (32ad4b9f) has not changed since last install. Does it look to you like it's an issue of R + MiloR versions?

MikeDMorgan commented 3 years ago

Hi @andrewjkwok This isn't an issue with miloR versions between R 4.0 and 4.1, the current version of miloR is 0.99.19.

Did you run devtools::install_github inside an Rstudio session, or is this run in a separate R session? Are both your Rstudio server and R install pointing to the same package install location? Are you running this in an entirely new Rstudio server session? This currently looks like an Rstudio server issue, rather than an explicit problem with miloR per se.

Regarding the 'NA' in the spatial FDRs - there is an incoming fix for this.

andrewjkwok commented 3 years ago

I ran this in a separate R session as the Rstudio server sessions are on an interactive node with no internet connection. Both my Rstudio server and R install point to the same package install location.

Running as a job does seem to be fine so as long as the NAs in spatial FDRs is resolved I suppose it should be ok!

MikeDMorgan commented 3 years ago

That is a conundrum - unfortunately without access to an identical Rstudio server it is impossible to debug. In the meantime, the latest update to the devel branch should fix the NAs issue, so please go ahead and install with `devtools::install_github("MarioniLab/miloR", ref="devel").

Thanks.

andrewjkwok commented 3 years ago

Thanks @MikeDMorgan - I just installed the latest version with devtools::install_github("MarioniLab/miloR", ref="devel") but am still getting the NAs for some reason it seems?

Here is the code I've been using:

message("Making SCE and Milo objects...")
sce <- as.SingleCellExperiment(s_obj)
sce_milo <- Milo(sce)
rm(sce)

#####
# build graph
#####

ndim <- ncol(Embeddings(s_obj, reduction = "TotalVI"))

message("Building graph for Milo.")
if (opt$custom_adjmat == TRUE) {
  message("Using custom adjacency matrix from Seurat SNN.")
  miloR::graph(sce_milo) <- miloR::graph(buildFromAdjacency(s_obj@graphs$RNA_snn, k=opt$nn, is.binary = FALSE))
} else if (opt$custom_adjmat == FALSE) {
  message("Buildling kNN graph from scratch w/ Milo.")
  sce_milo <- buildGraph(sce_milo, k=opt$nn, d = ndim, reduced.dim = "TOTALVI")
}

#####
# end of graph building
#####

if (is.null(sce_milo@.k)) {
  sce_milo@.k <- opt$nn
}

sce_milo <- makeNhoods(sce_milo, prop = opt$prop, k=opt$nn, d=ndim, refined=TRUE, reduced_dims="TOTALVI")
sce_milo <- countCells(sce_milo, meta.data = data.frame(colData(sce_milo)), sample="sample_id")

design_mat <- data.frame(colData(sce_milo))[,c("sample_id", "source", "batch", "age", "sex")]
design_mat <- distinct(design_mat)
rownames(design_mat) = design_mat$sample_id
design_mat$age <- as.numeric(design_mat$age)

message("Calculating neighbourhood distances...")
sce_milo <- calcNhoodDistance(sce_milo, d=ndim, reduced.dim="TOTALVI")

message("DA testing commencing...")
da_results_TMM <- testNhoods(sce_milo, design = ~ source, design.df = design_mat,
                         norm.method='TMM',
                         reduced.dim = 'TOTALVI')

da_results_TMM %>%
  arrange(-SpatialFDR) %>%
  head()

And when I use the custom adjacency matrix, from that I get:

        logFC   logCPM            F    PValue       FDR Nhood SpatialFDR
1 -2.84600455 7.931401 1.5378160065 0.2149456 0.5412378     1         NA
2  0.51103321 7.508991 0.0671836469 0.7954828 0.9225643     2         NA
3  0.03131574 7.805346 0.0002000211 0.9887160 0.9950202     3         NA
4 -0.59308597 7.694039 0.0970488286 0.7554009 0.9036454     4         NA
5  0.46955999 7.446743 0.0512238026 0.8209477 0.9332992     5         NA
6 -0.37194645 7.313815 0.0398217945 0.8418297 0.9411815     6         NA

Did I use install using the wrong branch?

MikeDMorgan commented 3 years ago

Hi @andrewjkwok Could you report the output of the following blocks of code please?

sum(is.na(reducedDim(sce_milo, "TOTALVI")))

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="k-distance", k=sce_milo@.k,
                            pvalues=dat_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                            distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALV"))

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="neighbour-distance", k=sce_milo@.k,
                            pvalues=dat_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                            distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALV"))

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="max", k=sce_milo@.k,
                            pvalues=dat_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                            distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALV"))

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="none", k=sce_milo@.k,
                            pvalues=dat_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                            distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALV"))
andrewjkwok commented 3 years ago

Hi @MikeDMorgan. The results are:

> sum(is.na(reducedDim(sce_milo, "TOTALVI")))
[1] 0

For the 4 commands, sequentially,:

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="k-distance", k=sce_milo@.k,
                pvalues=da_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALVI"))

gives all NAs.

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="neighbour-distance", k=sce_milo@.k,
                pvalues=da_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALVI"))

gives all numbers.

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="max", k=sce_milo@.k,
                pvalues=da_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALVI"))

Also gives all numbers.

And finally,

graphSpatialFDR(x.nhoods=nhoods(sce_milo), graph=miloR::graph(sce_milo), weighting="none", k=sce_milo@.k,
                pvalues=da_results_TMM$PValue, indices=nhoodIndex(sce_milo),
                distances=nhoodDistances(sce_milo), reduced.dimensions=reducedDim(sce_milo, "TOTALVI"))

gives all NAs again.

In fact I see now on reading more of the documentation https://rdrr.io/github/MarioniLab/miloR/man/testNhoods.html that the default for fdr.weighting is k-distance which may be why I'm getting the NAs. I can of course just use either neighbor-distance or max, but reading the manuscript, including the supplementary notes, it seems that k-distance should be the method of choice so is there a fix for it? I'm also wondering whether this is related to the fact that sce_milo@.k still isn't automatically set when I use a custom adjacency matrix and I manually set it?

MikeDMorgan commented 3 years ago

I'm also wondering whether this is related to the fact that sce_milo@.k still isn't automatically set when I use a custom adjacency matrix and I manually set it?

If you read the documentation for buidlFromAdjacency it returns a Milo object, which is where .k is set, so in your code above you only extract the graph directly from this new Milo object and assign it to a different Milo object. There is no reason for .k to be set on your original Milo object, and thus it is unrelated to this issue.

Could you share a minimally reproducible dataset so that I can try to reproduce this problem please?

andrewjkwok commented 3 years ago

My mistake regarding .k - thank you for the pointer. I've shared a minimally reproducible dataset, please let me know how it looks.

MikeDMorgan commented 3 years ago

Hi @andrewjkwok I have diagonsed the issue - if you inspect the cell x nhood matrix nhoods() you will see that some of the nhoods contain < k cells, e.g. sum(colSums(nhood(sce_milo)) < 20). This suggests an issue in the graph that is used as input, as a neighbourhood by definition should have a minimum of k nodes.

This can be corroborated by also inspecting the input graph from Seurat:

inferr.graph <- igraph::graph_from_adjacency_matrix((seurat.in@graphs$RNA_snn > 0)+0, mode="undirected", weighted=NULL, diag=TRUE)
sum(igraph::ego_size(inferr.graph, order=1, nodes=igraph::V(inferr.graph)) < 20)
[1] 26

A neighbourhood is defined as the number of nodes connected to any given node, which for a kNN graph has a minimum of k. Therefore, there is an issue with the adjacency matrix from Seurat. Consequently, when this is used as the input to Milo there will be nhoods with < k cells, and the graphSpatialFDR will try to find the distance to the kth NN of the index, which doesn't exist for some nhoods. Hence, the returned FDRs would not make sense as they are a weighting over all nhoods.

My advice would be either construct the graph using Milo, which will guarantee that there are at least k cells in each nhood, or contact the Seurat developers to enquire why the graph leads to < k neighbours.

andrewjkwok commented 3 years ago

Hi @MikeDMorgan - thank you very much for the diagnosis. My first thought would be that it might be because the Seurat graph prunes the nearest neighbours hence fewer than k neighbours, but as you say that will require me to go back to the Seurat developers.

Thanks for the information!