MarioniLab / miloR

R package implementation of Milo for testing for differential abundance in KNN graphs
https://bioconductor.org/packages/release/bioc/html/miloR.html
GNU General Public License v3.0

Importing Seurat's WNN #157

Closed mdmanurung closed 3 years ago

mdmanurung commented 3 years ago

Dear authors,

Thank you for writing this package; I really enjoyed it.

Currently, I am trying to import Seurat's WNN results into Milo. I could import the pre-computed graph; however, I am unable to figure out the makeNhoods() step. IIUC, the function refines the neighborhoods in reduced-dimensional space. As I can only pick one reduced dimension--either the PCA on RNA or on ADT--this becomes a bottleneck. Do you have any suggestions?

Thanks in advance.

Regards, Mikhael

emdann commented 3 years ago

Hi @MikhaelManurung Thank you for your interest in Milo! Have you tried adding the graph to the Milo object using the buildFromAdjacency function? This is an example with one of Seurat's toy datasets:

library(Seurat)
library(miloR)
data("pbmc_small")
pbmc_small <- FindNeighbors(pbmc_small)
pbmc_small_sce <- as.SingleCellExperiment(pbmc_small)
pbmc_small_milo <- Milo(pbmc_small_sce)
miloR::graph(pbmc_small_milo) <- miloR::graph(buildFromAdjacency(pbmc_small@graphs$RNA_snn, k=10))
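For a WNN object specifically, the same pattern should work with the weighted graph. A sketch, assuming `seu` is a Seurat object where FindMultiModalNeighbors() has stored the weighted graph as "wsnn" (check names(seu@graphs) and adjust the graph name and k to your object):

```r
library(Seurat)
library(miloR)

# Sketch: graph name "wsnn" and k = 20 are assumptions, not miloR defaults
seu_sce  <- as.SingleCellExperiment(seu)
seu_milo <- Milo(seu_sce)
miloR::graph(seu_milo) <- miloR::graph(buildFromAdjacency(seu@graphs$wsnn, k = 20))
```
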

mdmanurung commented 3 years ago

Hi @emdann,

Thanks for your fast response. Actually, those are the steps that I have managed to do so far. I am stuck at the makeNhoods() step since I don't know which reduced dim that I should use.

Regards, Mikhael

emdann commented 3 years ago

Oh I misunderstood the query, apologies for that.

This is an interesting question for application to multi-modal datasets. We recognise that the algorithm we use to search for the neighbourhood indices limits its application to "graph-based" integration methods such as WNN. In practice, using only one of the two modalities might work just fine to exclude outliers, especially if you select the modality with the highest average weight on the embedding (the RNA in this case?).

To test this, I would recommend comparing (A) neighbourhoods constructed by running makeNhoods on the PCA of the RNA with the option refined=TRUE and (B) neighbourhoods constructed by running makeNhoods with the option refined=FALSE, which just selects random cells as neighbourhood indices (you can use whichever dimensionality reduction, as it is not used). The refinement algorithm works to reduce the total number of neighbourhoods and to obtain a tighter distribution of neighbourhood sizes (see Suppl. Fig. 1 in our preprint). If the size and number of neighbourhoods obtained by refining on the PCA of the RNA are not that different from the random selection, then you might have to use another dimensionality reduction.
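A sketch of that comparison, assuming `milo` is a Milo object with the RNA-derived PCA stored under the (illustrative) name "PCA.RNA", and illustrative values of prop, k, and d:

```r
library(miloR)

# (A) refined sampling on the RNA-derived PCA
milo_a <- makeNhoods(milo, prop = 0.1, k = 30, d = 30,
                     refined = TRUE, reduced_dims = "PCA.RNA")
# (B) random sampling; the reduced dimensions are not used here
milo_b <- makeNhoods(milo, prop = 0.1, k = 30, d = 30, refined = FALSE)

# Compare the number of neighbourhoods and their size distributions
ncol(nhoods(milo_a)); ncol(nhoods(milo_b))
summary(colSums(nhoods(milo_a)))
summary(colSums(nhoods(milo_b)))
```
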

Alternatively, you could consider using a "latent space-based" multi-modal integration method such as MOFA+ for the differential abundance analysis.

Hope this helps, I'd be interested to hear what works best for you in this scenario.

mdmanurung commented 3 years ago

Hi Emma,

Thanks for the suggestions. I will try out these options and get back to you.

Regards, Mikhael

MikeDMorgan commented 3 years ago

Hi @MikhaelManurung Did Emma's solution resolve your problem? I'll close this issue for now, but please re-open if there is more trouble-shooting required.

diegoalexespi commented 3 years ago

Hi all,

Not sure if this is the appropriate place to follow up on this, but in addition to makeNhoods(), does calcNhoodDistance() require a reduced dimension space as well? So even if I set makeNhoods(refined=FALSE) (or attempt an alternative neighborhood refinement strategy that doesn't rely on reduced dimension values), the calcNhoodDistance() step would still be a bottleneck?

emdann commented 3 years ago

Hi @diegoalexespi, That is correct, calcNhoodDistance() uses the reduced dimension space to calculate the distance from the nhood index cell to its k-th nearest neighbor. However, in an initial implementation of Milo we tested other graph-based connectivity metrics for the weighted FDR procedure, which you could store in the nhoodDistance() slot to circumvent the use of calcNhoodDistance(). We dropped these metrics mostly because they were less efficient to compute than the distance to the k-th neighbor, but the SpatialFDR correction results were comparable. @MikeDMorgan might be able to give more detailed info on this.

MikeDMorgan commented 3 years ago

Hi @diegoalexespi In terms of time-complexity using either edge- or vertex-weighting was still much, much slower than computing the k-th neighbour distance. At the moment we don't have any graph-distance weighting implemented (these were removed a while ago), but it is something on our development roadmap for Milo.

diegoalexespi commented 3 years ago

Thanks @emdann and @MikeDMorgan for the prompt and informative responses! I appreciate it. I'll keep an eye out on development, and I'll do some digging into graph connectivity measures in the meantime as well.

diegoalexespi commented 3 years ago

Hi! So I was intrigued by the question of using this method for graphs not directly derived from a single distance matrix, such as Seurat's WNN and BBKNN. More specifically, for makeNhoods, I wanted to devise a graph-refinement scheme that depends solely on the kNN graph and NOT on any distance metric or measure.

As a possible solution, I chose to refine the random vertex selection in makeNhoods in the following way:

  1. For each random vertex V, obtain an induced subgraph of V and its neighbors.
  2. Assuming the kNN is directed, select the vertex V' within this subgraph with the most in-edges.

I am assuming that, in an induced subgraph from a kNN graph, the vertex with the most in-edges is similar to what would be found from Milo's "refined" algorithm.
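The two steps above could be sketched in base R over a binary adjacency matrix (the function name and representation are illustrative, not miloR internals):

```r
# adj: square 0/1 matrix where adj[i, j] == 1 means a directed kNN edge i -> j
# v:   index of the randomly sampled vertex
refine_by_indegree <- function(adj, v) {
  nbrs <- which(adj[v, ] == 1)                   # out-neighbours of v
  sub  <- c(v, nbrs)                             # vertex set of the induced subgraph
  indeg <- colSums(adj[sub, sub, drop = FALSE])  # in-edges counted within the subgraph
  sub[which.max(indeg)]                          # vertex with the most in-edges
}
```
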

I tried replicating the algorithm evaluations in Supplementary Fig. 1A-B and obtained the following (the above method is distance-independent, denoted as 'independent'). I think the results are comparable. Happy to hear thoughts, discuss more, and/or initiate a pull request if such a feature would be beneficial.

Thanks!

[image: comparison of neighbourhood refinement schemes, replicating Suppl. Fig. 1A-B]

emdann commented 3 years ago

Hi @diegoalexespi

Wow, thank you for contributing this great work, it looks really promising! This has definitely been a frequently requested feature, so I think it would be a useful addition to the package. I have a few comments that I am happy to look into myself if you start a pull request:

  1. Testing on a dataset where a latent space is available, I would try to assess whether the two refinement schemes converge on a similar set of cells (measuring the amount of overlap/distance between selected indices), and I'd look into which cells are the most different. In particular, I would verify that the overlap/distance remains stable when increasing k.
  2. Have you landed on a metric to use for the SpatialFDR correction?
  3. On BBKNN graphs I would verify that the "coverage" of the neighbourhoods with this procedure doesn't exclude some dataset-specific cells (esp. when using the "trimming" option)

If this flies, we will probably need to add a vignette for "graph-only" neighbourhood analysis.

diegoalexespi commented 3 years ago

Thanks @emdann !

I'll do some organizing of my code before implementing a pull request (I have to prune a bunch of my commented out code and my print("testing") statements...), but I'll try to do that sooner than later. In the meantime, for some of the comments:

  1. I agree that determining the amount of overlap between the two refinement schemes is important. Is the goal to get "good" overlap between the nhood index identities themselves? Or is it to get "close" from a graph proximity standpoint (i.e. latent space distance or geodesic-type distance)? Another option I initially explored for this question was determining Jaccard overlap between the sets of nhood indices of different initializations of each scheme (within scheme stability) and also determining the Jaccard overlap between the sets of nhood indices of identical initializations of the two schemes (cross-scheme stability).
  2. I believe for the SpatialFDR correction, the latent space is needed to determine the distance of the furthest neighbor from a given nhood index, correct? As a possible alternative, a distance-independent proxy could be: (1) calculating the Jaccard similarities between the nhood index and its neighbors on the full graph and then (2) using the minimum Jaccard similarity among its neighbors. I am not sure how well Jaccard similarities negatively correlate with latent space distance, though.
  3. Could you clarify what you mean by the coverage of the neighborhoods? And by "trimming" do you mean the "refined" options?
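
For instance, the within-/cross-scheme stability checks in point 1 could use a simple set-level Jaccard overlap between the vectors of nhood index IDs (an illustrative helper, not part of miloR):

```r
# Jaccard overlap between two sets of nhood index IDs:
# |intersection| / |union|, ranging from 0 (disjoint) to 1 (identical)
set_jaccard <- function(a, b) {
  length(intersect(a, b)) / length(union(a, b))
}
```
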

Thanks!

MikeDMorgan commented 3 years ago

Re-opening this so we can keep tabs on things.

emdann commented 3 years ago

Hi @diegoalexespi

  1. I would check that the indices picked with the two refinement schemes are close in terms of geodesic distance on the graph. The Jaccard overlap would also be useful, but it's not that crucial that the indices are exactly the same (esp. in large datasets). In addition, I would also check that refining reduces the number of indices in the same "high density" regions of the graph.

I believe for the SpatialFDR correction, the latent space is needed to determine the distance of the furthest neighbor from a given nhood index, correct?

That is correct. We are looking into comparing neighbourhood overlap to the k-th distance for SpatialFDR.

  3. Sorry for using confusing terms: by "coverage" I simply mean where the indices lie. When using the trim option implemented in bbknn.bbknn, only the neighbors with the top connectivities are retained (so the number of neighbors for each cell is not necessarily equal to k). This is frequently used to make sure the algorithm doesn't force similarity to a very distant set of cells present in a single batch. What I wonder is whether these dataset-specific clusters might be treated as outliers by the refinement scheme, hence not sampled, hence ignored in the DA analysis.

diegoalexespi commented 3 years ago

Added some comparisons in the pull request #166, but question about the following:

I believe for the SpatialFDR correction, the latent space is needed to determine the distance of the furthest neighbor from a given nhood index, correct?

That is correct. We are looking into comparing neighbourhood overlap to the k-th distance for SpatialFDR.

Is there a specific way that you are evaluating whether the neighborhood overlap (or other measure) compares "well" to the k-th distance for SpatialFDR? I'd like to try evaluating some measures as well, but unsure how to decide whether a measure is a good replacement for the k-th distance at the moment. Thanks!

MikeDMorgan commented 3 years ago

Closing for now as the discussion has moved to the PR #166