davisidarta / dbMAP

A fast, accurate, and modularized dimensionality-reduction approach based on diffusion harmonics and graph layouts. Scales to millions of samples on a personal laptop. Adds high-dimensional big-data intrinsic structure to your clustering and data visualization workflow.
https://dbmap.readthedocs.io/en/latest/
GNU General Public License v2.0

dbmap in reticulate #3

Closed. connorhknight closed this issue 3 years ago.

connorhknight commented 3 years ago

Hi!

Is it possible to use this through reticulate?

The fast approximate nearest neighbour search works well. However, the last part of the code, the fast adaptive multiscaled diffusion maps, does not:

ind, dist, grad, graph = diff.ind_dist_grad(data)

which in R I called as:

diff$ind_dist_grad(data)

However, this call raises the following error:

Error in py_call_impl(callable, dots$args, dots$keywords) : RuntimeError: 2021-01-04 17:55:21 spacefactory.h:50 (CreateSpace) It looks like the space cosine is not defined for the distance type : FLOAT

Detailed traceback:
  File "/Users/knight05/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/dbmap/diffusion.py", line 406, in ind_dist_grad
    ).fit(mms)
  File "/Users/knight05/Library/r-miniconda/envs/r-reticulate/lib/python3.6/site-packages/dbmap/ann.py", line 145, in fit
    data_type=nmslib.DataType.DENSE_VECTOR)

Are you able to indicate why this could be happening?

davisidarta commented 3 years ago

Hi @connorhknight! Thank you for trying out dbMAP!

It's completely possible to use dbMAP in R through reticulate, and I'll add a tutorial on it. The tricky step here is how reticulate passes information from R to Python. The Python API often returns multiple output objects as a tuple, so maybe you could try creating an empty tuple to start with. However, I'd also have to ask which data format you're using (data frames? sparse matrices?). This might be what's causing the issue, since your detailed traceback goes back to the NMSLib API bundled within dbMAP, and NMSLib needs some predefinitions of the ANN search space. Could you provide a small replication dataset?
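
For instance, something along these lines should work (a minimal sketch; 'mat' is a placeholder for your cells x features dgCMatrix, and I'm leaving the other Diffusor parameters at their defaults):

library(reticulate)
library(Matrix)

dbmap <- import('dbmap')

# reticulate converts a dgCMatrix into a scipy sparse matrix;
# cast it to CSR format, which is what NMSLib expects
py_mat <- r_to_py(mat)$tocsr()

diff <- dbmap$diffusion$Diffusor(ann_dist = as.character('cosine'))
res <- diff$ind_dist_grad(py_mat)   # the python tuple converts to an R list
ind <- res[[1]]; dist <- res[[2]]; grad <- res[[3]]; graph <- res[[4]]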

While I haven't set up an example for the approximate neighbors' gradient yet, I would suggest you explore the structure components resulting from the multiscale diffusion harmonics and how they correlate to your features. These structure components largely define the spectrum of the manifold itself, on a Laplace-Beltrami operator basis, and can be seen as a gradient over your whole data. The gradient of this operator is a more complicated matter.
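
As a rough sketch of what I mean (both 'sc', the cells x components matrix, and 'expr', a cells x genes expression matrix, are placeholder names here):

# Spearman correlation between each feature and each structure component
cors <- cor(as.matrix(expr), as.matrix(sc), method = 'spearman')

# features most associated with the first structure component
head(sort(cors[, 1], decreasing = TRUE))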

Maybe you're doing single-cell analysis in R; perhaps you're using Seurat? I'd be happy to add a Seurat example with reticulate to the docs.

connorhknight commented 3 years ago

Hi!

Thank you for your prompt response!

I started by converting a dense matrix from R to Python using reticulate::r_to_py(matrix), then converting it to a sparse matrix with scipy.sparse$csr_matrix(). One thing to note is that I normalised (sctransform), scaled, and centred the matrix beforehand; could this be the issue? I was also wondering whether PCA embeddings can be applied instead. A Seurat example would be fantastic and would make it much more accessible! I personally use my own framework, but I would be happy to use it via Seurat.
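
For reference, this is roughly what I ran ('mat' here is a placeholder for my processed dense matrix):

library(reticulate)
scipy <- import('scipy')

py_mat <- r_to_py(mat)                         # dense R matrix -> numpy array
sparse_mat <- scipy$sparse$csr_matrix(py_mat)  # numpy array -> scipy CSR matrix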

I have uploaded a raw counts matrix of a bone marrow sample containing just over 1,000 cells: example_sample.csv.zip

I was also wondering: in the case of batch correction across multiple samples, where the algorithm returns a reduced-dimension matrix, is it okay to apply dbMAP to that output as well?

Thanks!

Connor

davisidarta commented 3 years ago

Hi @connorhknight!

> A Seurat example would be fantastic and would make it much more accessible! I personally use my own framework, but I would be happy to use it via Seurat.

I'll work on adding this tutorial to the documentation. For now, please take a look at the example at the end of this answer. What we do first is load some libraries and then import dbmap and a few other Python modules with reticulate. Next, we extract the data matrix from the Seurat object and convert it to Python in a sparse format.

> One thing to note is that I normalised (sctransform), scaled, and centred the matrix beforehand; could this be the issue?

Now, this is controversial: there is no consensus within the single-cell community on whether single-cell RNA-seq data should be scaled or not, nor on what the best way to normalize it is. By default, counts are usually log- or sqrt-normalized and then scaled (centered to zero mean and unit variance) out of necessity for computing PCA. You can use SCTransform-normalized data (it works quite neatly in my experience), but I advise against using scaled data for the diffusion procedure, which, unlike PCA, works on unscaled raw or normalized data. I've tested 10X's large PBMC datasets (33k, 66k cells), and they render quite similarly neat results regardless of whether dbMAP was computed on raw, log-normalized, or SCTransform data, but look quite freaky when given scaled data.
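
In Seurat terms, that means feeding the diffusion step the normalized data slot rather than scale.data. A minimal sketch, assuming an SCTransform-processed object named 'dat':

dat <- SCTransform(dat)

# cells x highly-variable genes, normalized but NOT scaled
X <- t(dat@assays$SCT@data[VariableFeatures(dat), ])

# avoid the scaled slot for the diffusion step:
# X <- t(dat@assays$SCT@scale.data[VariableFeatures(dat), ])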

> I was also wondering whether PCA embeddings can be applied instead.

They can, but it is not advised. This is because your data may present multiple collinearities that PCA will overlook as it searches for maximum-variance hyperplanes, a process that can yield a quite distorted projection if your data is highly non-linear (i.e. you can't get a structured sense of it by visualizing 2 or 3 PCs; the diffusion components are non-linearly related across different cells). You can get an idea of whether your data is highly non-linear by running the default analyses (PCA, PCA+UMAP or t-SNE) and comparing them to the diffusion and dbMAP results.
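
A quick way to run that comparison in Seurat (a sketch; the parameter values are just defaults, not recommendations):

dat <- RunPCA(dat)
dat <- RunUMAP(dat, reduction = 'pca', dims = 1:30, reduction.name = 'umap_pca')

# after running dbMAP (example below), compare the embeddings side by side:
# DimPlot(dat, reduction = 'umap_pca')
# DimPlot(dat, reduction = 'dbmap')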

> I was also wondering: in the case of batch correction across multiple samples, where the algorithm returns a reduced-dimension matrix, is it okay to apply dbMAP to that output as well?

Yes, it is! However, a cool idea that could improve your batch-correction results is using the diffusion coordinates instead of the PC coordinates in each object (e.g. with Harmony, Scanorama, BBKNN). What most of these algorithms basically do is build a nearest-neighbors graph on top of the PCA coordinates and try to minimize the divergence between batches.
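
For instance, with Harmony's Seurat wrapper, something along these lines should work (a sketch, assuming a 'batch' column in your meta-data and the 'sc' diffusion reduction added in the example below; check your harmony version for the exact argument names):

library(harmony)

# run Harmony on the diffusion components instead of the PCs
dat <- RunHarmony(dat, group.by.vars = 'batch', reduction = 'sc',
                  reduction.save = 'harmony_sc')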

A key hyperparameter in your case is the number of diffusion components to compute. Ideally, you should find at least one negative eigenvalue (I've included a plot in the example below). This number can vary wildly depending on your number of samples and the intrinsic geometry of your data, as well as on whether you gave it expression data or an already-reduced representation (PCA, integration embeddings, etc.). The algorithm tries to find a nice initial guess by itself before it starts optimizing, but I suggest playing with this a bit (e.g. 50, 100, 200), as it can affect the 'resolution' of the embedding (more components = more detail, less global structure).

The example follows below. Please let me know if you have any further questions! I'll add this to the docs as soon as possible. I've also included a neat 3D plotting snippet at the end; hope you enjoy it.

###############################################################################
# Run dbMAP in R with reticulate
###############################################################################
library(reticulate)
library(Seurat)
library(plotly)

np <- reticulate::import("numpy")
pd <- reticulate::import("pandas")
sp <- reticulate::import("scipy")
dbmap <- reticulate::import('dbmap')

# 'dat' is your Seurat object
dat <- YourSeuratObject

# Cells x highly-variable genes matrix (kept sparse)
data <- t(dat@assays$AssayYouWouldLikeToUse@data[VariableFeatures(dat), ])
a <- r_to_py(data)   # dgCMatrix -> scipy sparse matrix
b <- a$tocsr()       # cast to CSR format, which NMSLib expects
diff <- dbmap$diffusion$Diffusor(n_components = as.integer(100), n_neighbors = as.integer(15),
                                 transitions = as.logical(F),
                                 norm = as.logical(F), ann_dist = as.character('cosine'),
                                 n_jobs = as.integer(10), kernel_use = as.character('simple')
)
diff <- diff$fit(b)
sc <- as.matrix(diff$transform(b))   # cells x diffusion components
res <- diff$return_dict()

evals <- py_to_r(res$EigenValues)
plot(evals) #Visualize meaningful diffusion components (with positive eigenvalues).
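
# Optionally, keep only the components with positive eigenvalues
# (an assumption to adjust after inspecting the plot above):
# n_sc <- max(which(evals > 0))
# sc <- sc[, 1:n_sc]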

rownames(sc) <- colnames(dat)
colnames(sc) <- paste0('SC_', 1:ncol(sc))   # name the structure components SC_1, SC_2, ...
names(evals) <- colnames(sc)

# Add the new dimensionality reduction, using the PCA slot as a template.
dat@reductions$sc <- dat@reductions$pca
dat@reductions$sc@cell.embeddings <- as.matrix(sc)
dat@reductions$sc@feature.loadings <- matrix(data = c(0,0))   # placeholder; the diffusion basis has no gene loadings
dat@reductions$sc@key <- 'SC_'

##################################################################################
# Post-diffusion processing (clustering, UMAP)
##################################################################################

dat <- FindNeighbors(dat, reduction = 'sc', dims = 1:ncol(dat@reductions$sc@cell.embeddings), annoy.metric = 'cosine', graph.name = 'dbgraph')
dat <- FindClusters(dat, resolution = 1, graph.name = 'dbgraph', algorithm = 2)

dat <- RunUMAP(dat, reduction = 'sc', dims = 1:ncol(dat@reductions$sc@cell.embeddings), n.neighbors = 15, init = 'spectral',
               min.dist = 0.6, spread = 1.5, learning.rate = 1.5, n.epochs = 500, reduction.key = 'dbMAP_', reduction.name = 'dbmap')

dat <- RunUMAP(dat, reduction = 'sc', n.components = 3, dims = 1:ncol(dat@reductions$sc@cell.embeddings),
               min.dist = 0.8, spread = 1.2, n.epochs = 500, reduction.key = 'dbMAP3D_', reduction.name = 'dbmap3d')

# Fetch the 3D embedding plus genes or meta-data variables to color cells by
plot.data <- FetchData(object = dat, vars = c("dbMAP3D_1", "dbMAP3D_2", "dbMAP3D_3",
                                              "seurat_clusters"))
plot.data$label <- dat$seurat_clusters
plot_ly(data = plot.data,
        x = ~dbMAP3D_1, y = ~dbMAP3D_2, z = ~dbMAP3D_3,
        color = ~seurat_clusters,
        type = "scatter3d",
        mode = "markers",
        marker = list(size = 3, width = 0.5),
        text = ~seurat_clusters,
        hoverinfo = "text") %>%
  layout(plot_bgcolor = 'black')   # plot_bgcolor belongs in layout(), not plot_ly()

davisidarta commented 3 years ago

I'm closing this issue as this seems to be resolved. Please feel free to re-open it if needed.