LKremer / scbs

Python package with CLI for the analysis of single cell methylation data. Now known as MethSCAn: https://github.com/anders-biostat/MethSCAn
https://anders-biostat.github.io/MethSCAn/
GNU General Public License v3.0
10 stars 5 forks source link

code for downstream analysis of the VMR matrix #21

Open JiayiLi21 opened 1 month ago

JiayiLi21 commented 1 month ago

Hi developers,

Hope you are doing great! I have tried the methscan workflow and got the VMR_matrix folder, but found it challenging to do imputation and clustering for large-scale data, just wondering would you like to share the script on the down stream processing and clustering based on the VMR methylation profiling, I am trying to do something similar to the Figure S4 in your manuscript: FigS4

Thanks!

LKremer commented 1 month ago

Hi @JiayiLi21 ,

thanks for your question. Just in case you missed it, we have a tutorial that shows an exemplary analysis of a small data set at this link.

The tutorial demonstrates how to use PCA with iterative imputation, but we don't show how to cluster this data or run UMAP on it. I guess we should add this in the future. For now here is a continuation of our tutorial data set that shows you how to cluster cells and get a UMAP based on VMR methylation:

load the data into R and run iterative PCA

This is just the code from our tutorial, copied here for convenience:

library(tidyverse)
library(irlba)

meth_mtx <- read.csv("VMR_matrix/mean_shrunken_residuals.csv.gz", row.names=1) %>%
  as.matrix()

# PCA that iteratively imputes missing values
prcomp_iterative <- function(x, n=10, n_iter=100, min_gain=0.01, ...) {
  mse <- rep(NA, n_iter)
  na_loc <- is.na(x)
  x[na_loc] = 0  # zero is our first guess

  for (i in 1:n_iter) {
    prev_imp <- x[na_loc]  # what we imputed in the previous round
    # PCA on the imputed matrix
    pr <- prcomp_irlba(x, center = F, scale. = F, n = n, ...)
    # impute missing values with PCA
    new_imp <- (pr$x %*% t(pr$rotation))[na_loc]
    x[na_loc] <- new_imp
    # compare our new imputed values to the ones from the previous round
    mse[i] = mean((prev_imp - new_imp) ^ 2)
    # if the values didn't change a lot, terminate the iteration
    gain <- mse[i] / max(mse, na.rm = T)
    if (gain < min_gain) {
      message(paste(c("\n\nTerminated after ", i, " iterations.")))
      break
    }
  }
  pr$mse_iter <- mse[1:i]
  pr
}

pca <- meth_mtx %>%
  scale(center = T, scale = F) %>%
  prcomp_iterative(n = 5, n_iter = 5)

rownames(pca$x) <- rownames(meth_mtx)

pca$x %>% 
  as_tibble() %>% 
  ggplot(aes(x = PC1, y = PC2)) +
  geom_point() +
  coord_fixed()

image

Run UMAP

In single-cell transcriptomics, it's common practice to use the principal components as input for UMAP. We do the same here.

library(uwot)  # R package for UMAP

umap_obj <- uwot::umap(pca$x, min_dist=0.05, n_neighbors=5, ret_nn=T)
umap_tbl <- as_tibble(umap_obj$embedding, rownames="cell_name") %>% 
  dplyr::rename(UMAP1=V1, UMAP2=V2)

umap_tbl %>% 
  ggplot(aes(x = UMAP1, y = UMAP2)) +
  geom_point() +
  coord_fixed()

image

This is our UMAP. For this very simple example data set, it's a little pointless because the three clusters were already very clearly separated on the first two principal components.

Run Leiden clustering

Just like UMAP, Leiden clustering needs a neighbor graph as input. The umap function automatically computes this neighbor graph for us, and we can obtain it from the UMAP object because we specified ret_nn=T. We can run Leiden clustering as follows:

library(igraph)  # R package for graph manipulation, also implements the Leiden algorithm

# get the edges of the neighbor graph from the UMAP object
neighbor_graph_edges <- 
  tibble(from = rep(1:nrow(umap_obj$nn$euclidean$idx),
         times = ncol(umap_obj$nn$euclidean$idx)),
         to = as.vector(umap_obj$nn$euclidean$idx),
         weight = as.vector(umap_obj$nn$euclidean$dist)) %>%
  filter(from != to) %>%
  mutate(from = rownames(umap_obj$embedding)[from],
         to = rownames(umap_obj$embedding)[to])

# run Leiden clustering
clust_obj <- neighbor_graph_edges %>%
  igraph::graph_from_data_frame(directed=F) %>% 
  igraph::cluster_leiden(resolution_parameter = .5)  # adjust the resolution parameter to your needs

# put the clustering results into a data frame (tibble) for plotting
clust_tbl <- tibble(
  leiden = as.character(clust_obj$membership),
  cell_name = clust_obj$names
)

umap_tbl %>% 
  left_join(clust_tbl, by="cell_name") %>% 
  ggplot(aes(x = UMAP1, y = UMAP2, color = leiden)) +
  geom_point()

image

Considerations for real data sets

The tutorial data set is pretty small and its easy to distinguish the three clusters. For real data sets, you might need to adjust some parameters:

  1. The imputation of iterative PCA gets better if you allow for more iterations. Of course this takes more time, but it can also improve results. Try increasing the maximum allowed number of iterations and decreasing the convergence threshold. For a real data set, you usually also want more than 5 PCs. So a more realistic call would be e.g.

    pca <- meth_mtx %>%
    scale(center = T, scale = F) %>%
    prcomp_iterative(n = 15, n_iter=40, min_gain=0.005)

    (This can take quite long to compute, feel free to lower n_iter if it takes too long. Or just set n_iter=100, min_gain=0.001 and let it iterate over lunch break...)

  2. Just like in single-cell transcriptomics, you need to tweak your UMAP hyperparameters. n_neighbors=5 makes sense for our tiny data set, but since you probably have way more cells you have to increase this parameter accordingly.

  3. The built-in R function read.csv is annoyingly slow. I now use fread from the data.table package instead:

    meth_mtx <- "VMR_matrix/mean_shrunken_residuals.csv.gz" %>% 
    data.table::fread(sep=",") %>%
    as.matrix(rownames=1)

Please let me know if this helps, I might add it to the tutorial on the website. Lukas