bnprks / BPCells

Scaling Single Cell Analysis to Millions of Cells
https://bnprks.github.io/BPCells

[r] Adjust default parameters for `cluster_graph_leiden()` #147

Closed bnprks closed 3 weeks ago

bnprks commented 4 weeks ago

Currently, `cluster_graph_leiden()` by default outputs a number of clusters that scales approximately linearly with the number of cells when the resolution parameter is held constant. This is generally undesirable and leads to problems like this, where people get thousands of clusters called on large datasets.

This pull request does the following:

Here is the benchmarking data to justify this change. Note that Leiden modularity with resolution = 1 gives consistent cluster counts across dataset sizes, just like Louvain, but Leiden CPM will give out a ton of clusters for large datasets unless the resolution parameter is adjusted down.

resolution_plot

cluster-resolution.csv

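Click for plotting code

```r
library(dplyr)
library(ggplot2)

# Assumption: `data` is the attached cluster-resolution.csv, with columns
# alg, resolution, cells, and clusts.
data <- read.csv("cluster-resolution.csv")

data |>
  # Readable algorithm labels; resolution as an ordered factor for the color scale
  mutate(
    alg = case_match(
      alg,
      "leiden" ~ "Leiden CPM",
      "leiden-modularity" ~ "Leiden Modularity",
      "louvain" ~ "Louvain"
    ),
    resolution = factor(as.numeric(resolution), sort(unique(as.numeric(resolution))))
  ) |>
  ggplot(aes(cells, clusts, color = resolution)) +
  geom_line() +
  geom_point() +
  # Log-log axes so that linear scaling shows up as a straight line
  scale_x_continuous(
    transform = "log10",
    guide = guide_axis_logticks(),
    labels = scales::label_log(),
    breaks = c(1e5, 1e6)
  ) +
  scale_y_continuous(transform = "log10", guide = guide_axis_logticks()) +
  scale_color_manual(values = RColorBrewer::brewer.pal(9, "BuPu")[3:9]) +
  facet_wrap("alg") +
  theme_bw() +
  coord_fixed() +
  labs(
    title = "Cluster counts by resolution",
    y = "Cluster count",
    x = "Dataset size (cells)"
  )
```
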
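For readers who want to poke at the CPM versus modularity behavior without the benchmark data, here is a minimal sketch that calls `igraph::cluster_leiden()` directly rather than going through `cluster_graph_leiden()`. Everything in it is an assumption made for illustration: the helper `count_leiden_clusters()` is hypothetical, the synthetic graph from `sample_islands()` (10 planted groups with mean within-group degree near 10) is only a rough stand-in for a kNN graph, and the resolution values are arbitrary. The intuition it is meant to probe is that the CPM resolution acts roughly as a density threshold, and the density of a fixed-degree graph falls as the number of cells grows. Results are random and are not the benchmark numbers above.

```r
library(igraph)

# Hypothetical helper (not part of BPCells): count Leiden clusters on a
# synthetic graph of `n_islands` planted groups whose mean within-group
# degree stays near `k` as the groups grow.
count_leiden_clusters <- function(island_size, objective, resolution,
                                  n_islands = 10, k = 10) {
  g <- sample_islands(
    islands.n = n_islands,
    islands.size = island_size,
    islands.pin = k / island_size,  # edge density falls as group size grows
    n.inter = 2                     # a few edges between each pair of groups
  )
  cl <- cluster_leiden(
    g,
    objective_function = objective,
    resolution_parameter = resolution,  # newer igraph also calls this `resolution`
    n_iterations = 5
  )
  length(unique(as.integer(membership(cl))))
}

set.seed(1)
sizes <- c(50, 100, 200, 400)  # vertices per planted group
counts <- data.frame(
  cells = sizes * 10,
  cpm = sapply(sizes, count_leiden_clusters,
               objective = "CPM", resolution = 0.05),
  modularity = sapply(sizes, count_leiden_clusters,
                      objective = "modularity", resolution = 1)
)
print(counts)
```

Comparing the `cpm` and `modularity` columns across rows gives a quick, informal look at how each objective responds to graph size at a fixed resolution.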
immanuelazn commented 3 weeks ago

A beautiful graph and test! The changes all look agreeable. Do you have any opinions on how Seurat implements Leiden clustering? They default to `RBConfigurationVertexPartition` rather than modularity. This isn't directly available through igraph, but it is through the `leidenalg` Python package. Also, I wonder why CPM is the default for igraph.

bnprks commented 3 weeks ago

From the docs, it looks like `RBConfigurationVertexPartition` is basically the same objective function as modularity, just with some constant scaling, so I believe this is consistent with Seurat's approach.
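
As a sketch of that relationship (the notation here is an assumption chosen for illustration: $A_{ij}$ is the edge weight between nodes $i$ and $j$, $k_i$ is the weighted degree of node $i$, $m$ is the total edge weight, $\gamma$ is the resolution, and $\sigma_i$ is the cluster of node $i$), the two quality functions as they are commonly written differ only by the factor $1/(2m)$:

```latex
Q_{\mathrm{RB}} = \sum_{ij} \left( A_{ij} - \gamma \frac{k_i k_j}{2m} \right) \delta(\sigma_i, \sigma_j),
\qquad
Q_{\mathrm{mod}} = \frac{1}{2m} \sum_{ij} \left( A_{ij} - \gamma \frac{k_i k_j}{2m} \right) \delta(\sigma_i, \sigma_j)
= \frac{Q_{\mathrm{RB}}}{2m}
```

Because $2m$ is a constant for a given graph, both objectives are maximized by the same partition at the same $\gamma$.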