[r] Adjust default parameters for `cluster_graph_leiden()`

bnprks commented 4 weeks ago

Currently, `cluster_graph_leiden()` by default outputs a number of clusters that scales approximately linearly with the number of cells if the resolution parameter is held constant. This is generally undesirable and leads to problems where people get thousands of clusters called on large datasets.

This pull request does the following:

- Changes the default `objective_function` from CPM to modularity and sets the default resolution back to 1.
- Moves clustering-related tests to a new file and adds a basic test to confirm that the clustering functions at least don't crash. I don't know of a good way of validating that the clustering is working, so not crashing seems good enough for now.

Here is the benchmarking data to justify this change. Note that Leiden modularity with resolution = 1 gives consistent cluster sizes just like Louvain, but Leiden CPM produces a very large number of clusters on large datasets unless the resolution parameter is adjusted down as the dataset grows.

[Figure: resolution_plot]

cluster-resolution.csv

Plotting code:

```r
library(dplyr)
library(ggplot2)

# `data` is assumed to hold the benchmark results (columns: alg, cells,
# clusts, resolution), presumably read from the attached cluster-resolution.csv.
data |>
  mutate(
    # Human-readable panel labels for each algorithm
    alg = case_match(
      alg,
      "leiden" ~ "Leiden CPM",
      "leiden-modularity" ~ "Leiden Modularity",
      "louvain" ~ "Louvain"
    ),
    # Order resolution levels numerically so the color scale is monotonic
    resolution = factor(as.numeric(resolution), sort(unique(as.numeric(resolution))))
  ) |>
  ggplot(aes(cells, clusts, color = resolution)) +
  geom_line() +
  geom_point() +
  scale_x_continuous(
    transform = "log10",
    guide = guide_axis_logticks(),
    labels = scales::label_log(),
    breaks = c(1e5, 1e6)
  ) +
  scale_y_continuous(transform = "log10", guide = guide_axis_logticks()) +
  scale_color_manual(values = RColorBrewer::brewer.pal(9, "BuPu")[3:9]) +
  facet_wrap("alg") +
  theme_bw() +
  coord_fixed() +
  labs(
    title = "Cluster counts by resolution",
    y = "Cluster count",
    x = "Dataset size (cells)"
  )
```
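One way to see why CPM cluster counts grow roughly linearly with dataset size at a fixed resolution is a back-of-envelope model. The sketch below is not BPCells code and rests on simplifying assumptions: an unweighted kNN graph where each cell has about `k` edges, all inside its own cluster, and CPM treated as a density threshold (a community is kept only if its internal edge density is at least the resolution). The helper names `knn_cluster_density()` and `cpm_max_cluster_size()` and the values `k = 10`, `resolution = 0.01` are hypothetical, chosen only for illustration.

```r
# Simplified model (an assumption, not the BPCells implementation):
# a cluster of n cells in a kNN graph has about n * k / 2 internal edges.

# Internal edge density of such a cluster: edges divided by possible pairs.
# This simplifies to k / (n - 1), so density falls as the cluster grows.
knn_cluster_density <- function(n, k) {
  (n * k / 2) / choose(n, 2)
}

# Under the density-threshold reading of CPM, a cluster survives only if
# k / (n - 1) >= resolution, which caps the cluster size at k / resolution + 1.
cpm_max_cluster_size <- function(k, resolution) {
  floor(k / resolution) + 1
}

cells <- c(1e4, 1e5, 1e6)
max_size <- cpm_max_cluster_size(k = 10, resolution = 0.01)

# The size cap does not depend on the number of cells, so the minimum
# number of clusters grows in proportion to the dataset size.
cpm_bound <- data.frame(
  cells = cells,
  max_cluster_size = max_size,
  min_cluster_count = ceiling(cells / max_size)
)
cpm_bound
```

In this model the size cap depends only on `k` and the resolution, which is consistent with the benchmark above: holding resolution fixed forces more clusters as cells are added, and the only way to keep the count stable is to lower the resolution as the dataset grows.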

immanuelazn commented 3 weeks ago

A beautiful graph and test! The changes all look agreeable. Do you have any opinions on Seurat and how they implement Leiden clustering? They default to RBConfigurationVertexPartition rather than modularity. This isn't directly available through igraph, but it is through the leiden Python package. Also, I wonder why CPM is the default for igraph.

bnprks commented 3 weeks ago

From the docs, it looks like RBConfigurationVertexPartition is basically the same objective function as modularity, just with some constant scaling, so I believe this change is consistent with Seurat's approach.
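The constant-scaling point can be checked numerically on a toy graph. The sketch below uses base R only and assumes the usual textbook definitions: the RB configuration quality is the sum over same-cluster node pairs of `A_ij - gamma * k_i * k_j / (2m)`, and modularity is the per-cluster internal edge fraction minus the squared degree fraction. These formulas are written from my reading of the docs rather than taken from either library's source, and the function names `rb_configuration()` and `modularity_score()` are hypothetical.

```r
# Toy graph: two triangles (1-2-3 and 4-5-6) joined by the edge 3-4.
edges <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5), c(4, 6), c(5, 6), c(3, 4))
A <- matrix(0, 6, 6)
A[edges] <- 1
A[edges[, 2:1]] <- 1   # make the adjacency matrix symmetric
k <- rowSums(A)        # node degrees
m <- sum(A) / 2        # number of edges

# Assumed RB configuration quality: sum over same-cluster node pairs of
# (A_ij - gamma * k_i * k_j / (2m)), with no overall normalization.
rb_configuration <- function(membership, gamma = 1) {
  same <- outer(membership, membership, "==")
  sum((A - gamma * outer(k, k) / (2 * m)) * same)
}

# Modularity written per cluster: fraction of edge ends inside the cluster
# minus gamma times the squared fraction of total degree in the cluster.
modularity_score <- function(membership, gamma = 1) {
  sum(vapply(unique(membership), function(cl) {
    idx <- membership == cl
    sum(A[idx, idx]) / (2 * m) - gamma * (sum(k[idx]) / (2 * m))^2
  }, numeric(1)))
}

partitions <- list(
  two_triangles = c(1, 1, 1, 2, 2, 2),
  three_pairs = c(1, 1, 2, 2, 3, 3),
  singletons = 1:6
)
scores <- data.frame(
  rb_configuration = sapply(partitions, rb_configuration),
  modularity = sapply(partitions, modularity_score)
)
# If the two objectives differ only by a constant, this ratio is the same
# (2m) for every partition, so both rank partitions identically.
scores$ratio <- scores$rb_configuration / scores$modularity
scores
```

Because the ratio is the same for every partition at a given resolution, an optimizer maximizing either objective prefers the same partition, which is why switching the default to modularity should line up with what Seurat does.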

bnprks / BPCells

[r] Adjust default parameters for `cluster_graph_leiden()` #147