implement parameter-free classifier-approach to clustering

sreichl commented 1 year ago

inspiration from https://github.com/SCCAF/sccaf
can be applied to any of the previous clustering results (always start from the one with the most clusters)
- e.g., for Leiden: use the result from the largest resolution

sreichl commented 1 year ago

[x] brainstorm name for the approach that describes what it does
- automatic/ed clustering: autoClust
- iterative predictions refine clustering by merging (confusion matrix): confuseClust, Clustit
- Clustering using classification: clusification or classtering
- motto: Merge until we converge

sreichl commented 1 year ago

Pseudocode

General-purpose clustering (aggregation/refinement) approach for high-dimensional data with minimal parameters (generalized! not single-cell specific)
1. UMAP to n=2 & 3 dimensions (the embedding dimension only influences visualization)
  - parameters: metric (e.g., correlation, cosine, ... → data-dependent) and number of neighbours
2. Perform clustering on the UMAP KNN graph, which is dimensionless, to get many fine-grained clusters (i.e., overcluster)
  - parameter: depends on the clustering algorithm. Use a best-in-class algorithm for graphs; currently Leiden, whose "resolution" should be chosen very high (e.g., 2?).
3. classifier approach for the refinement of clustering results ie merging based on confusion matrix until convergence (iterative classification)
  - inspired by SCCAF, the Single Cell Clustering Assessment Framework (Teichmann lab): https://github.com/SCCAF/sccaf
  - the classifier is not necessarily logistic regression, but something less parametric and non-linear, e.g., a random forest with 1-5k trees or a gradient boosting machine like xgboost (shown to be best for tabular data), run with defaults (interpretability can still be achieved post hoc by analyzing the most important predictors, e.g., by linear means or differential-analysis style)
  - make the classifier a configurable variable in the config file? or use heuristic?
  - train & test X times to get average/distribution of train & test(CV) accuracy
  - check whether the stop condition is met: the mean accuracy reaches a threshold (e.g., 0.99 or a user-configured value), or 100 repetitions have been run
    - if stop-condition fulfilled: move on to 4
    - if not fulfilled: merge clusters according to the confusion matrix, save clustering and repeat 3.
4. cluster analysis (maybe move to a separate issue?): requires cluster analysis to be implemented and should be compatible with the Leiden clustering setup (i.e., multiple Leiden clustering results with different resolutions being compared)
  1. apply the clustree R package to all iterations to visualize the clustering evolution
  2. determine cluster indices and apply multi-criteria decision-making methods to see whether the final clustering gets the best score, i.e., revisit master theses for cluster validation approaches
  3. compare metadata with clustering results using external indices eg ARI, NMI, ...
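
A rough sketch of step 3 in Python, to make the loop concrete. Everything here is an assumption for illustration: `merge_until_converged` is a hypothetical name, KMeans on synthetic blobs stands in for Leiden on the UMAP KNN graph, only one cross-validation run is done per round instead of X repetitions, and exactly one cluster pair (the most confused one) is merged per round. SCCAF itself may merge differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import cross_val_predict


def merge_until_converged(X, labels, acc_threshold=0.99, n_trees=50, seed=0):
    """Merge the most-confused cluster pair until CV accuracy meets the threshold."""
    labels = np.asarray(labels).copy()
    history = [labels.copy()]  # clustering before any merge, then one entry per merge
    while len(np.unique(labels)) > 1:
        clf = RandomForestClassifier(n_estimators=n_trees, random_state=seed)
        # Out-of-fold predictions: every point is predicted by a model that never saw it.
        pred = cross_val_predict(clf, X, labels, cv=3)
        if accuracy_score(labels, pred) >= acc_threshold:
            break
        classes = np.unique(labels)
        cm = confusion_matrix(labels, pred, labels=classes).astype(float)
        cm /= cm.sum(axis=1, keepdims=True)  # fraction of each true cluster
        np.fill_diagonal(cm, 0.0)            # ignore correct predictions
        sym = np.maximum(cm, cm.T)           # confusion in either direction
        i, j = np.unravel_index(np.argmax(sym), sym.shape)
        labels[labels == classes[j]] = classes[i]  # merge cluster j into cluster i
        history.append(labels.copy())
    return labels, history


# Toy data: 3 well-separated groups, deliberately overclustered into 12.
X, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 10], [10, -10]],
                  cluster_std=1.0, random_state=0)
start = KMeans(n_clusters=12, n_init=10, random_state=0).fit_predict(X)
final, history = merge_until_converged(X, start)
n_start, n_final = len(np.unique(start)), len(np.unique(final))
print(f"{n_start} clusters merged down to {n_final}")
```

Keeping `history` around also covers "save clustering" in the pseudocode, since each entry is the full label vector after one merge.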

sreichl commented 1 year ago

blog post: https://cbiagii.github.io/post/post_01/
Nature Methods 2020 Teichmann SCCAF paper: https://www.nature.com/articles/s41592-020-0825-9

sreichl commented 12 months ago

TESTING NOTES - COMPARE result with ground truth

STOPPING USING max edge weight of cross-prediction graph

cut-off: 0.025, i.e., 2.5% as max edge weight -> does that mean 5% of cells want to go from cluster A to B or vice-versa?

- 100 trees: ARI 0.8619293526483519, NMI 0.8905384726712815
- 1000 trees: ARI 0.8738051377073799, NMI 0.8995254588822036
- 5000 trees: ARI 0.8689756028623556, NMI 0.8984957984189852
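
Whether 2.5% means 5% of cells in total depends on how the edge weight is normalised, which these notes do not pin down. A small sketch under one assumed definition (not necessarily what SCCAF does): row-normalise the confusion matrix, and take the edge weight between A and B as the larger of "fraction of A predicted as B" and "fraction of B predicted as A". `max_edge_weight` is a hypothetical helper and the matrix is made up.

```python
import numpy as np


def max_edge_weight(cm):
    """Return the largest cross-prediction edge weight and the cluster pair it joins.

    Assumed definition: rows are true clusters, each row is normalised to
    fractions, and an edge takes the larger of its two directions.
    """
    cm = np.asarray(cm, dtype=float)
    frac = cm / cm.sum(axis=1, keepdims=True)
    np.fill_diagonal(frac, 0.0)
    sym = np.maximum(frac, frac.T)
    i, j = np.unravel_index(np.argmax(sym), sym.shape)
    return float(sym[i, j]), (int(i), int(j))


# Made-up confusion matrix for three clusters of 100 cells each.
cm = np.array([[95, 5, 0],
               [2, 98, 0],
               [0, 1, 99]])
weight, pair = max_edge_weight(cm)
print(weight, pair)          # 5% of cluster 0 is predicted as cluster 1
keep_merging = weight > 0.025
```

Under this definition a cut-off of 0.025 says that no cluster sends more than 2.5% of its own cells to any single other cluster; it is a per-cluster, per-direction fraction, not a combined 5%. Summing both directions instead of taking the maximum would give a different reading.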

STOPPING USING ACCURACY

threshold: 0.975 acc

- 100 trees: ARI 0.8570580695592698, NMI 0.898311524053469
- 1000 trees: ARI 0.82217666507298, NMI 0.8769193958701921
- 5000 trees: ARI 0.853381198544215, NMI 0.8879365153842752
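
For reference, a minimal example of how ARI and NMI against a ground truth can be computed with scikit-learn. The label vectors are toy values, not the data behind the numbers above.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Same partition with different label names: both indices ignore naming.
same = [1, 1, 1, 0, 0, 0, 2, 2, 2]
ari_same = adjusted_rand_score(truth, same)
nmi_same = normalized_mutual_info_score(truth, same)

# Two true groups merged into one cluster: both indices drop below 1.
merged = [0, 0, 0, 0, 0, 0, 1, 1, 1]
ari_merged = adjusted_rand_score(truth, merged)
nmi_merged = normalized_mutual_info_score(truth, merged)

print(f"identical partition: ARI {ari_same:.3f} NMI {nmi_same:.3f}")
print(f"over-merged:         ARI {ari_merged:.3f} NMI {nmi_merged:.3f}")
```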

alternative stopping criteria/strategies

candidates: max_weight, accuracy, f1_score, or a change in accuracy below e.g. 0.05%

```python
from sklearn.metrics import accuracy_score, f1_score

# Check if the accuracy threshold is met
accuracy = accuracy_score(labels, new_labels)
print(f"Accuracy: {accuracy}")
f1 = f1_score(labels, new_labels, average='weighted')
print(f"F1: {f1}")
```
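
The "change in accuracy" candidate could look roughly like this. `should_stop` and its parameter names are hypothetical; the defaults just reuse the 0.975 threshold and the 0.05% (0.0005) change mentioned in these notes.

```python
def should_stop(acc_history, acc_threshold=0.975, min_delta=0.0005):
    """Decide whether to stop merging, given one accuracy value per round.

    Stops when the latest accuracy meets the threshold, or when it improved
    by less than min_delta compared with the previous round.
    """
    if not acc_history:
        return False
    if acc_history[-1] >= acc_threshold:
        return True
    if len(acc_history) >= 2:
        return (acc_history[-1] - acc_history[-2]) < min_delta
    return False


print(should_stop([0.90, 0.93]))     # still improving, below threshold
print(should_stop([0.93, 0.9302]))   # improvement smaller than 0.05%
print(should_stop([0.90, 0.98]))     # threshold reached
```

One caveat with the plateau rule: accuracy can dip after a bad merge, which this sketch also treats as a reason to stop.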

bednarsky commented 9 months ago

@sreichl In addition to adding a stopping criterion and a recommended clustering, could you keep the clusterings at each merging step and return an interactive plot with a slider where the accuracy of the last merger is shown as text in the corner? So one could check how the clusters look for different thresholds and potentially pick one that looks good.

Of course then not directly comparable via indices, but might be a good thing to troubleshoot if only one big cluster is left and might not be too expensive to store the labels and recolor the UMAP.
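
A sketch of the bookkeeping such a slider would need, leaving out the plotting itself. `MergeHistory` is a hypothetical container, and the labels and accuracies below are made up.

```python
import numpy as np


class MergeHistory:
    """Stores the cluster labels and classifier accuracy after every merge step."""

    def __init__(self):
        self.labels = []
        self.accuracy = []

    def record(self, labels, accuracy):
        # Copy so later in-place merges do not alter stored steps.
        self.labels.append(np.asarray(labels).copy())
        self.accuracy.append(float(accuracy))

    def first_step_reaching(self, threshold):
        """Index of the first step whose accuracy meets the threshold, else the last step."""
        if not self.accuracy:
            raise ValueError("no steps recorded")
        for step, acc in enumerate(self.accuracy):
            if acc >= threshold:
                return step
        return len(self.accuracy) - 1


hist = MergeHistory()
hist.record([0, 1, 2, 3], 0.80)
hist.record([0, 1, 2, 2], 0.92)
hist.record([0, 0, 2, 2], 0.99)

step = hist.first_step_reaching(0.90)
print(step, hist.labels[step], hist.accuracy[step])
```

The slider position would map to `step`: recolor the UMAP with `hist.labels[step]` and print `hist.accuracy[step]` in the corner. Storage is one integer vector per merge, so it stays cheap.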

sreichl commented 9 months ago

thanks, added it to #28

epigen / unsupervised_analysis