epigen / unsupervised_analysis

A general purpose Snakemake workflow to perform unsupervised analyses (dimensionality reduction & cluster analysis) and visualizations of high-dimensional data.
MIT License

implement parameter-free classifier-approach to clustering #9

Closed sreichl closed 1 year ago

sreichl commented 1 year ago

Pseudocode

sreichl commented 12 months ago

TESTING NOTES - COMPARE result with ground truth

STOPPING USING max edge weight of crossprediction graph

Open question: with 0.025 (i.e., 2.5%) as the max edge weight cutoff, does that mean 5% of cells want to go from cluster A to B or vice versa?

with 100 trees and a max edge weight cutoff of 0.025: ARI 0.8619293526483519, NMI 0.8905384726712815

with 1000 trees and a max edge weight cutoff of 0.025: ARI 0.8738051377073799, NMI 0.8995254588822036

with 5000 trees and a max edge weight cutoff of 0.025: ARI 0.8689756028623556, NMI 0.8984957984189852
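To make the open question about the cutoff concrete, here is a minimal sketch of one way the edge weights could be defined. The thread does not state the normalization, so the definition used here is an assumption: the weight between clusters A and B is the number of cells cross-predicted in either direction, divided by the combined size of both clusters. The function names `crossprediction_edges` and `should_stop` are hypothetical and not taken from the workflow.

```python
from collections import Counter


def crossprediction_edges(labels, predicted):
    """Undirected edge weights between clusters from cross-predictions.

    ASSUMED definition (not confirmed by the thread):
    weight(A, B) = (cells of A predicted as B + cells of B predicted as A)
                   / (cells in A + cells in B)
    """
    sizes = Counter(labels)
    # Count only cells whose predicted cluster differs from their own.
    moved = Counter(
        (true, pred) for true, pred in zip(labels, predicted) if true != pred
    )
    edges = {}
    for (a, b), n in moved.items():
        key = tuple(sorted((a, b)))  # same key for A->B and B->A
        edges[key] = edges.get(key, 0.0) + n / (sizes[a] + sizes[b])
    return edges


def should_stop(edges, max_weight_cutoff=0.025):
    """Stop merging once no pair of clusters exceeds the cutoff."""
    return not edges or max(edges.values()) <= max_weight_cutoff


# Toy data: 4 cells of A are predicted as B, 2 cells of B as A.
labels = ["A"] * 40 + ["B"] * 40 + ["C"] * 20
predicted = ["A"] * 36 + ["B"] * 4 + ["B"] * 38 + ["A"] * 2 + ["C"] * 20

edges = crossprediction_edges(labels, predicted)
strongest = max(edges, key=edges.get)  # pair that would be merged next
print(strongest, edges[strongest], should_stop(edges))
```

Under this assumed definition, a cutoff of 0.025 bounds the share of cross-predicted cells among both clusters combined. A directed weight normalized by the source cluster alone would read differently, which is the ambiguity the question points at.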

STOPPING USING ACCURACY

with 100 trees and an accuracy threshold of 0.975: ARI 0.8570580695592698, NMI 0.898311524053469

with 1000 trees and an accuracy threshold of 0.975: ARI 0.82217666507298, NMI 0.8769193958701921

with 5000 trees and an accuracy threshold of 0.975: ARI 0.853381198544215, NMI 0.8879365153842752
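For reference, the comparison against ground truth can be sketched as follows. This is a dependency-free illustration of the adjusted Rand index (ARI), not the code that produced the numbers above. In practice, scikit-learn's `adjusted_rand_score` and `normalized_mutual_info_score` would likely be used instead. The function name `adjusted_rand_index` is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same cells."""
    n = len(truth)
    # Contingency counts: cells sharing a (truth, pred) label pair.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put all cells in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# ARI ignores label names: a pure renaming scores 1.0.
same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A labeling that splits every true cluster scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
print(same_partition, crossed)
```

Because ARI is invariant to label names, merged clusters can be scored against the ground truth without first matching cluster indices.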

alternative stopping criteria/strategies

candidates: max_weight, accuracy, f1_score, or a change in accuracy below, e.g., 0.05%

    from sklearn.metrics import accuracy_score, f1_score

    # Check if the accuracy threshold is met
    accuracy = accuracy_score(labels, new_labels)
    print(f"Accuracy: {accuracy}")
    # Weighted F1 as an alternative criterion
    f1 = f1_score(labels, new_labels, average='weighted')
    print(f"F1: {f1}")
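The last candidate, stopping once the accuracy changes by less than a small amount between merging steps, could look roughly like this. It is only a sketch: the helper name `first_converged_step`, the default `min_delta` of 0.0005 (the 0.05% mentioned above), and the accuracy values are illustrative assumptions.

```python
def first_converged_step(accuracies, min_delta=0.0005):
    """Return the first step whose accuracy differs from the previous
    step by less than min_delta, or None if that never happens."""
    for step in range(1, len(accuracies)):
        if abs(accuracies[step] - accuracies[step - 1]) < min_delta:
            return step
    return None


# Made-up accuracies, one per merging step (illustration only).
accuracy_per_step = [0.90, 0.94, 0.962, 0.9622, 0.97]
stop_at = first_converged_step(accuracy_per_step)
print(stop_at)
```

One caveat of this criterion is that it can trigger on an early plateau even if a later merge would still raise the accuracy, as in the toy values above.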

bednarsky commented 9 months ago

@sreichl In addition to adding a stopping criterion and a recommended clustering, could you keep the clusterings at each merging step and return an interactive plot with a slider where the accuracy of the last merger is shown as text in the corner? So one could check how the clusters look for different thresholds and potentially pick one that looks good.

Of course, the clusterings would then not be directly comparable via indices, but this might be useful for troubleshooting when only one big cluster is left, and storing the labels and recoloring the UMAP might not be too expensive.
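A rough sketch of the bookkeeping this suggestion would need: store a copy of the labels and the accuracy after every merging step, so that any step can later be recolored on the UMAP or picked by threshold. All names (`merge_clusters`, `record_step`, `pick_clustering`) and the accuracy values are hypothetical. The slider plot itself would be a plotting layer on top of `history` and is not shown.

```python
def merge_clusters(labels, keep, absorb):
    """Relabel every cell of cluster `absorb` as cluster `keep`."""
    return [keep if lab == absorb else lab for lab in labels]


def record_step(history, labels, accuracy):
    """Append a snapshot of the current clustering to the history."""
    history.append({
        "step": len(history),
        "labels": list(labels),          # copy, so later merges don't alter it
        "n_clusters": len(set(labels)),
        "accuracy": accuracy,            # accuracy after the last merger
    })


def pick_clustering(history, min_accuracy):
    """First snapshot reaching min_accuracy, else the last snapshot."""
    for entry in history:
        if entry["accuracy"] >= min_accuracy:
            return entry
    return history[-1]


# Toy run with made-up accuracies (illustration only).
labels = [0, 0, 1, 1, 2, 2, 3, 3]
history = []
record_step(history, labels, accuracy=0.80)
labels = merge_clusters(labels, keep=0, absorb=1)
record_step(history, labels, accuracy=0.93)
labels = merge_clusters(labels, keep=2, absorb=3)
record_step(history, labels, accuracy=0.98)

chosen = pick_clustering(history, min_accuracy=0.975)
print(chosen["step"], chosen["n_clusters"])
```

Storing one label list per step costs memory proportional to the number of cells times the number of merging steps, which fits the expectation that this should not be too expensive.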

sreichl commented 9 months ago

thanks, added it to #28