jiho opened this issue 4 years ago
Some background here: https://www.seanoe.org/data/00618/73002/
Linked with #228 I guess.
The process would be:
That would be a really nice addition to EcoTaxa!
The crux is in the details (as always...). It is very hard to choose "sensible settings" for any clustering algorithm, because what is sensible strongly depends on the dataset (size, distribution, ...).
I guess what you have in mind is to choose `min_cluster_size` so that each cluster receives (say) 100 objects, so that `n_clusters = (n_predicted + n_unknown) / 100`. The problem is that `min_cluster_size` and `n_clusters` are very hard to relate.
Maybe k-means is better suited (but I would rather choose small clusters to keep them homogeneous).
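To make the back-of-the-envelope relation above concrete, here is a minimal sketch; the function name and the guard against tiny projects are mine, not EcoTaxa's:

```python
def target_n_clusters(n_predicted: int, n_unknown: int, per_cluster: int = 100) -> int:
    """Number of k-means clusters so that each receives about `per_cluster` objects.

    This is the n_clusters = (n_predicted + n_unknown) / 100 idea from the
    discussion; with (H)DBSCAN there is no direct knob for this, which is
    exactly the min_cluster_size vs. n_clusters mismatch mentioned above.
    """
    return max(1, (n_predicted + n_unknown) // per_cluster)

print(target_n_clusters(12_000, 8_000))  # 20,000 objects / 100 -> 200
```

The `max(1, ...)` floor is just defensive: a project with fewer than 100 objects still gets one cluster rather than zero.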
Constrained k-means could take into account samples excluded from a group, but this would only work if users also approved groups as homogeneous (negative examples only make sense if you also have positive examples for a cluster). I suspect you don't want to do this, but rather directly validate it as a taxonomic category, right?
> That would be a really nice addition to EcoTaxa!
> The crux is in the details (as always...). It is very hard to choose "sensible settings" for any clustering algorithm, because what is sensible strongly depends on the dataset (size, distribution, ...).
> I guess what you have in mind is to choose `min_cluster_size` so that each cluster receives (say) 100 objects, so that `n_clusters = (n_predicted + n_unknown) / 100`. The problem is that `min_cluster_size` and `n_clusters` are very hard to relate. Maybe k-means is better suited (but I would rather choose small clusters to keep them homogeneous).
I like the (H)DBSCAN property that not everything is in a cluster, only the dense parts are. The idea, in EcoTaxa, would be that you get the dense parts (i.e. very similar images) out very fast at the beginning, thanks to the grouping; then you switch back to regular mode (you could call that turtle mode 😉) with individual images.
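That "only the dense parts get a cluster" property can be seen in a few lines with scikit-learn's `DBSCAN` (HDBSCAN behaves the same way); the toy data below is made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one far-away outlier.
dense_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
dense_b = dense_a + 10.0
outlier = np.array([[100.0, 100.0]])
X = np.vstack([dense_a, dense_b, outlier])

labels = DBSCAN(eps=1.0, min_samples=3).fit(X).labels_

# Only the dense groups receive a cluster id (0 and 1);
# the sparse point is labelled -1, i.e. noise, and stays out of any cluster.
print(labels)
```

In the EcoTaxa scenario, the `-1` points are exactly the images that would fall back to the individual-object ("turtle") mode.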
My idea for the heuristic is that, with a 1M-object project, you probably want clusters of >10,000 objects; with a 100,000-object project you want >1,000, etc. `n_clusters` would be unconstrained (which I think is the case with DBSCAN too, right?). Once that ratio is set (by trial and error while implementing the feature: super strong theoretical justification!) it would probably behave as one expects, either across projects (big projects = big clusters, small projects = small clusters) or within a project over time (start = big clusters, end = small clusters).
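That scale-with-project-size heuristic could be sketched as follows; the 1% ratio, the floor, and the function name are illustrative, i.e. exactly the constants meant to be tuned by trial and error:

```python
def min_cluster_size_for(n_objects: int, ratio: float = 0.01, floor: int = 10) -> int:
    """Hypothetical heuristic: minimum cluster size grows with project size.

    With ratio=0.01: 1,000,000 objects -> 10,000 and 100,000 -> 1,000,
    matching the orders of magnitude discussed above. The floor keeps
    small projects from degenerating into near-singleton clusters.
    """
    return max(floor, int(n_objects * ratio))

print(min_cluster_size_for(1_000_000))  # -> 10000
print(min_cluster_size_for(100_000))    # -> 1000
```

Because only `min_cluster_size` is fixed, the number of clusters stays unconstrained, as with (H)DBSCAN.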
> Constrained k-means could take into account samples excluded from a group, but this would only work if users also approved groups as homogeneous (negative examples only make sense if you also have positive examples for a cluster). I suspect you don't want to do this, but rather directly validate it as a taxonomic category, right?
Yes, one would validate clusters in a category. One could also explode clusters, but I suspect this will happen less often; those would just be left as-is, and people would switch back to the individual-object mode.
- Allow grouping of objects by similarity, show a few, and allow people to validate batches of objects.
- Need a new status for such batch validations.
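The "new status" could be as little as one extra value alongside the existing ones; the names and codes below are purely illustrative, not EcoTaxa's actual schema:

```python
from enum import Enum

class ClassifStatus(Enum):
    """Illustrative only -- not EcoTaxa's real status codes."""
    PREDICTED = "P"        # machine-assigned classification
    VALIDATED = "V"        # confirmed object by object
    BATCH_VALIDATED = "B"  # hypothetical: confirmed as part of a similarity batch

print(ClassifStatus.BATCH_VALIDATED.value)  # -> B
```

Keeping batch validation as a distinct status would let it be audited (or revisited in "turtle mode") separately from per-object validations.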