ecotaxa / ecotaxa_front

Front end of the EcoTaxa application

Allow to stack similar images together in a batch to validate them (à la morphocluster) #336

Open jiho opened 4 years ago

jiho commented 4 years ago

Allow grouping objects by similarity, showing a few from each group, and letting people validate whole batches of objects at once.

Need a new status for such batch validations.

grololo06 commented 4 years ago

Some background here: https://www.seanoe.org/data/00618/73002/

grololo06 commented 4 years ago

Linked with #228 I guess.

jiho commented 4 years ago

The process would be:

moi90 commented 3 years ago

That would be a really nice addition to EcoTaxa!

The crux is in the details (as always...). It is very hard to choose "sensible settings" for any clustering algorithm, because what is sensible depends strongly on the dataset (size, distribution, ...).

I guess what you have in mind is to choose min_cluster_size so that each cluster receives (say) 100 objects (so that n_clusters = (n_predicted + n_unknown) / 100). The problem is that min_cluster_size and n_clusters are very hard to relate. Maybe k-means is better suited (but I would choose rather small clusters to make them homogeneous).
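To make the arithmetic concrete, here is a minimal sketch of how n_clusters for k-means could be derived from a target cluster size. The function name and the `target_size` parameter are made up for illustration; nothing here is existing EcoTaxa code.

```python
import math


def n_clusters_for_target_size(n_predicted, n_unknown, target_size=100):
    """Number of k-means clusters so that clusters hold ~target_size objects on average.

    Hypothetical helper: only the formula (n_predicted + n_unknown) / 100
    comes from the discussion; the name and default are assumptions.
    """
    n_objects = n_predicted + n_unknown
    if n_objects <= 0:
        return 0
    # Round up so that the average cluster never exceeds target_size,
    # and always ask for at least one cluster.
    return max(1, math.ceil(n_objects / target_size))
```

For example, 9,000 predicted plus 1,000 unknown objects with a target of 100 objects per cluster gives 100 clusters. Note that this only fixes the average size: k-means gives no guarantee on the size of any individual cluster.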

Constrained k-means could take into account samples excluded from a group, but this would only work if a user also approved groups as homogeneous (negative examples only make sense if you also have positive examples for a cluster). I suspect you don't want to do this but would rather validate a group directly as a taxonomic category, right?

jiho commented 3 years ago

> That would be a really nice addition to EcoTaxa!

> The crux is in the details (as always...). It is very hard to choose "sensible settings" for any clustering algorithm, because what is sensible depends strongly on the dataset (size, distribution, ...).

> I guess what you have in mind is to choose min_cluster_size so that each cluster receives (say) 100 objects (so that n_clusters = (n_predicted + n_unknown) / 100). The problem is that min_cluster_size and n_clusters are very hard to relate. Maybe k-means is better suited (but I would choose rather small clusters to make them homogeneous).

I like the (H)DBSCAN property that not everything is in a cluster, only the dense parts are. The idea, in EcoTaxa, would be that you get the dense parts (i.e. very similar images) out very fast at the beginning, thanks to the grouping; then you switch back to regular mode (you could call that turtle mode 😉 ) with individual images.
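To illustrate that property, here is a toy, pure-Python version of plain DBSCAN (not HDBSCAN, and not how EcoTaxa or morphocluster do it). It labels dense groups with a cluster number and everything else with -1, then splits the objects into batches to validate together and leftovers for the individual mode. The points stand in for image feature vectors; all names and parameter values are assumptions for the sake of the example.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Tiny O(n^2) DBSCAN: one label per point, -1 for points in no cluster."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbours(i)
        if len(nb) < min_samples:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbours(j)
            if len(nb_j) >= min_samples:
                queue.extend(nb_j)  # core point: keep growing the cluster
    return labels


def split_batches(labels):
    """Group indices by cluster; unclustered indices go to the individual mode."""
    batches, singles = {}, []
    for idx, label in enumerate(labels):
        if label == NOISE:
            singles.append(idx)
        else:
            batches.setdefault(label, []).append(idx)
    return batches, singles


# Two dense groups of near-identical "images" plus two isolated objects.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05),
    (10.0, 10.0), (10.1, 10.0), (10.0, 10.1), (10.1, 10.1),
    (5.0, 5.0), (20.0, 0.0),
]
labels = dbscan(points, eps=0.5, min_samples=3)
batches, singles = split_batches(labels)
```

Here `batches` holds the two dense groups (indices 0 to 4 and 5 to 8), which could be validated in one go, while `singles` holds the two isolated objects (indices 9 and 10) that would stay in the regular, one-by-one mode.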

My idea for the heuristic is that, with a 1M-object project, you probably want clusters of >10,000 objects, with a 100,000-object project you want >1000, etc. n_clusters would be unconstrained (which I think is the case with DBSCAN too, right?). Once that ratio is set (by trial and error while implementing the feature: super strong theoretical justification!) it would probably behave as one expects both across projects (big projects = big clusters, small projects = small clusters) and within a project over time (start = big clusters, end = small clusters).
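As a sketch of that heuristic: the two examples above both correspond to a minimum cluster size of 1/100 of the objects left to sort, so that is the ratio used here. The function name, the `divisor` parameter and the `floor` value are hypothetical, and the real ratio would have to come out of the trial and error mentioned above.

```python
def min_cluster_size_for(n_remaining, divisor=100, floor=5):
    """Minimum cluster size as a fixed fraction of the objects still to validate.

    divisor=100 is read off the examples (1M -> 10,000; 100,000 -> 1000);
    floor is an assumed lower bound so that tiny projects still get usable groups.
    Integer division avoids floating-point rounding surprises.
    """
    return max(floor, n_remaining // divisor)


# Across projects: big projects get big clusters, small projects small ones.
big_project = min_cluster_size_for(1_000_000)
small_project = min_cluster_size_for(100_000)

# Within a project over time: as objects get validated, fewer remain,
# so the minimum cluster size shrinks with them.
sizes_over_time = [min_cluster_size_for(n) for n in (1_000_000, 400_000, 20_000, 300)]
```

With these defaults, `sizes_over_time` goes from 10,000 down to the floor of 5, which matches the "start = big clusters, end = small clusters" behaviour.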

> Constrained k-means could take into account samples excluded from a group, but this would only work if a user also approved groups as homogeneous (negative examples only make sense if you also have positive examples for a cluster). I suspect you don't want to do this but would rather validate a group directly as a taxonomic category, right?

Yes, one would validate clusters into a category. One could also explode clusters, but I suspect this will happen less often; those clusters would just be left as is and people would then switch back to the individual object mode.