jiho opened this issue 4 years ago
Some background here: https://www.seanoe.org/data/00618/73002/
Linked with #228 I guess.
The process would be:
That would be a really nice addition to EcoTaxa!
The crux is in the details (as always...). It is very hard to choose "sensible settings" for any clustering algorithm, because what is sensible strongly depends on the dataset (size, distribution, ...).
I guess what you have in mind is to choose `min_cluster_size` so that each cluster receives (say) 100 objects, so that `n_clusters = (n_predicted + n_unknown) / 100`. The problem is that `min_cluster_size` and `n_clusters` are very hard to relate.
Maybe k-means is better suited (but I would rather choose small clusters to keep them homogeneous).
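To make the back-of-the-envelope relation above concrete, here is a minimal sketch; the function name and the guard against tiny projects are mine, not EcoTaxa's:

```python
def target_n_clusters(n_predicted: int, n_unknown: int, per_cluster: int = 100) -> int:
    """Number of k-means clusters so that each receives about `per_cluster` objects.

    This is the n_clusters = (n_predicted + n_unknown) / 100 idea from the
    discussion; with (H)DBSCAN there is no direct knob for this, which is
    exactly the min_cluster_size vs. n_clusters mismatch mentioned above.
    """
    return max(1, (n_predicted + n_unknown) // per_cluster)

print(target_n_clusters(12_000, 8_000))  # 20,000 objects / 100 -> 200
```

The `max(1, ...)` floor is just defensive: a project with fewer than 100 objects still gets one cluster rather than zero.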
Constrained k-means could take into account samples excluded from a group, but this would only work if users also approved groups as homogeneous (negative examples only make sense if you also have positive examples for a cluster). I suspect you don't want to do this, but rather directly validate it as a taxonomic category, right?
> That would be a really nice addition to EcoTaxa!
> The crux is in the details (as always...). It is very hard to choose "sensible settings" for any clustering algorithm, because what is sensible strongly depends on the dataset (size, distribution, ...).
> I guess what you have in mind is to choose `min_cluster_size` so that each cluster receives (say) 100 objects, so that `n_clusters = (n_predicted + n_unknown) / 100`. The problem is that `min_cluster_size` and `n_clusters` are very hard to relate. Maybe k-means is better suited (but I would rather choose small clusters to keep them homogeneous).
I like the (H)DBSCAN property that not everything is in a cluster, only the dense parts are. The idea, in EcoTaxa, would be that you get the dense parts (i.e. very similar images) out very fast at the beginning, thanks to the grouping; then you switch back to regular mode (you could call that turtle mode 😉) with individual images.
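That "only the dense parts get a cluster" property can be seen in a few lines with scikit-learn's `DBSCAN` (HDBSCAN behaves the same way); the toy data below is made up:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two tight groups of points plus one far-away outlier.
dense_a = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1]])
dense_b = dense_a + 10.0
outlier = np.array([[100.0, 100.0]])
X = np.vstack([dense_a, dense_b, outlier])

labels = DBSCAN(eps=1.0, min_samples=3).fit(X).labels_

# Only the dense groups receive a cluster id (0 and 1);
# the sparse point is labelled -1, i.e. noise, and stays out of any cluster.
print(labels)
```

In the EcoTaxa scenario, the `-1` points are exactly the images that would fall back to the individual-object ("turtle") mode.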
My idea for the heuristic is that, with a 1M-object project, you probably want clusters of >10,000 objects; with a 100,000-object project you want >1,000, etc. `n_clusters` would be unconstrained (which I think is the case with DBSCAN too, right?). Once that ratio is set (by trial and error while implementing the feature: super strong theoretical justification!) it would probably behave as one expects, either across projects (big projects = big clusters, small projects = small clusters) or within a project over time (start = big clusters, end = small clusters).
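That scale-with-project-size heuristic could be sketched as follows; the 1% ratio, the floor, and the function name are illustrative, i.e. exactly the constants meant to be tuned by trial and error:

```python
def min_cluster_size_for(n_objects: int, ratio: float = 0.01, floor: int = 10) -> int:
    """Hypothetical heuristic: minimum cluster size grows with project size.

    With ratio=0.01: 1,000,000 objects -> 10,000 and 100,000 -> 1,000,
    matching the orders of magnitude discussed above. The floor keeps
    small projects from degenerating into near-singleton clusters.
    """
    return max(floor, int(n_objects * ratio))

print(min_cluster_size_for(1_000_000))  # -> 10000
print(min_cluster_size_for(100_000))    # -> 1000
```

Because only `min_cluster_size` is fixed, the number of clusters stays unconstrained, as with (H)DBSCAN.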
> Constrained k-means could take into account samples excluded from a group, but this would only work if users also approved groups as homogeneous (negative examples only make sense if you also have positive examples for a cluster). I suspect you don't want to do this, but rather directly validate it as a taxonomic category, right?
Yes, one would validate clusters in a category. One could also explode clusters, but I suspect this will happen less often; those would just be left as-is, and people would switch back to the individual-object mode.
- Allow grouping of objects by similarity, show a few, and allow people to validate batches of objects.
- Need a new status for such batch validations.
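The "new status" could be as little as one extra value alongside the existing ones; the names and codes below are purely illustrative, not EcoTaxa's actual schema:

```python
from enum import Enum

class ClassifStatus(Enum):
    """Illustrative only -- not EcoTaxa's real status codes."""
    PREDICTED = "P"        # machine-assigned classification
    VALIDATED = "V"        # confirmed object by object
    BATCH_VALIDATED = "B"  # hypothetical: confirmed as part of a similarity batch

print(ClassifStatus.BATCH_VALIDATED.value)  # -> B
```

Keeping batch validation as a distinct status would let it be audited (or revisited in "turtle mode") separately from per-object validations.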