ecotaxa / ecotaxa_front

Front end of the EcoTaxa application

Add a way to highlight "weird" objects in a category #160

Open · jiho opened this issue 6 years ago

jiho commented 6 years ago

This would help to check already validated samples for potential errors.

via Rainer.

jiho commented 6 years ago

I see two ways of doing this:

  1. With what we currently have: re-predict the validated objects and extract, for each object, the score for the category it is currently classified in. Use this as a way to highlight typical (high score) or weird (low score) objects.

  2. Implement a similarity measure, e.g. compute a PCA on the features used for classification and sort by the first axis (see the sketch below).
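
As an illustration of option 2, here is a minimal sketch, assuming the classification features of one category are available as a numpy array (all names are placeholders, not the actual EcoTaxa code):

```python
# Sketch of option 2: order the validated objects of one category along the
# first principal component of the classification features.
# `features` is an (n_objects, n_features) array for a single category.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def sort_by_first_axis(features: np.ndarray) -> np.ndarray:
    """Return object indices ordered along the first PCA axis.

    Objects at either extreme of this ordering are the atypical ones
    to show first for review."""
    scaled = StandardScaler().fit_transform(features)
    pc1 = PCA(n_components=1).fit_transform(scaled)[:, 0]
    return np.argsort(pc1)
```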

picheral commented 6 years ago

Put on hold. In the meantime, use DUBIOUS, since it is already implemented and easier to use.

grololo06 commented 4 years ago

I would need to know which functions this idea applies to.

jiho commented 4 years ago

> Put on hold. In the meantime, use DUBIOUS, since it is already implemented and easier to use.

This is not a manual operation; it is precisely a tool to detect dubious objects.

jiho commented 4 years ago

Thinking about this more, a similarity index works well for things close to the reference but not for the rest.

So I think the best solution is the score-based approach.

This is convoluted but uses existing tools. I don't see a way of making it fast enough that it becomes interactive (like right-clicking on a category and getting a "Show me weird objects first" option).
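
A minimal sketch of that score-based approach, assuming a scikit-learn-style classifier exposing `predict_proba`; the variable names are placeholders, not the actual EcoTaxa code:

```python
# Re-predict already-validated objects and sort them by the probability the
# classifier assigns to their *current* category: low scores come first,
# i.e. the "weird" objects.
import numpy as np

def weirdness_order(clf, features, current_labels):
    """Indices of objects sorted from least to most typical of their own category."""
    proba = clf.predict_proba(features)                  # (n_objects, n_classes)
    class_index = {c: i for i, c in enumerate(clf.classes_)}
    cols = np.array([class_index[c] for c in current_labels])
    own_score = proba[np.arange(len(features)), cols]    # P(current category | object)
    return np.argsort(own_score)                         # ascending: weird objects first
```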

rkiko commented 4 years ago

This should contribute to the development of tools to compare sorting of new profiles/samples against this core set.

I think that, in particular, finding the closest matching 10 (x) images (using deep-learning features) in a gold dataset could be a way to get an idea of whether the image in question was sorted properly [...]. Matches should be found in different profiles or different geographical regions, to avoid finding only very similar images in a single profile that was not sorted well.

I think we should first experiment with how such similarity searches can be done before we implement them in EcoTaxa. A first step might be to get a match score for individual images, then to calculate a "profile/sample match against gold dataset" score. This could then be calculated externally and only the score loaded (or used to grow the gold dataset). Good student project!

Finding weird stuff and finding matches seem to be two sides of the same coin, although matching might be faster, as one can stop the search after a certain number of matches.

Matching is not the same as prediction or as implementing MorphoCluster, I think… If you don't find a match, you don't have a prediction, which might be good: it does not force things into categories. But it will not help to find similar images in the dataset to be sorted and to cluster them. So, at least for novelty detection, MorphoCluster should still be best ;) And finding mismatches (i.e. something not belonging to a category) would also be important for the clean-up of existing sortings.
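
One possible shape for the two-step scoring described above (a per-image match score, then a profile/sample score), assuming deep-learning feature vectors have already been extracted externally; the score definition and all names are placeholders:

```python
# Per-image match score against a gold dataset of feature vectors, then an
# aggregate score for a whole profile/sample.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def image_match_scores(gold_features, sample_features, k=10):
    """Mean distance to the k closest gold images, turned into a similarity score."""
    nn = NearestNeighbors(n_neighbors=k).fit(gold_features)
    dist, _ = nn.kneighbors(sample_features)
    return 1.0 / (1.0 + dist.mean(axis=1))   # ~1 = very close to gold, ->0 = no close match

def sample_match_score(gold_features, sample_features):
    """Aggregate the per-image scores into one score for the profile/sample."""
    return float(image_match_scores(gold_features, sample_features).mean())
```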

rkiko commented 4 years ago

Techniques to try:

- Siamese networks
- https://users.aalto.fi/~kannalj1/publications/icpr2016.pdf
- https://github.com/facebookresearch/faiss (suggested by JO in issue #228)
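
For reference, a minimal faiss example on placeholder feature vectors (feature extraction and the actual dimensions are out of scope here):

```python
# Build a faiss index over gold-set feature vectors and query the 10 nearest
# gold images for each new image.
import numpy as np
import faiss

d = 128                                    # feature dimension (placeholder)
gold = np.random.rand(10000, d).astype("float32")    # gold-set features (placeholder)
query = np.random.rand(5, d).astype("float32")       # new images to check (placeholder)

index = faiss.IndexFlatL2(d)               # exact brute-force L2 search
index.add(gold)                            # add the gold-set vectors
distances, neighbours = index.search(query, 10)      # 10 closest gold images per query
```

`IndexFlatL2` does exact search; faiss also provides approximate indexes (IVF, HNSW) if the gold set grows large.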

rkiko commented 4 years ago

Suggest to merge issues 160 and 228

rkiko commented 4 years ago

Not sure if this could help for the distance calculations:

https://en.wikipedia.org/wiki/Vantage-point_tree

We use it to find the closest coastal location for a given data point in the ocean.
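
For illustration only, a minimal vantage-point tree sketch (nearest neighbour only, any metric); this is not the implementation used for the coastal-distance case, just a toy version of the data structure:

```python
# Minimal vantage-point tree: split points around the median distance to a
# randomly chosen vantage point, then prune whole subtrees at query time.
import random
import numpy as np

class VPTree:
    def __init__(self, points, dist):
        self.dist = dist
        self.vp = points[random.randrange(len(points))]   # pick a vantage point
        rest = [p for p in points if p is not self.vp]
        self.mu, self.inside, self.outside = 0.0, None, None
        if rest:
            d = [dist(self.vp, p) for p in rest]
            self.mu = float(np.median(d))                  # median distance splits the set
            near = [p for p, x in zip(rest, d) if x < self.mu]
            far = [p for p, x in zip(rest, d) if x >= self.mu]
            self.inside = VPTree(near, dist) if near else None
            self.outside = VPTree(far, dist) if far else None

    def nearest(self, q, best=None):
        """Return (distance, point) of the nearest neighbour of q."""
        d = self.dist(q, self.vp)
        if best is None or d < best[0]:
            best = (d, self.vp)
        # search the side containing q first, cross the split only if it can still win
        first, second = (self.inside, self.outside) if d < self.mu else (self.outside, self.inside)
        if first is not None:
            best = first.nearest(q, best)
        if second is not None and abs(d - self.mu) <= best[0]:
            best = second.nearest(q, best)
        return best

# Usage (placeholder data):
# tree = VPTree(list_of_feature_vectors, lambda a, b: float(np.linalg.norm(a - b)))
# distance, point = tree.nearest(query_vector)
```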

rkiko commented 4 years ago

Comment from JO via e-mail (9 Oct 2020):

> For this to work, we need the gold standard to encompass a lot of the variability indeed. Still, if this worked as well as what you describe above, then this would be a perfect classifier: for a new image, get the 10 most similar images and take a majority vote on their class. The fact that such approaches are not used as classifiers makes me think it may be a bit hit-and-miss...

Reply: Maybe people do not appreciate the 'miss'… It could be that you can only assign 10% of the dataset that way… Or it is too slow if you really compare all images with all images. But we could use the hierarchy to restrict the search for the 'matching' score: if something is in copepods, we only need to search in copepods… If there is no close match in there, the score is 0.

Or this does not work well for natural scenes, color images, images with more variety and few really similar examples. It could be that plankton data works well for this.
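
A sketch of the hierarchy-restricted scoring from the reply above (search only the gold images of the object's current category, score 0 when nothing is close); the distance threshold is purely hypothetical and would need calibration on real feature vectors:

```python
# Restrict the nearest-neighbour search to the gold images of the current
# category; return 0 when no sufficiently close match exists.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def restricted_match_score(obj_feature, category, gold_features, gold_categories,
                           k=10, max_dist=1.0):
    mask = gold_categories == category                 # e.g. copepods searched only in copepods
    if not mask.any():
        return 0.0
    k = min(k, int(mask.sum()))
    nn = NearestNeighbors(n_neighbors=k).fit(gold_features[mask])
    dist, _ = nn.kneighbors(obj_feature.reshape(1, -1))
    close = dist[0][dist[0] <= max_dist]               # keep only "close" matches
    if close.size == 0:
        return 0.0                                     # no close match -> score 0
    return float(1.0 / (1.0 + close.mean()))
```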