Open jiho opened 6 years ago
I see two ways of doing this:
With what we currently have, perform a reprediction of the validated objects and extract the score for each object to be in the category it is currently classified in. Use this as a way to highlight typical (high score) or weird (low score) objects.
Implement a similarity measure e.g. compute PCA on features used for classification and sort by first axis.
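A rough sketch of the PCA idea, using only the standard library. The function name `first_pc_scores` and the toy features are made up for illustration, and power iteration stands in for a proper PCA routine. Note that the sign of a principal axis is arbitrary, so the sorted order may come out reversed on other data.

```python
import math

def first_pc_scores(features, iterations=200):
    """Project each feature vector onto the first principal axis.

    Hypothetical helper: needs at least two objects, and uses power
    iteration on the covariance matrix instead of a library PCA.
    """
    n, d = len(features), len(features[0])
    means = [sum(row[j] for row in features) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in features]
    # Sample covariance matrix (d x d).
    cov = [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Deterministic start vector; it could in rare cases be orthogonal
    # to the first axis, in which case the iteration would not converge to it.
    v = [1.0 + j for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return [sum(r[j] * v[j] for j in range(d)) for r in centered]

# Toy features for four objects of one category.
features = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.1]]
scores = first_pc_scores(features)
order = sorted(range(len(features)), key=lambda i: scores[i])
print(order)
```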
Postponed. Use DUBIOUS in the meantime, since it is already implemented and easier to use.
I would need to know on which functions this idea applies.
This is not a manual operation, but a tool meant precisely to detect dubious things.
Thinking about this more, a similarity index works well for things close to the reference but not for the rest.
So I think the best solution is the score-based approach.
This is convoluted but uses existing tools. I don't see a way of making it fast enough that it becomes interactive (like, right-click on a category and get an option "Show me weird objects first").
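To make the score-based approach concrete, here is a minimal sketch. Everything in it is an assumption for illustration: `toy_predict_proba` and `CENTROIDS` stand in for whatever classifier is actually used for reprediction, and the object dictionaries are not the real data model.

```python
import math

# Hypothetical stand-in for the real classifier: one centroid per category.
CENTROIDS = {"copepod": [1.0, 0.0], "detritus": [0.0, 1.0]}

def toy_predict_proba(features):
    """Return a made-up probability per category (closer centroid = higher)."""
    weights = {cat: math.exp(-math.dist(features, centre))
               for cat, centre in CENTROIDS.items()}
    total = sum(weights.values())
    return {cat: w / total for cat, w in weights.items()}

def rank_by_own_category_score(objects, predict_proba):
    """Repredict validated objects and sort by the score of their current category.

    Lowest scores come first, i.e. "weird" objects are shown first.
    """
    scored = []
    for obj in objects:
        probs = predict_proba(obj["features"])
        scored.append((probs.get(obj["category"], 0.0), obj["id"]))
    scored.sort()
    return scored

validated = [
    {"id": 1, "category": "copepod", "features": [1.0, 0.0]},  # typical
    {"id": 2, "category": "copepod", "features": [0.0, 1.0]},  # looks like detritus
]
for score, obj_id in rank_by_own_category_score(validated, toy_predict_proba):
    print(obj_id, round(score, 3))
```

Reversing the sort would show the most typical objects first instead.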
This should contribute to the development of tools to compare sorting of new profiles/samples against this core set.
I think that especially finding the closest matching 10 (x) images (using deep learning features) in a gold dataset could be a way to get an idea of whether the image in question was sorted properly [...]. Matches should be found in different profiles, or different geographical regions, to avoid finding only very similar images in just one profile that was not sorted well.
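One possible way to enforce the "different profiles" constraint is to cap how many matches a single profile may contribute. This is only a sketch: `diverse_matches`, the `max_per_profile` cap and the record layout are assumptions, and the brute-force sort would need to be replaced by an index for a large gold dataset.

```python
import math
from collections import Counter

def diverse_matches(query, gold, k=10, max_per_profile=2):
    """Return up to k nearest gold images, at most max_per_profile per profile."""
    ranked = sorted(gold, key=lambda g: math.dist(query, g["features"]))
    taken = Counter()
    matches = []
    for g in ranked:
        if taken[g["profile"]] >= max_per_profile:
            continue  # this profile already contributed enough matches
        taken[g["profile"]] += 1
        matches.append(g)
        if len(matches) == k:
            break
    return matches

# Toy gold dataset: three near-identical images in profile A, one in profile B.
gold = [
    {"id": "a1", "profile": "A", "features": (0.00, 0.0)},
    {"id": "a2", "profile": "A", "features": (0.01, 0.0)},
    {"id": "a3", "profile": "A", "features": (0.02, 0.0)},
    {"id": "b1", "profile": "B", "features": (0.50, 0.0)},
]
print([m["id"] for m in diverse_matches((0.0, 0.0), gold, k=3)])
```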
I think we should first experiment with how the similarity searches can be done before we implement this in Ecotaxa. A first step might be to get a match score for individual images, then to calculate a "profile/sample match against gold dataset" score. This could then be calculated externally and only the score loaded (or used to grow the gold dataset). Good student project!
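A sketch of the two-step scoring, to fix ideas. The score definition is an assumption, not something decided here: each image gets `1 / (1 + distance to its nearest gold image)`, and the sample score is the plain mean of its image scores.

```python
import math
from statistics import mean

def image_match_score(features, gold_features):
    """Score in (0, 1]: 1.0 means an identical gold image exists."""
    nearest = min(math.dist(features, g) for g in gold_features)
    return 1.0 / (1.0 + nearest)

def sample_match_score(sample_features, gold_features):
    """Aggregate image scores into one score per profile/sample."""
    return mean(image_match_score(f, gold_features) for f in sample_features)

gold = [(0.0, 0.0), (1.0, 1.0)]
well_matched_sample = [(0.0, 0.0), (1.0, 1.0)]
poorly_matched_sample = [(0.0, 0.0), (4.0, 5.0)]
print(sample_match_score(well_matched_sample, gold))
print(sample_match_score(poorly_matched_sample, gold))
```

Because the function only needs feature vectors, it could be run outside the application and only the resulting score loaded.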
Finding weird stuff and finding matches seem to be two sides of the same coin, although finding matches might be faster, as one can stop the search after a certain number of matches.
Matching is not the same as prediction or implementing MorphoCluster, I think… If you don't find a match, you don't have a prediction, which might be good: it does not force things into categories. But it will not help to find similar images in the dataset to be sorted and cluster them. So, at least for novelty detection, MorphoCluster should still be best ;) And finding mismatches (something that does not belong in a category) would also be important for the clean-up of existing sortings.
Techniques to try:
siamese network
https://users.aalto.fi/~kannalj1/publications/icpr2016.pdf
https://github.com/facebookresearch/faiss (from issue 228 suggested by JO)
Suggest to merge issues 160 and 228
Not sure if this could help for the distance calculations:
https://en.wikipedia.org/wiki/Vantage-point_tree
We use it to find the closest coastal location for a given data point in the ocean
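For reference, a small vantage-point tree with nearest-neighbour search, assuming Euclidean distance on feature tuples. The names `VPNode`, `build_vp_tree` and `nearest` are made up for this sketch, and the recursion is not tuned for large or highly duplicated datasets.

```python
import math
import random
from dataclasses import dataclass
from typing import Optional, Tuple

Point = Tuple[float, ...]

@dataclass
class VPNode:
    point: Point                 # the vantage point
    radius: float                # median distance to the remaining points
    inside: Optional["VPNode"]   # points with distance <= radius
    outside: Optional["VPNode"]  # points with distance > radius

def build_vp_tree(points, rng=None):
    if not points:
        return None
    if rng is None:
        rng = random.Random(0)
    points = list(points)
    vp = points.pop(rng.randrange(len(points)))
    if not points:
        return VPNode(vp, 0.0, None, None)
    dists = [math.dist(vp, p) for p in points]
    radius = sorted(dists)[len(dists) // 2]
    inside = [p for p, d in zip(points, dists) if d <= radius]
    outside = [p for p, d in zip(points, dists) if d > radius]
    return VPNode(vp, radius, build_vp_tree(inside, rng), build_vp_tree(outside, rng))

def nearest(node, query, best=None):
    """Return (distance, point) of the closest stored point to query."""
    if node is None:
        return best
    d = math.dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    if d <= node.radius:
        best = nearest(node.inside, query, best)
        # The outside can only hold a closer point if the search ball crosses the radius.
        if d + best[0] > node.radius:
            best = nearest(node.outside, query, best)
    else:
        best = nearest(node.outside, query, best)
        if d - best[0] <= node.radius:
            best = nearest(node.inside, query, best)
    return best

rng = random.Random(1)
points = [(rng.random(), rng.random()) for _ in range(200)]
tree = build_vp_tree(points)
print(nearest(tree, (0.5, 0.5)))
```

The triangle inequality is what allows whole subtrees to be skipped, so this only works with a true metric.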
Comment from JO via mail (9.10.2020): """ For this to work, we need the gold standard to be encompassing a lot of the variability indeed. Still, if this worked as well as what you describe above, then this would be a perfect classifier: for a new image, get the 10 most similar images and take a majority vote about their class. The fact that such approaches are not used as classifiers makes me think it may be a bit hit and miss... """
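The majority-vote classifier described in that mail could look like the sketch below. It is brute force and purely illustrative: `knn_vote` and the `(features, label)` pairs are assumptions, and ties are resolved by whichever label was counted first.

```python
import math
from collections import Counter

def knn_vote(query, gold, k=10):
    """Return (majority label, share of votes) among the k nearest gold images."""
    neighbours = sorted(gold, key=lambda g: math.dist(query, g[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbours)

# Toy gold dataset of (features, label) pairs.
gold = [
    ((0.0, 0.0), "copepod"),
    ((0.1, 0.0), "copepod"),
    ((5.0, 5.0), "detritus"),
]
print(knn_vote((0.0, 0.1), gold, k=3))
```

The share of votes gives a rough idea of how "hit and miss" a given assignment is.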
Reply: Maybe people do not appreciate the "miss"… It could be that you can only assign 10% of the dataset that way, or that it is too slow if you really compare all images with all images. But we could use the hierarchy to restrict the search for the "matching" score: if something is in copepods, we only need to search in copepods. If there is no close match in there, the score is 0.
Or this does not work well for natural scenes, color images, or images with more variety and few really similar examples. It could be that plankton data works well for this.
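A sketch of the restricted search, with the hierarchy flattened to a plain category lookup for simplicity. The `max_distance` cutoff and the linear score are assumptions chosen only to make "no close match gives 0" concrete.

```python
import math

def restricted_match_score(features, category, gold_by_category, max_distance=1.0):
    """Search only the gold images of the object's own category.

    Returns 0.0 when the category has no gold images or none is close enough.
    """
    candidates = gold_by_category.get(category, [])
    if not candidates:
        return 0.0
    nearest = min(math.dist(features, g) for g in candidates)
    if nearest > max_distance:
        return 0.0
    return 1.0 - nearest / max_distance

gold_by_category = {
    "copepod": [(0.0, 0.0)],
    "detritus": [(5.0, 5.0)],
}
print(restricted_match_score((0.0, 0.5), "copepod", gold_by_category))
print(restricted_match_score((5.0, 5.0), "copepod", gold_by_category))
```

Restricting the candidates this way also reduces the number of distance computations per image.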
This would help to check already validated samples for potential errors.
via Rainer.