So if we accept, for the sake of argument, that fine-grained visual categorisation (FGVC) is a different task from generic object recognition and needs different algorithms, as argued in the Dietterich paper.
CNNs seem to be state of the art for object recognition in natural images, but FGVC methods might be better for this sort of task (Chris believes so, and Dietterich's work tackles a very similar problem and focuses on this kind of algorithm).
What if we try to combine these approaches:
train a classical CNN on the dataset
cluster the CNN output matrices for the images
apply FGVC methods to each of these subsets of images, e.g. descriptor + dictionary + LLC encoding + max pooling + linear SVM (a poster on a similar insect-identification task found this performed better than dictionary-free SET methods)
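A minimal sketch of the three steps above, assuming we already have CNN feature vectors for each image (all names and parameters here are hypothetical; scikit-learn's k-means and linear SVM stand in for the clustering and the final FGVC classifier, and a random matrix stands in for real CNN outputs):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Stand-in for step 1's output: CNN activations (e.g. penultimate
# layer) flattened to one feature vector per image, plus class labels.
features = rng.normal(size=(200, 64))
labels = rng.integers(0, 10, size=200)

# Step 2: cluster the CNN outputs into candidate subsets of images.
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(features)

# Step 3: train a separate classifier per cluster. A linear SVM on the
# raw features stands in for the full descriptor + dictionary + LLC
# encoding + max pooling pipeline, which would replace `features` here.
per_cluster_svm = {}
for c in np.unique(cluster_ids):
    mask = cluster_ids == c
    if len(np.unique(labels[mask])) < 2:
        continue  # cannot train a classifier on a single-class cluster
    per_cluster_svm[c] = LinearSVC().fit(features[mask], labels[mask])

# At test time: route each image to its nearest cluster, then classify
# with that cluster's specialist model.
test_x = rng.normal(size=(5, 64))
test_clusters = kmeans.predict(test_x)
preds = [per_cluster_svm[c].predict(x[None])[0]
         for c, x in zip(test_clusters, test_x) if c in per_cluster_svm]
```

One open design choice this makes explicit: test images are routed by nearest cluster centroid, so a routing mistake is unrecoverable by the downstream classifier.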
This sort of divide-and-conquer approach might be awful, but intuitively it seems a reasonable thing to at least try, and optimistically it might combine the best of both worlds. We could also consider modifying the class labels if obvious superclasses emerge at the clustering step (maybe even using those provided by Kaggle).
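If the clusters do line up with semantic superclasses, one crude way to derive relabelled superclass targets is to assign each original class to the cluster it most often falls into (a hypothetical sketch; random arrays stand in for real cluster assignments and labels):

```python
from collections import Counter

import numpy as np

rng = np.random.default_rng(1)
# Stand-ins for the clustering step's output and the original labels.
cluster_ids = rng.integers(0, 4, size=200)
labels = rng.integers(0, 10, size=200)

# Map each fine-grained class to its modal cluster; classes sharing a
# cluster become one candidate superclass.
class_to_super = {}
for cls in np.unique(labels):
    clusters_for_cls = cluster_ids[labels == cls]
    class_to_super[cls] = Counter(clusters_for_cls).most_common(1)[0][0]

superclass_labels = np.array([class_to_super[c] for c in labels])
```

These derived superclasses could then be compared against the Kaggle-provided ones as a sanity check before committing to relabelling.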