cvjena / semantic-embeddings

Hierarchy-based Image Embeddings for Semantic Image Retrieval
MIT License

How to make inference for a single image #3

Closed jalajjainai closed 4 years ago

jalajjainai commented 5 years ago

Hi, it's good work, but I am wondering about a few things. a) You don't insert any embedding layer; rather, the last layer (output classes) is the embedding layer. Why so? Will it still work if the number of classes is small?

b) If the embedding is one-hot (no prior), how do I make inference from the trained model?

Callidior commented 5 years ago

Hi jalajjainai,

could you please be more specific about what you mean by "inference" here? It's as simple as conducting a forward pass of the trained network and extracting activations from the "l2norm" layer if you want to compute semantic image embeddings. All the pre-trained checkpoints provided in this repository have been trained with a combination of the semantic embedding objective and a standard cross-entropy loss, so you can also extract class probabilities from the layer "prob" (last layer).

You don't need to fiddle with the network architecture yourself, since all the pre-trained models have two outputs: the first one is the embeddings and the second one the class probabilities. Details about the pre-processing of the images can be found in section 4.2 of the README. If you use one-hot embeddings, you can also make class predictions directly from the embedding layer by taking the argmax.
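For illustration, here is a minimal sketch of such a forward pass, assuming a Keras model with the two outputs described above; the checkpoint path and input size are hypothetical, and the real pre-processing is the one in section 4.2 of the README:

```python
import numpy as np
from tensorflow import keras

# Hypothetical checkpoint path; a custom_objects argument may be
# needed for custom layers, depending on the checkpoint.
model = keras.models.load_model("embedding_model.h5")

# Placeholder input; real images must be pre-processed as described
# in section 4.2 of the README (size, scaling, mean subtraction, ...).
img = np.random.rand(1, 224, 224, 3).astype("float32")

# The pre-trained models have two outputs:
# first the embeddings ("l2norm" layer), then the class probabilities ("prob").
embeddings, probs = model.predict(img)

pred_class = np.argmax(probs, axis=-1)
# With one-hot embeddings (no prior), the argmax over the embedding
# yields a class prediction as well:
pred_class_from_emb = np.argmax(embeddings, axis=-1)
```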

Regarding your other question: The point of this work was to integrate prior semantic knowledge into deep learning. In our case, this knowledge is given in the form of semantic similarities between classes. An embedding dimensionality that equals the number of classes is completely sufficient to capture this kind of knowledge. If you only have a handful of classes, there simply is not much knowledge. Consequently, there is not much benefit you could draw from this method in this case, unfortunately.
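As an aside, a toy sketch of how class embeddings with dimensionality equal to the number of classes can encode pairwise semantic similarities; the similarity values here are made up, whereas in this repository they are derived from a class hierarchy:

```python
import numpy as np

# Toy similarity matrix for 3 classes (hypothetical values).
# S must be symmetric and positive semi-definite.
S = np.array([[1.0, 0.6, 0.1],
              [0.6, 1.0, 0.1],
              [0.1, 0.1, 1.0]])

# Class embeddings phi with phi @ phi.T == S, via eigendecomposition.
eigval, eigvec = np.linalg.eigh(S)
phi = eigvec * np.sqrt(np.clip(eigval, 0, None))

print(np.allclose(phi @ phi.T, S))  # True: dot products reproduce S
```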

jalajjainai commented 5 years ago

Hi, thanks for your detailed reply. I thought the same: argmax for the one-hot embedding. I am facing a weird problem. I have 2 classes, normal vs. abnormal, for my project. What surprises me is the following: let's say I have 2 classes and I trained the inverse-cosine-loss-based network for classification. When I test an image that does not belong to either training class, the network still predicts one of the training classes. It is very strange. Do you have any opinion on it? Jalaj

Callidior commented 5 years ago

Hi, first I would like to note that this setup (anomaly detection) is not an optimal fit for the framework of hierarchy-based semantic embeddings. Not only is the number of classes much too small, but the classes are also very generic, so that there can be close to no prior knowledge about them. I am also wondering how any test image can belong to neither class: by definition, it must be either normal or abnormal.

Of course, the network will always predict a similarity between the sample and both classes, and one of these similarities will almost always be larger than the other. If you take the argmax, you always get a prediction. But how certain is this prediction? You could try looking at the similarity scores directly; they will always lie between 1 (most similar) and -1 (most dissimilar). Maybe you can find a useful threshold for detecting novel samples. In general, though, I recommend looking into the novelty detection and anomaly detection literature for this task.
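One way to operationalize that rejection idea; the function name and the threshold value are hypothetical and would need tuning on validation data:

```python
import numpy as np

def predict_with_rejection(embedding, class_embeddings, threshold=0.5):
    """Reject a sample as 'novel' when its best class similarity is
    below a threshold (cosine similarities lie in [-1, 1])."""
    # Embeddings from the "l2norm" layer are unit length, so the dot
    # product equals the cosine similarity.
    sims = class_embeddings @ embedding
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else None  # None = novel sample
```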

jalajjainai commented 5 years ago

Hi, I meant to say that there is a target class DOG, which is normal, and CAT, which is abnormal. During testing, I found that if, let's say, the test image is of class TIGER, the network still produces a high similarity score for one of the target classes. However, I have very few samples (around 50) per target class, and so I was interested in applying your work to my problem. I think that with a handful of classes and little data, the network is not able to learn descriptive features, even with the cosine loss.

Callidior commented 5 years ago

I understand your problem now. The smallest dataset we tested on was for text classification with only 4 classes. We still saw a large benefit of the cosine loss over cross-entropy with as few as 10 samples per class (i.e., 40 samples in total). However, pre-trained word embeddings were used for that, and the situation for image classification will probably be different if you learn everything from scratch, due to severe overfitting. It is true that while our paper has shown results with few samples per class, we still had more than 100 classes in each experiment and never fewer than 2k images in total, so there was much more diversity in the data than you have with two classes.

CNN training from scratch in general comes with a closed-world assumption, which the cosine loss does not change either; it just might improve accuracy for those classes if you have little training data. If you only train with images of cats and dogs and your CNN never sees anything else, it will assume that every possible image shows either a cat or a dog, so the learned features will be very specific to this distinction. For example, for distinguishing between cats and dogs, the network might have learned that it is sufficient to check whether there are whiskers in the image or not. Give it an image of a tiger: well, it has whiskers, so it must be a cat. The network simply did not learn any other features, because that was not necessary for the task you gave it.
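For reference, the cosine loss discussed in this thread is simply one minus the cosine similarity between the image feature and the target class embedding; a minimal sketch:

```python
import numpy as np

def cosine_loss(feature, target_embedding):
    """1 - cosine similarity between the L2-normalized feature and the
    (unit-length) target class embedding."""
    f = feature / np.linalg.norm(feature)
    return 1.0 - float(f @ target_embedding)
```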