Results on the previous method (joint prototype): let fv be the visual prototype feature and ft the text feature; the joint feature is f = coefficient * fv + (1 - coefficient) * ft, where the coefficient is generated by a fully-connected network that takes ft as input.
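A minimal PyTorch sketch of this combination (the module name, the layer sizes, and the sigmoid squashing are assumptions; it also assumes ft has already been projected to the same dimensionality as fv):

```python
import torch
import torch.nn as nn

class JointPrototype(nn.Module):
    """Joint prototype f = c * fv + (1 - c) * ft, where the coefficient c
    is predicted by a small fully-connected network from the text feature ft."""

    def __init__(self, text_dim):
        super().__init__()
        # hypothetical layer sizes; the sigmoid keeps the coefficient in (0, 1)
        self.coef_net = nn.Sequential(
            nn.Linear(text_dim, text_dim // 2),
            nn.ReLU(),
            nn.Linear(text_dim // 2, 1),
            nn.Sigmoid(),
        )

    def forward(self, fv, ft):
        # fv, ft: [n_way, feat_dim] visual prototypes and text features
        c = self.coef_net(ft)            # [n_way, 1]
        return c * fv + (1.0 - c) * ft   # [n_way, feat_dim]
```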
method | backbone | using_text_vector | iterations | 5-way 5-shot |
---|---|---|---|---|
ProtoNet(baseline) | ResNet_10 | None | 60000 | 73.24% +- 0.63% |
ProtoNet_Joint | ResNet_10 | BERT_64_mask | 60000 | 73.31% +- 0.66% |
ProtoNet_Joint | ResNet_10 | GloVe | 60000 | 73.28% +- 0.67% |
ProtoNet_Joint | ResNet_10 | Con1 | 60000 | 73.24% +- 0.68% |
ProtoNet_Joint | ResNet_10 | Con2 | 60000 | 73.09% +- 0.70% |
ProtoNet_Joint | ResNet_10 | Con3 | 60000 | 72.95% +- 0.67% |
ProtoNet_Joint | ResNet_10 | Con4 | 60000 | 72.92% +- 0.64% |
ProtoNet_Joint | ResNet_10 | BERT_64_unmask | 60000 | 72.76% +- 0.68% |
ProtoNet_Joint | ResNet_10 | BERT_100_mask | 60000 | 73.12% +- 0.71% |
ProtoNet_Joint | ResNet_10 | BERT_100_unmask | 60000 | 72.32% +- 0.65% |
From the above results, we can find that the joint prototype with a learned coefficient performs on par with the baseline at best, regardless of which text vector is used.
Update 2020.0705
Now we define the coefficient as a hyperparameter and set its value manually, instead of generating it with a network as before.
The results are as below; the text vector is BERT_64_mask.
method | backbone | coefficient | iterations | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_Joint | ResNet_10 | 0.3 | 60000 | 72.18% +- 0.67% |
ProtoNet_Joint | ResNet_10 | 0.5 | 60000 | 73.54% +- 0.69% |
ProtoNet_Joint | ResNet_10 | 0.7 | 60000 | 74.28% +- 0.68% |
ProtoNet_Joint | ResNet_10 | 0.9 | 60000 | 73.64% +- 1.48% |
Better news: the method with coefficient 0.7 is clearly better than the baseline (74.28% vs. 73.24%).
Also, we can see that when combining the text feature, a smaller text proportion generally works better: performance improves as the visual coefficient grows from 0.3 to 0.7 and drops again at 0.9.
1. Results on the Facet method
For the facet method, I only use the BERT_64_mask vector in the corresponding experiments.
All experiments are based on the average importance operation.
The number of facets in all experiments is 8.
mtl means adding one middle layer to the network that generates importance scores from the text vector:
in other words, previously 1024 -> 8; now 1024 -> 512 -> 8.
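A rough sketch of how this importance scorer and the 'split' facet comparison might look (the softmax normalization, the names, and the facet-wise cosine weighting are assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FacetImportance(nn.Module):
    """Maps the 1024-d BERT vector to one importance score per facet:
    1024 -> 8 without mtl, or 1024 -> 512 -> 8 with mtl."""

    def __init__(self, text_dim=1024, n_facets=8, mtl=False):
        super().__init__()
        if mtl:
            self.net = nn.Sequential(nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, n_facets))
        else:
            self.net = nn.Linear(text_dim, n_facets)

    def forward(self, text_vec):
        # text_vec: [n_way, text_dim] -> scores: [n_way, n_facets], normalized to sum to 1
        return F.softmax(self.net(text_vec), dim=-1)

def facet_cosine_score(query, prototypes, importance, n_facets=8):
    """'split' variant: chop the feature vector into n_facets chunks and combine
    the per-facet cosine similarities, weighted by the importance scores."""
    # query: [feat_dim]; prototypes: [n_way, feat_dim]; importance: [n_way, n_facets]
    q = query.view(n_facets, -1)                             # [n_facets, facet_dim]
    p = prototypes.view(prototypes.size(0), n_facets, -1)    # [n_way, n_facets, facet_dim]
    sims = F.cosine_similarity(p, q.unsqueeze(0), dim=-1)    # [n_way, n_facets]
    return (importance * sims).sum(dim=-1)                   # [n_way]
```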
method | backbone | facet feature | mtl | iterations | 5-way 5-shot |
---|---|---|---|---|---|
ProtoNet_facet | ResNet_10 | FC | no | 60000 | 70.01% +- 0.70% |
ProtoNet_facet | ResNet_10 | FC | yes | 60000 | 69.98% +- 0.69% |
ProtoNet_facet | ResNet_10 | split | no | 60000 | 71.32% +- 0.66% |
ProtoNet_facet | ResNet_10 | split | yes | 60000 | 69.62% +- 0.65% |
From the above results, we can find that the 'split' facet feature outperforms the FC variant, and that adding the middle layer (mtl) does not help.
Update 2020.0705
Also, the previous facet experiments are based on the cosine metric; we run a comparison based on the Euclidean distance, as used in the original ProtoNet.
method | backbone | facet feature | metric | iterations | 5-way 5-shot |
---|---|---|---|---|---|
ProtoNet_facet | ResNet_10 | split | cosine | 60000 | 71.32% +- 0.66% |
ProtoNet_facet | ResNet_10 | split | euclidean | 60000 | 71.87% +- 0.70% |
We can see that the Euclidean metric is slightly better than the cosine metric (71.87% vs. 71.32%).
2. About finding images according to importance scores. I do this in the following way: during the test phase I split the images according to their importance scores. As there are 20 test classes, I only split those 20 classes (see the sketch after the listing below). However, I find that the model is not trained as we expect: with 8 facets, the most important facet of the 20 test classes should be spread across all 8 importance scores, but the most important facets of those 20 classes fall on only 3 facets (1, 3, 4).
The details are as follows:
{"4": ["n04146614", "n02871525", "n04522168", "n03775546", "n04149813", "n03272010", "n03146219", "n07613480", "n04418357", "n03127925"],
"3": ["n01930112", "n03544143", "n02219486", "n01981276"],
"1": ["n02099601", "n02129165", "n02110063", "n02110341", "n02116738", "n02443484"]}
For example, the most important facet of class n04146614 is facet 4.
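A small hypothetical sketch of how the grouping above can be produced (the input dictionary `class_importance` and the function name are illustrative only):

```python
from collections import defaultdict

def group_by_top_facet(class_importance):
    """Group test classes by the index of their most important facet.
    `class_importance` is assumed to map a class id (e.g. "n04146614")
    to its 8 importance scores averaged over the test episodes."""
    groups = defaultdict(list)
    for class_id, scores in class_importance.items():
        top_facet = max(range(len(scores)), key=lambda i: scores[i])
        groups[str(top_facet)].append(class_id)
    return dict(groups)
```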
As only 3 facets ever end up being the most important one, I think there may be two reasons behind this.
So I ran one experiment with 3 facets to see whether the model performs better in this setting. The experiment is running, and we can get the result before the night of 7.1. Update: I have run the experiment with 3 facets.
method | num_facets | iterations | facet_feature | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_Facet | 3 | 60000 | split | 71.91% +- 0.68% |
ProtoNet_Facet | 8 | 60000 | split | 71.32% +- 0.66% |
From the above experiments, we can see that reducing the number of facets improves performance, which is consistent with the earlier results using GloVe. However, it still cannot beat the baseline method. So maybe the model is not designed well, or this idea is not practical.
Update 2020.0705
To mitigate this problem, I try the following: add a loss on the distribution of importance scores. Say we get the importance scores of the 8 facets: i1, i2, ..., i8. Then this loss (dis_loss) is defined as i1^2 + i2^2 + ... + i8^2. We do this because we want the distribution of importance scores to be flatter, so that every facet is the most important one for at least one class, instead of only facet 1, facet 3 and facet 4 as before. To combine it with the original loss (o_loss), we use a lambda to control the weight of dis_loss; the total loss is lambda * dis_loss + o_loss. The number of facets is 8.
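A minimal sketch of this combined loss, assuming the importance scores are given as a tensor with one row per class (names are illustrative):

```python
import torch

def combined_loss(o_loss: torch.Tensor, importance_scores: torch.Tensor, lam: float) -> torch.Tensor:
    """dis_loss = i1^2 + ... + i8^2; since the scores sum to 1, minimizing the
    sum of squares pushes the distribution towards uniform, so every facet has
    a chance to be the most important one for some class."""
    # importance_scores: [n_classes, n_facets]
    dis_loss = (importance_scores ** 2).sum(dim=-1).mean()
    return o_loss + lam * dis_loss
```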
method | backbone | facet feature | lambda | iterations | 5-way 5-shot |
---|---|---|---|---|---|
ProtoNet_facet | ResNet_10 | split | None | 60000 | 71.32% +- 0.66% |
ProtoNet_facet | ResNet_10 | split | 1 | 60000 | 71.66% +- 0.63% |
ProtoNet_facet | ResNet_10 | split | 2 | 60000 | 72.37% +- 0.69% |
ProtoNet_facet | ResNet_10 | split | 3 | 60000 | 72.12% +- 0.65% |
We can see that adding this extra loss improves performance to some extent. However, the best result (72.37%) is still not better than ProtoNet (73.24%).
3. About the experimental results using 512 facets, as Steven suggested before
In this setting, the dimension of each facet is 1. As we discussed before, we first reduce the dimension of the BERT vector from 1024 to 64, then generate 512 importance scores from the 64-dimensional feature. However, these experiments do not converge.
The experimental results below are based on what Steven said in the e-mail of 2020.06.24; the content is as follows:
"Let x1,…,x5 be the 5 training images you have for a given class (e.g. cat), then the prototype for that class is currently computed as:
(f(x1) + … + f(x5)) / 5
where f(xi) is the encoding of the image xi. What I’m suggesting is to instead use:
(f(x1) + … + f(x5) + lambda g(class)) / (5 + lambda)"
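A minimal sketch of the prototype computation Steven describes (the function and variable names are mine; g(class) is assumed to already live in the same feature space as the visual encodings):

```python
import torch

def text_aware_prototype(support_feats: torch.Tensor, class_text_feat: torch.Tensor, lam: float) -> torch.Tensor:
    """(f(x1) + ... + f(x5) + lambda * g(class)) / (5 + lambda).
    support_feats: [n_shot, feat_dim] visual encodings f(xi);
    class_text_feat: [feat_dim] output of the mapping g() for this class."""
    n_shot = support_feats.size(0)
    return (support_feats.sum(dim=0) + lam * class_text_feat) / (n_shot + lam)
```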
All experiments are based on the BERT_64_mask vector.
1. When trained jointly:
method | backbone | lambda | iterations | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_Joint | ResNet_10 | 0.5 | 60000 | 61.61% +- 0.77% |
ProtoNet_Joint | ResNet_10 | 1 | 60000 | 63.84% +- 0.84% |
ProtoNet_Joint | ResNet_10 | 2 | 60000 | 63.94% +- 0.76% |
From the above experimental results, we can see that the performance is much worse than the other methods (70%+).
2. When trained separately:
Step 1: Train the model using visual features only
Step 2: Train the mapping g(class) using the prototypes you get with the model from Step 1 as training data, i.e. the task would be to predict the prototypes, rather than solve the few-shot classification task itself.
Step 3: Train the full model, using the models from step 1 and step 2 to initialize the parameters of the visual part and the text part respectively.
The experiment is running...
Update 2020.0709
Details:
Step 1: train a ProtoNet as before.
Step 2: the mapping network g() (fc -> normalization -> ReLU -> fc) outputs a text vector whose dimensionality is the same as the visual vector's. In a 5-way 5-shot setting we get 5 visual prototypes (v1, v2, ..., v5) and, by applying g() to the original BERT vectors, 5 text vectors (t1, t2, ..., t5). We then want the visual and text vectors belonging to the same class to be as similar as possible. There are two ways to measure similarity: 1) Euclidean distance; 2) cosine similarity.
Problem: when using Euclidean distance, the model does not converge. The reason may be that the two feature spaces are different, so directly applying the Euclidean distance confuses the model.
Therefore, Step 2 is trained with cosine similarity.
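A rough sketch of Step 2 as described above (the hidden size, the use of LayerNorm for the normalization step, and the `1 - cosine` loss form are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextMapping(nn.Module):
    """Mapping g(): fc -> normalization -> ReLU -> fc, producing a text vector
    with the same dimensionality as the visual prototype."""

    def __init__(self, text_dim=1024, hidden_dim=512, visual_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, bert_vec):
        return self.net(bert_vec)

def step2_loss(visual_prototypes, text_vectors):
    # Maximize cosine similarity between the visual prototype and the mapped
    # text vector of the same class (the Euclidean version did not converge).
    cos = F.cosine_similarity(visual_prototypes, text_vectors, dim=-1)  # [n_way]
    return (1.0 - cos).mean()
```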
Step 3: this step is trained in a standard 5-way 5-shot setting, just like Step 1. More concretely, we get 5 visual prototypes (v1, v2, ..., v5) and 5 text "prototypes" (t1, t2, ..., t5). Given a query image, we get the corresponding visual feature I, then we calculate the similarity in this way:
Euclidean_distance(I, vi) + lambda * cosine_similarity(I, ti) (i = 1, 2, …,5)
We also tried Euclidean_distance(I, vi) + lambda * Euclidean_distance(I, ti), but only got 51.53% accuracy; it could hardly be worse!
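A minimal sketch of this Step 3 scoring (using the negative Euclidean distance so that larger scores mean a closer match is my assumption about the sign convention):

```python
import torch
import torch.nn.functional as F

def step3_scores(query_feat, visual_protos, text_protos, lam):
    """Score = -Euclidean_distance(I, vi) + lambda * cosine_similarity(I, ti)
    for each of the 5 classes; the result can be used directly as logits."""
    # query_feat: [feat_dim]; visual_protos, text_protos: [n_way, feat_dim]
    eucl = torch.cdist(query_feat.unsqueeze(0), visual_protos).squeeze(0)    # [n_way]
    cos = F.cosine_similarity(text_protos, query_feat.unsqueeze(0), dim=-1)  # [n_way]
    return -eucl + lam * cos
```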
method | backbone | lambda | iterations (lr)(lr_decay) | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_separately | ResNet_10 | 1 | 60000 (0.0001) | 75.38% +- 0.63% |
ProtoNet_separately | ResNet_10 | 10 | 60000 (0.0001) | 74.80% +- 0.65% |
ProtoNet_separately | ResNet_10 | 5 | 60000 (0.0001) | 76.30% +- 0.76% |
ProtoNet_separately | ResNet_10 | 5 | (30000, 60000) (0.0005)(0.2) | 76.38% +- 0.61% |
ProtoNet_separately | ResNet_10 | 5 | (20000, 40000, 60000) (0.0005)(0.2) | 75.68% +- 0.63% |
In my opinion, judging from these results alone, we can say that we have achieved the best results among metric-based methods.
Using new BERT vectors from Zied