LargeFishPKU / Joint-KG-Vision-Project


Experiment Results Based on New BERT #3

Open LargeFishPKU opened 4 years ago

LargeFishPKU commented 4 years ago

Using new BERT vectors from Zied

LargeFishPKU commented 4 years ago

Results with the previous method (joint prototype): let fv be the visual prototype feature and ft the text feature; the joint feature is f = coefficient * fv + (1 - coefficient) * ft, where the coefficient is generated by a fully connected network that takes ft as input.
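For concreteness, here is a minimal sketch of how this joint prototype could be implemented, assuming a PyTorch codebase; the layer sizes, the projection of the text vector into the visual space, and all names are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

class JointPrototype(nn.Module):
    def __init__(self, feat_dim=512, text_dim=64):
        super().__init__()
        # assumed: project the text vector into the visual feature space so the sum is well defined
        self.text_proj = nn.Linear(text_dim, feat_dim)
        # coefficient generated by a small fully connected network from ft, squashed to (0, 1)
        self.coef_net = nn.Sequential(
            nn.Linear(text_dim, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, fv, ft):
        # fv: [n_way, feat_dim] visual prototypes; ft: [n_way, text_dim] BERT/GloVe vectors
        c = self.coef_net(ft)                         # [n_way, 1]
        return c * fv + (1 - c) * self.text_proj(ft)  # f = coefficient * fv + (1 - coefficient) * ft
```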

| method | backbone | text vector | iterations | 5-way 5-shot |
| --- | --- | --- | --- | --- |
| ProtoNet (baseline) | ResNet_10 | None | 60000 | 73.24% +- 0.63% |
| ProtoNet_Joint | ResNet_10 | BERT_64_mask | 60000 | 73.31% +- 0.66% |
| ProtoNet_Joint | ResNet_10 | GloVe | 60000 | 73.28% +- 0.67% |
| ProtoNet_Joint | ResNet_10 | Con1 | 60000 | 73.24% +- 0.68% |
| ProtoNet_Joint | ResNet_10 | Con2 | 60000 | 73.09% +- 0.70% |
| ProtoNet_Joint | ResNet_10 | Con3 | 60000 | 72.95% +- 0.67% |
| ProtoNet_Joint | ResNet_10 | Con4 | 60000 | 72.92% +- 0.64% |
| ProtoNet_Joint | ResNet_10 | BERT_64_unmask | 60000 | 72.76% +- 0.68% |
| ProtoNet_Joint | ResNet_10 | BERT_100_mask | 60000 | 73.12% +- 0.71% |
| ProtoNet_Joint | ResNet_10 | BERT_100_unmask | 60000 | 72.32% +- 0.65% |

From the above results, we can find:

  1. Masked BERT performs better than the unmasked one.
  2. The new BERT vectors perform better than the previous ones, whose results can be found in the link.
  3. However, the previous joint training method still cannot improve on the baseline.

Update 2020.07.05
Now we define the coefficient as a hyperparameter and set its value manually, instead of generating it with a network as before.
The results are as below; the text vector is BERT_64_mask.

| method | backbone | coefficient | iterations | 5-way 5-shot |
| --- | --- | --- | --- | --- |
| ProtoNet_Joint | ResNet_10 | 0.3 | 60000 | 72.18% +- 0.67% |
| ProtoNet_Joint | ResNet_10 | 0.5 | 60000 | 73.54% +- 0.69% |
| ProtoNet_Joint | ResNet_10 | 0.7 | 60000 | 74.28% +- 0.68% |
| ProtoNet_Joint | ResNet_10 | 0.9 | 60000 | 73.64% +- 1.48% |

Better news: the method with coefficient 0.7 is clearly better than the baseline (74.28% vs. 73.24%).
We can also see that, when combining in the text feature, the smaller the proportion of the text feature, the better the result.

LargeFishPKU commented 4 years ago

1. Results on the Facet method

| method | backbone | facet feature | MTL | iterations | 5-way 5-shot |
| --- | --- | --- | --- | --- | --- |
| ProtoNet_facet | ResNet_10 | FC | no | 60000 | 70.01% +- 0.70% |
| ProtoNet_facet | ResNet_10 | FC | yes | 60000 | 69.98% +- 0.69% |
| ProtoNet_facet | ResNet_10 | split | no | 60000 | 71.32% +- 0.66% |
| ProtoNet_facet | ResNet_10 | split | yes | 60000 | 69.62% +- 0.65% |

From the above results, we can find:

  1. Introducing more parameters does not help performance (using MTL is worse).
  2. The split operation performs better than FC, as it does not introduce extra parameters (a sketch of the two variants follows this list).
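As a rough illustration of the difference, here is a minimal sketch of the two facet-feature variants; the feature dimensions, names, and layer choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn

feat_dim, n_facets = 512, 8

# "FC" variant: a separate linear head per facet, which introduces extra parameters.
fc_heads = nn.ModuleList(nn.Linear(feat_dim, feat_dim // n_facets) for _ in range(n_facets))

def facets_fc(x):
    # x: [batch, feat_dim] -> [batch, n_facets, feat_dim // n_facets]
    return torch.stack([head(x) for head in fc_heads], dim=1)

# "split" variant: simply chunk the visual feature into n_facets pieces, no extra parameters.
def facets_split(x):
    # x: [batch, feat_dim] -> [batch, n_facets, feat_dim // n_facets]
    return torch.stack(torch.chunk(x, n_facets, dim=-1), dim=1)
```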

Update 2020.07.05
The previous facet experiments are based on the cosine metric, so we also run a comparison based on the Euclidean distance used in the original ProtoNet.

| method | backbone | facet feature | metric | iterations | 5-way 5-shot |
| --- | --- | --- | --- | --- | --- |
| ProtoNet_facet | ResNet_10 | split | cosine | 60000 | 71.32% +- 0.66% |
| ProtoNet_facet | ResNet_10 | split | euclidean | 60000 | 71.87% +- 0.70% |

We can find that the Euclidean metric is a little better than the cosine metric.
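For reference, here is a minimal sketch of the two metrics compared above, applied per facet; the aggregation by importance scores is omitted and all shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def facet_scores(query, protos, metric="euclidean"):
    # query: [n_query, n_facets, d]; protos: [n_way, n_facets, d]
    q = query.unsqueeze(1)                   # [n_query, 1, n_facets, d]
    p = protos.unsqueeze(0)                  # [1, n_way, n_facets, d]
    if metric == "euclidean":
        return -((q - p) ** 2).sum(-1)       # negative squared distance: higher = more similar
    qn, pn = F.normalize(q, dim=-1), F.normalize(p, dim=-1)
    return (qn * pn).sum(-1)                 # cosine similarity, [n_query, n_way, n_facets]
```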

2. About finding images according to importance scores. I do this in the following way: I split the images during the test phase according to their importance scores. Since there are 20 test classes, I just split those 20 classes. However, I find the model is not trained as we expected: when using 8 facets, the most important facet of those 20 classes should be spread across all 8 importance scores, but the most important facets of those 20 classes correspond to only 3 facets (1, 3, 4).

The details are as follows (most important facet index: classes for which it is the top facet):
{"4": ["n04146614", "n02871525", "n04522168", "n03775546", "n04149813", "n03272010", "n03146219", "n07613480", "n04418357", "n03127925"],
"3": ["n01930112", "n03544143", "n02219486", "n01981276"],
"1": ["n02099601", "n02129165", "n02110063", "n02110341", "n02116738", "n02443484"]}
For example, the most important facet of class n04146614 is facet 4.
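The grouping above could be collected at test time roughly as follows; the importance network, its input, and all shapes are assumptions used only to illustrate the bookkeeping.

```python
import torch
from collections import defaultdict

def most_important_facets(class_ids, text_vectors, importance_net):
    # text_vectors: [n_class, text_dim]; importance_net is assumed to return [n_class, n_facets] scores
    scores = importance_net(text_vectors)
    top_facet = scores.argmax(dim=-1)            # dominant facet index per class
    groups = defaultdict(list)
    for cls, facet in zip(class_ids, top_facet.tolist()):
        groups[facet].append(cls)                # e.g. {4: ["n04146614", ...], 3: [...], 1: [...]}
    return dict(groups)
```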

Since only 3 facets are actually selected as most important, I think there may be two reasons:

  1. 3 facets are enough, instead of 8 or more.
  2. The architecture of the model is not designed well, since the model is not trained as we expect.

So I ran one experiment with 3 facets to see whether the model performs better in this setting. The experiment is running, and we can get the result before the night of 7.1.
Update: I have run the experiment with the number of facets set to 3.
| method | num_facets | iterations | facet feature | 5-way 5-shot |
| --- | --- | --- | --- | --- |
| ProtoNet_Facet | 3 | 60000 | split | 71.91% +- 0.68% |
| ProtoNet_Facet | 8 | 60000 | split | 71.32% +- 0.66% |

From the above experiments, we can see that reducing the number of facets improves performance, which is consistent with the earlier results using GloVe. However, it still cannot perform better than the baseline method, so either the model is not designed well or this idea is not practical.

Update 2020.07.05

To mitigate this problem, I try the following: add a loss on the distribution of the importance scores. Say we get the importance scores of the 8 facets: i1, i2, ..., i8. Then this loss (dis_loss) is defined as i1^2 + i2^2 + ... + i8^2. We do this because we want the distribution of importance scores to be flatter (if the scores sum to 1, this sum of squares is minimized by a uniform distribution), so that every facet becomes the most important facet of one or more classes, instead of only facet 1, facet 3, and facet 4 as before. To combine it with the original loss (o_loss), we use a lambda to control the weight of dis_loss; the total loss is lambda * dis_loss + o_loss. The number of facets is 8.
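A minimal sketch of this extra loss, assuming the importance scores are a normalized distribution over facets; the names are illustrative.

```python
import torch

def dis_loss(importance):
    # importance: [n_class, n_facets], each row assumed to sum to 1
    return (importance ** 2).sum(dim=-1).mean()   # smallest when the scores are uniform

def total_loss(o_loss, importance, lam=2.0):
    return lam * dis_loss(importance) + o_loss    # lambda * dis_loss + o_loss
```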

| method | backbone | facet feature | lambda | iterations | 5-way 5-shot |
| --- | --- | --- | --- | --- | --- |
| ProtoNet_facet | ResNet_10 | split | None | 60000 | 71.32% +- 0.66% |
| ProtoNet_facet | ResNet_10 | split | 1 | 60000 | 71.66% +- 0.63% |
| ProtoNet_facet | ResNet_10 | split | 2 | 60000 | 72.37% +- 0.69% |
| ProtoNet_facet | ResNet_10 | split | 3 | 60000 | 72.12% +- 0.65% |

We can see that adding this extra loss improves the performance to some extent. However, the best result (72.37%) is still not better than ProtoNet (73.24%).

3. About the experiment results using 512 facets, as Steven suggested before.
In this setting, the dimension of each facet is 1. As we discussed before, we first reduce the dimension of the BERT vector from 1024 to 64, then generate 512 importance scores from the 64-dimensional feature. However, these experiments do not converge.
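For clarity, here is a minimal sketch of the 512-facet setting described above; the layer sizes follow the description, and the use of a softmax over the scores is an assumption.

```python
import torch
import torch.nn as nn

reduce = nn.Linear(1024, 64)            # BERT vector: 1024 -> 64
importance_head = nn.Linear(64, 512)    # 512 importance scores, one per 1-dimensional facet

def facet_importance(bert_vec):
    # bert_vec: [n_class, 1024] -> [n_class, 512] importance scores
    return torch.softmax(importance_head(reduce(bert_vec)), dim=-1)
```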

LargeFishPKU commented 4 years ago

These experimental results are based on what Steven suggested in the e-mail of 2020.06.24; the content is as follows:

"Let x1,…,x5 be the 5 training images you have for a given class (e.g. cat), then the prototype for that class is currently computed as:

(f(x1) + … + f(x5)) / 5

where f(xi) is the encoding of the image xi. What I’m suggesting is to instead use:

(f(x1) + … + f(x5) + lambda g(class)) / (5 + lambda)"
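A minimal sketch of this lambda-weighted prototype; the shapes and the mapping g(class) are assumptions.

```python
import torch

def class_prototype(support_feats, class_text_embed, lam):
    # support_feats: [n_shot, d] visual encodings f(x1..x5); class_text_embed: [d] = g(class)
    return (support_feats.sum(dim=0) + lam * class_text_embed) / (support_feats.shape[0] + lam)
```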

All experiments are based on the BERT_64_mask vector.

1. When trained jointly:

| method | backbone | lambda | iterations | 5-way 5-shot |
| --- | --- | --- | --- | --- |
| ProtoNet_Joint | ResNet_10 | 0.5 | 60000 | 61.61% +- 0.77% |
| ProtoNet_Joint | ResNet_10 | 1 | 60000 | 63.84% +- 0.84% |
| ProtoNet_Joint | ResNet_10 | 2 | 60000 | 63.94% +- 0.76% |

From the above experimental results, we can see that the performance is much worse than the other methods (70%+).

2. When trained separately:
Step 1: Train the model using visual features only.
Step 2: Train the mapping g(class) using the prototypes obtained with the model from Step 1 as training data, i.e., the task is to predict the prototypes rather than to solve the few-shot classification task itself.
Step 3: Train the full model, using the models from Step 1 and Step 2 to initialize the parameters of the visual part and the text part respectively.
The experiment is running...
Update 2020.07.09. Details:
Step 1: train the previous ProtoNet.
Step 2: the mapping network g() is fc -> normalization -> ReLU -> fc, and its output is a text vector with the same dimensionality as the visual vector. In the 5-way 5-shot setting we get 5 visual prototypes (v1, v2, ..., v5), and from g() applied to the original BERT vectors we get 5 text vectors (t1, t2, ..., t5). We then want the visual and text vectors belonging to the same class to be as similar as possible. There are two ways to measure similarity: 1) Euclidean distance; 2) cosine similarity. Problem: when using Euclidean distance, the model cannot converge; the reason may be that the two feature spaces are different, so directly applying the Euclidean distance confuses the model. Therefore Step 2 is trained with cosine similarity.
Step 3: this step is trained in the usual 5-way 5-shot setting, just like Step 1. More concretely, we get 5 visual prototypes (v1, v2, ..., v5) and 5 text "prototypes" (t1, t2, ..., t5). Given a query image, we get its visual feature I, then we calculate the score as Euclidean_distance(I, vi) + lambda * cosine_similarity(I, ti) (i = 1, 2, ..., 5).
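A minimal sketch of this Step 3 scoring rule; shapes are assumptions, and the combination is written literally as described above.

```python
import torch
import torch.nn.functional as F

def query_score(I, visual_protos, text_protos, lam):
    # I: [d] query feature; visual_protos / text_protos: [n_way, d]
    q = I.unsqueeze(0).expand_as(visual_protos)
    euclid = torch.norm(q - visual_protos, dim=-1)       # Euclidean_distance(I, vi)
    cos = F.cosine_similarity(q, text_protos, dim=-1)    # cosine_similarity(I, ti)
    # literal combination as described; how it is turned into a classification logit
    # (e.g. by negating the distance term) is not specified in the comment above
    return euclid + lam * cos
```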

I also tried Euclidean_distance(I, vi) + lambda * Euclidean_distance(I, ti), but it only reaches 51.53% accuracy, which could hardly be worse.

| method | backbone | lambda | iterations | lr | lr_decay | 5-way 5-shot |
| --- | --- | --- | --- | --- | --- | --- |
| ProtoNet_separately | ResNet_10 | 1 | 60000 | 0.0001 | - | 75.38% +- 0.63% |
| ProtoNet_separately | ResNet_10 | 10 | 60000 | 0.0001 | - | 74.80% +- 0.65% |
| ProtoNet_separately | ResNet_10 | 5 | 60000 | 0.0001 | - | 76.30% +- 0.76% |
| ProtoNet_separately | ResNet_10 | 5 | (30000, 60000) | 0.0005 | 0.2 | 76.38% +- 0.61% |
| ProtoNet_separately | ResNet_10 | 5 | (20000, 40000, 60000) | 0.0005 | 0.2 | 75.68% +- 0.63% |

In my opinion, judging from these results alone, it is fair to say that we have achieved the best results among metric-based methods.