Results on the previous method (joint prototype): let fv be the visual prototype feature and ft the text feature; the joint feature is f = coefficient * fv + (1 - coefficient) * ft, where the coefficient is generated by a fully-connected network that takes ft as input.
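A minimal PyTorch sketch of this combination (the module name, the layer sizes, and the sigmoid squashing are assumptions; it also assumes ft has already been projected to the same dimensionality as fv):

```python
import torch
import torch.nn as nn

class JointPrototype(nn.Module):
    """Joint prototype f = c * fv + (1 - c) * ft, where the coefficient c
    is predicted by a small fully-connected network from the text feature ft."""

    def __init__(self, text_dim):
        super().__init__()
        # hypothetical layer sizes; the sigmoid keeps the coefficient in (0, 1)
        self.coef_net = nn.Sequential(
            nn.Linear(text_dim, text_dim // 2),
            nn.ReLU(),
            nn.Linear(text_dim // 2, 1),
            nn.Sigmoid(),
        )

    def forward(self, fv, ft):
        # fv, ft: [n_way, feat_dim] visual prototypes and text features
        c = self.coef_net(ft)            # [n_way, 1]
        return c * fv + (1.0 - c) * ft   # [n_way, feat_dim]
```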
method | backbone | using_text_vector | iterations | 5-way 5-shot |
---|---|---|---|---|
ProtoNet(baseline) | ResNet_10 | None | 60000 | 73.24% +- 0.63% |
ProtoNet_Joint | ResNet_10 | BERT_64_mask | 60000 | 73.31% +- 0.66% |
ProtoNet_Joint | ResNet_10 | GloVe | 60000 | 73.28% +- 0.67% |
ProtoNet_Joint | ResNet_10 | Con1 | 60000 | 73.24% +- 0.68% |
ProtoNet_Joint | ResNet_10 | Con2 | 60000 | 73.09% +- 0.70% |
ProtoNet_Joint | ResNet_10 | Con3 | 60000 | 72.95% +- 0.67% |
ProtoNet_Joint | ResNet_10 | Con4 | 60000 | 72.92% +- 0.64% |
ProtoNet_Joint | ResNet_10 | BERT_64_unmask | 60000 | 72.76% +- 0.68% |
ProtoNet_Joint | ResNet_10 | BERT_100_mask | 60000 | 73.12% +- 0.71% |
ProtoNet_Joint | ResNet_10 | BERT_100_unmask | 60000 | 72.32% +- 0.65% |
From the above results, we can find that the joint prototype with a learned coefficient performs on par with the baseline at best, regardless of which text vector is used.
Update 2020.0705
Now we define the coefficient as a hyperparameter and set its value manually, instead of generating it with a network as before.
The results are as below; the text vector is BERT_64_mask.
method | backbone | coefficient | iterations | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_Joint | ResNet_10 | 0.3 | 60000 | 72.18% +- 0.67% |
ProtoNet_Joint | ResNet_10 | 0.5 | 60000 | 73.54% +- 0.69% |
ProtoNet_Joint | ResNet_10 | 0.7 | 60000 | 74.28% +- 0.68% |
ProtoNet_Joint | ResNet_10 | 0.9 | 60000 | 73.64% +- 1.48% |
Better news: the method with coefficient 0.7 is clearly better than the baseline (74.28% vs. 73.24%).
Also, we can see that when combining the text feature, a smaller text proportion generally works better: performance improves as the visual coefficient grows from 0.3 to 0.7 and drops again at 0.9.
1. Results on the Facet method
For the facet method, I only use the BERT_64_mask vector in the corresponding experiments.
All experiments are based on the average importance operation.
The number of facets in all experiments is 8.
mtl means adding one middle layer to the network that generates importance scores from the text vector:
in other words, previously 1024 -> 8; now 1024 -> 512 -> 8.
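A rough sketch of how this importance scorer and the 'split' facet comparison might look (the softmax normalization, the names, and the facet-wise cosine weighting are assumptions, not the exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FacetImportance(nn.Module):
    """Maps the 1024-d BERT vector to one importance score per facet:
    1024 -> 8 without mtl, or 1024 -> 512 -> 8 with mtl."""

    def __init__(self, text_dim=1024, n_facets=8, mtl=False):
        super().__init__()
        if mtl:
            self.net = nn.Sequential(nn.Linear(text_dim, 512), nn.ReLU(), nn.Linear(512, n_facets))
        else:
            self.net = nn.Linear(text_dim, n_facets)

    def forward(self, text_vec):
        # text_vec: [n_way, text_dim] -> scores: [n_way, n_facets], normalized to sum to 1
        return F.softmax(self.net(text_vec), dim=-1)

def facet_cosine_score(query, prototypes, importance, n_facets=8):
    """'split' variant: chop the feature vector into n_facets chunks and combine
    the per-facet cosine similarities, weighted by the importance scores."""
    # query: [feat_dim]; prototypes: [n_way, feat_dim]; importance: [n_way, n_facets]
    q = query.view(n_facets, -1)                             # [n_facets, facet_dim]
    p = prototypes.view(prototypes.size(0), n_facets, -1)    # [n_way, n_facets, facet_dim]
    sims = F.cosine_similarity(p, q.unsqueeze(0), dim=-1)    # [n_way, n_facets]
    return (importance * sims).sum(dim=-1)                   # [n_way]
```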
method | backbone | facet feature | mtl | iterations | 5-way 5-shot |
---|---|---|---|---|---|
ProtoNet_facet | ResNet_10 | FC | no | 60000 | 70.01% +- 0.70% |
ProtoNet_facet | ResNet_10 | FC | yes | 60000 | 69.98% +- 0.69% |
ProtoNet_facet | ResNet_10 | split | no | 60000 | 71.32% +- 0.66% |
ProtoNet_facet | ResNet_10 | split | yes | 60000 | 69.62% +- 0.65% |
From the above results, we can find that the 'split' facet feature outperforms the FC variant, and that adding the middle layer (mtl) does not help.
Update 2020.0705
Also, the previous facet experiments are based on the cosine metric; we run a comparison based on the Euclidean distance, as used in the original ProtoNet.
method | backbone | facet feature | metric | iterations | 5-way 5-shot |
---|---|---|---|---|---|
ProtoNet_facet | ResNet_10 | split | cosine | 60000 | 71.32% +- 0.66% |
ProtoNet_facet | ResNet_10 | split | euclidean | 60000 | 71.87% +- 0.70% |
We can see that the Euclidean metric is slightly better than the cosine metric (71.87% vs. 71.32%).
2. About finding images according to importance scores. I do this in the following way: during the test phase I split the images according to their importance scores. As there are 20 test classes, I only split those 20 classes (see the sketch after the listing below). However, I find that the model is not trained as we expect: with 8 facets, the most important facet of the 20 test classes should be spread across all 8 importance scores, but the most important facets of those 20 classes fall on only 3 facets (1, 3, 4).
The details are as follows:
{"4": ["n04146614", "n02871525", "n04522168", "n03775546", "n04149813", "n03272010", "n03146219", "n07613480", "n04418357", "n03127925"],
"3": ["n01930112", "n03544143", "n02219486", "n01981276"],
"1": ["n02099601", "n02129165", "n02110063", "n02110341", "n02116738", "n02443484"]}
For example, the most important facet of class n04146614 is facet 4.
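A small hypothetical sketch of how the grouping above can be produced (the input dictionary `class_importance` and the function name are illustrative only):

```python
from collections import defaultdict

def group_by_top_facet(class_importance):
    """Group test classes by the index of their most important facet.
    `class_importance` is assumed to map a class id (e.g. "n04146614")
    to its 8 importance scores averaged over the test episodes."""
    groups = defaultdict(list)
    for class_id, scores in class_importance.items():
        top_facet = max(range(len(scores)), key=lambda i: scores[i])
        groups[str(top_facet)].append(class_id)
    return dict(groups)
```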
As only 3 facets ever end up being the most important one, I think there may be two reasons behind this.
So I ran one experiment with 3 facets to see whether the model performs better in this setting. The experiment is running, and we can get the result before the night of 7.1. Update: I have run the experiment with 3 facets.
method | num_facets | iterations | facet_feature | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_Facet | 3 | 60000 | split | 71.91% +- 0.68% |
ProtoNet_Facet | 8 | 60000 | split | 71.32% +- 0.66% |
From the above experiments, we can see that reducing the number of facets improves performance, which is consistent with the earlier results using GloVe. However, it still cannot beat the baseline method. So maybe the model is not designed well, or this idea is not practical.
Update 2020.0705
To mitigate this problem, I try the following: add a loss on the distribution of importance scores. Say we get the importance scores of the 8 facets: i1, i2, ..., i8. Then this loss (dis_loss) is defined as i1^2 + i2^2 + ... + i8^2. We do this because we want the distribution of importance scores to be flatter, so that every facet is the most important one for at least one class, instead of only facet 1, facet 3 and facet 4 as before. To combine it with the original loss (o_loss), we use a lambda to control the weight of dis_loss; the total loss is lambda * dis_loss + o_loss. The number of facets is 8.
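A minimal sketch of this combined loss, assuming the importance scores are given as a tensor with one row per class (names are illustrative):

```python
import torch

def combined_loss(o_loss: torch.Tensor, importance_scores: torch.Tensor, lam: float) -> torch.Tensor:
    """dis_loss = i1^2 + ... + i8^2; since the scores sum to 1, minimizing the
    sum of squares pushes the distribution towards uniform, so every facet has
    a chance to be the most important one for some class."""
    # importance_scores: [n_classes, n_facets]
    dis_loss = (importance_scores ** 2).sum(dim=-1).mean()
    return o_loss + lam * dis_loss
```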
method | backbone | facet feature | lambda | iterations | 5-way 5-shot |
---|---|---|---|---|---|
ProtoNet_facet | ResNet_10 | split | None | 60000 | 71.32% +- 0.66% |
ProtoNet_facet | ResNet_10 | split | 1 | 60000 | 71.66% +- 0.63% |
ProtoNet_facet | ResNet_10 | split | 2 | 60000 | 72.37% +- 0.69% |
ProtoNet_facet | ResNet_10 | split | 3 | 60000 | 72.12% +- 0.65% |
We can see that adding this extra loss improves performance to some extent. However, the best result (72.37%) is still not better than ProtoNet (73.24%).
3. About the experimental results using 512 facets, as Steven suggested before
In this setting, the dimension of each facet is 1. As we discussed before, we first reduce the dimension of the BERT vector from 1024 to 64, then generate 512 importance scores from the 64-dimensional feature. However, these experiments do not converge.
The experimental results below are based on what Steven said in the e-mail of 2020.06.24; the content is as follows:
"Let x1,…,x5 be the 5 training images you have for a given class (e.g. cat), then the prototype for that class is currently computed as:
(f(x1) + … + f(x5)) / 5
where f(xi) is the encoding of the image xi. What I’m suggesting is to instead use:
(f(x1) + … + f(x5) + lambda g(class)) / (5 + lambda)"
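A minimal sketch of the prototype computation Steven describes (the function and variable names are mine; g(class) is assumed to already live in the same feature space as the visual encodings):

```python
import torch

def text_aware_prototype(support_feats: torch.Tensor, class_text_feat: torch.Tensor, lam: float) -> torch.Tensor:
    """(f(x1) + ... + f(x5) + lambda * g(class)) / (5 + lambda).
    support_feats: [n_shot, feat_dim] visual encodings f(xi);
    class_text_feat: [feat_dim] output of the mapping g() for this class."""
    n_shot = support_feats.size(0)
    return (support_feats.sum(dim=0) + lam * class_text_feat) / (n_shot + lam)
```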
All experiments are based on the BERT_64_mask vector.
1. When trained jointly:
method | backbone | lambda | iterations | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_Joint | ResNet_10 | 0.5 | 60000 | 61.61% +- 0.77% |
ProtoNet_Joint | ResNet_10 | 1 | 60000 | 63.84% +- 0.84% |
ProtoNet_Joint | ResNet_10 | 2 | 60000 | 63.94% +- 0.76% |
From the above experimental results, we can see that the performance is much worse than the other methods (70%+).
2. When trained separately:
Step 1: Train the model using visual features only
Step 2: Train the mapping g(class) using the prototypes you get with the model from Step 1 as training data, i.e. the task would be to predict the prototypes, rather than solve the few-shot classification task itself.
Step 3: Train the full model, using the models from step 1 and step 2 to initialize the parameters of the visual part and the text part respectively.
The experiment is running...
Update 2020.0709
Details:
Step 1: train a ProtoNet as before.
Step 2: the mapping network g() (fc -> normalization -> ReLU -> fc) outputs a text vector whose dimensionality is the same as the visual vector's. In a 5-way 5-shot setting we get 5 visual prototypes (v1, v2, ..., v5) and, by applying g() to the original BERT vectors, 5 text vectors (t1, t2, ..., t5). We then want the visual and text vectors belonging to the same class to be as similar as possible. There are two ways to measure similarity: 1) Euclidean distance; 2) cosine similarity.
Problem: when using Euclidean distance, the model does not converge. The reason may be that the two feature spaces are different, so directly applying the Euclidean distance confuses the model.
Therefore, Step 2 is trained with cosine similarity.
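A rough sketch of Step 2 as described above (the hidden size, the use of LayerNorm for the normalization step, and the `1 - cosine` loss form are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextMapping(nn.Module):
    """Mapping g(): fc -> normalization -> ReLU -> fc, producing a text vector
    with the same dimensionality as the visual prototype."""

    def __init__(self, text_dim=1024, hidden_dim=512, visual_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(text_dim, hidden_dim),
            nn.LayerNorm(hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, visual_dim),
        )

    def forward(self, bert_vec):
        return self.net(bert_vec)

def step2_loss(visual_prototypes, text_vectors):
    # Maximize cosine similarity between the visual prototype and the mapped
    # text vector of the same class (the Euclidean version did not converge).
    cos = F.cosine_similarity(visual_prototypes, text_vectors, dim=-1)  # [n_way]
    return (1.0 - cos).mean()
```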
Step 3: this step is trained in a standard 5-way 5-shot setting, just like Step 1. More concretely, we get 5 visual prototypes (v1, v2, ..., v5) and 5 text "prototypes" (t1, t2, ..., t5). Given a query image, we get the corresponding visual feature I, then we calculate the similarity in this way:
Euclidean_distance(I, vi) + lambda * cosine_similarity(I, ti) (i = 1, 2, …,5)
We also tried Euclidean_distance(I, vi) + lambda * Euclidean_distance(I, ti), but only got 51.53% accuracy; it could hardly be worse!
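A minimal sketch of this Step 3 scoring (using the negative Euclidean distance so that larger scores mean a closer match is my assumption about the sign convention):

```python
import torch
import torch.nn.functional as F

def step3_scores(query_feat, visual_protos, text_protos, lam):
    """Score = -Euclidean_distance(I, vi) + lambda * cosine_similarity(I, ti)
    for each of the 5 classes; the result can be used directly as logits."""
    # query_feat: [feat_dim]; visual_protos, text_protos: [n_way, feat_dim]
    eucl = torch.cdist(query_feat.unsqueeze(0), visual_protos).squeeze(0)    # [n_way]
    cos = F.cosine_similarity(text_protos, query_feat.unsqueeze(0), dim=-1)  # [n_way]
    return -eucl + lam * cos
```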
method | backbone | lambda | iterations (lr)(lr_decay) | 5-way 5-shot |
---|---|---|---|---|
ProtoNet_separately | ResNet_10 | 1 | 60000 (0.0001) | 75.38% +- 0.63% |
ProtoNet_separately | ResNet_10 | 10 | 60000 (0.0001) | 74.80% +- 0.65% |
ProtoNet_separately | ResNet_10 | 5 | 60000 (0.0001) | 76.30% +- 0.76% |
ProtoNet_separately | ResNet_10 | 5 | (30000, 60000) (0.0005)(0.2) | 76.38% +- 0.61% |
ProtoNet_separately | ResNet_10 | 5 | (20000, 40000, 60000) (0.0005)(0.2) | 75.68% +- 0.63% |
In my opinion, judging from these results alone, we can say that we have achieved the best results among metric-based methods.
Using new BERT vectors from Zied