GuangxingHan / QA-FewDet

Code for ICCV 2021 paper: 'Query Adaptive Few-Shot Object Detection with Heterogeneous Graph Convolutional Networks'

The inference process runs repeatedly? #2

Open ZhiXianZ opened 2 years ago

ZhiXianZ commented 2 years ago

During my meta-training, the inference process for k = 1, 2, 3, 5, 10 was repeated twice, and the final AP values were exactly the same. Is this a problem with the code?

GuangxingHan commented 2 years ago

Can you show the running log of the meta-training process?

Yes, we have two steps for meta-training. The first step simply trains the baseline model following FewX. The second step trains our full model. Both steps produce meta-testing results under 1/2/3/5/10 shots. They should be different, with the second step giving much stronger results.

ZhiXianZ commented 2 years ago

> Can you show the running log of the meta-training process?
>
> Yes, we have two steps for meta-training. The first step simply trains the baseline model following FewX. The second step trains our full model. Both steps produce meta-testing results under 1/2/3/5/10 shots. They should be different, with the second step giving much stronger results.

Thanks, I know. But for both meta-training stage 1 and stage 2, the inference was repeated. log.txt

GuangxingHan commented 2 years ago

The log only shows the full log of the first-step meta-training/testing, ending at line 3075. From line 3076 to the end of the file, it is strange to see another set of meta-testing results without any meta-training.

Can you manually run the second meta-training step using the command inside the script here?

ZhiXianZ commented 2 years ago

> The log only shows the full log of the first-step meta-training/testing, ending at line 3075. From line 3076 to the end of the file, it is strange to see another set of meta-testing results without any meta-training.
>
> Can you manually run the second meta-training step using the command inside the script here?

No, I just modified the BATCH_SIZE and num-gpus.

meta_training_pascalvoc_split2_resnet101.sh:

CUDA_VISIBLE_DEVICES=0 python3 fsod_train_net_fewx.py --num-gpus 1 --dist-url auto \
    --config-file configs/fsod/meta_training_pascalvoc_split2_resnet101_stage_1.yaml 2>&1 | tee log/meta_training_pascalvoc_split2_resnet101_stage_1.txt

CUDA_VISIBLE_DEVICES=0 python3 fsod_train_net.py --num-gpus 1 --dist-url auto \
    --config-file configs/fsod/meta_training_pascalvoc_split2_resnet101_stage_2.yaml 2>&1 | tee log/meta_training_pascalvoc_split2_resnet101_stage_2.txt

GuangxingHan commented 2 years ago

Can you double-check whether the trained model from the second step has the newly learned parameters of the GCN module here?

gcn_model.gcn_layer.norm.{weight, bias}
gcn_model.gcn_layer.graph_conv.weight

Ideally, the second step should invoke the QA-FewDet module, which defines the FSOD model with the GCN layer here. The scripts work well on my machine. Could you debug the code and provide more information? Then let's see what happened there.

ZhiXianZ commented 2 years ago

Sorry, I don't know how to do parameter checking. Can you be more specific?

GuangxingHan commented 2 years ago

Sorry for the late reply; I have been very busy these days.

You can use the following Python commands to check whether the GCN layers are present.

# first cd to the directory of the model path
import torch
a = torch.load("model_final.pth")  # "model_final.pth" is the model file name
# a is a Python dictionary; a['model'] stores the model parameters
for key, value in a['model'].items():
    if 'gcn' in key:
        print(key, value.shape)

Hope this is helpful for you.

ZhiXianZ commented 2 years ago

Thanks for your wonderful reply. I have used the above code block to check the meta-training stage 1 and stage 2 models. There was no output for stage 1, and the following output for stage 2:

gcn_model.gcn_layer.graph_conv.weight torch.Size([2048, 2048, 1, 1])
gcn_model.gcn_layer.norm.weight torch.Size([2048, 7, 7])
gcn_model.gcn_layer.norm.bias torch.Size([2048, 7, 7])

What I'm wondering is why, in both stage 1 and stage 2, the inference for 1/2/3/5/10 shots is repeated a second time.

GuangxingHan commented 2 years ago

I see. That is weird. Does this happen only for VOC split 2, or also for the other splits?

Can you share with me the trained models after the two stages, so that I can study this problem carefully?

ZhiXianZ commented 2 years ago

Yes, it happened for each split. But model_final.pth is too big to upload.

GuangxingHan commented 2 years ago

I see. What I want to do is first evaluate the trained models in meta-testing-only mode to confirm the results, and also compare the parameters of each layer in the two models to see whether they are the same or not. Models with different parameters should produce different results. Could you conduct these experiments? It would be interesting to see the results.
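
For reference, the parameter comparison could be as simple as the following sketch (the checkpoint paths below are placeholders for your two output directories):

import torch

# load both checkpoints on CPU; the paths below are placeholders
stage1 = torch.load("stage_1/model_final.pth", map_location="cpu")["model"]
stage2 = torch.load("stage_2/model_final.pth", map_location="cpu")["model"]

# report shared parameters whose values differ between the two models
for key in sorted(set(stage1) & set(stage2)):
    if not torch.equal(stage1[key], stage2[key]):
        print("differs:", key)

# parameters present in only one model (e.g. the GCN layers added in stage 2)
print("only in stage 2:", sorted(set(stage2) - set(stage1)))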

ZhiXianZ commented 2 years ago

OK, but the inference runs twice during the baseline model training stage and gets the same results. It runs the inference twice with the same model, so this is not about comparing the parameters of two different models. The log.txt I uploaded above only includes the baseline model training (meta-training stage 1).

GuangxingHan commented 2 years ago

Sorry, I totally misunderstood your question.

This sounds like a minor issue. I do not know why it happens, but it might be related to the version of Detectron2 or PyTorch.

One way to avoid this problem could be to simply return empty results from the test function here during meta-training, and then evaluate the trained model in meta-testing mode after meta-training.
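
As a rough illustration, a minimal sketch of that change, assuming the script uses a Detectron2-style trainer (the actual trainer class in fsod_train_net.py may differ):

from detectron2.engine import DefaultTrainer

class Trainer(DefaultTrainer):
    # sketch only: returning an empty dict here skips in-training evaluation,
    # so meta-testing can be run once, separately, after meta-training
    @classmethod
    def test(cls, cfg, model, evaluators=None):
        return {}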