ItemZheng / KDDAug

[ECCV2022] Rethinking Data Augmentation for Robust Visual Question Answering

Error in 'KD-based Answer Assignment' #6

Open hegdekartik opened 1 year ago

hegdekartik commented 1 year ago


Thanks for the great work. I found it interesting and wanted to try it out, but we are getting errors in the 'KD-based Answer Assignment' step.

We get the following error when we run this command:

CUDA_VISIBLE_DEVICES=0 python --dataset v2 --mode q_v_debias --debias learned_mixin --topq 1 --topv -1 --qvp 5 --output lmh_css --seed 2048
Traceback (most recent call last):
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/KDDAug/", line 178, in <module>
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/KDDAug/", line 175, in main
    train(model, train_loader, eval_loader, args,qid2type)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/KDDAug/", line 219, in train
    word_grad = torch.autograd.grad((pred * (a > 0).float()).sum(), word_emb, create_graph=True)[0]
  File "/home

So we tried the other option, which is to use a pretrained teacher model (CSS) downloaded from CSS-VQA. Unfortunately, after downloading 'model.pth' and running the 'Assign new answer' command, we got the error below.

CUDA_VISIBLE_DEVICES=0 python --dataset v2 --name number --split high
100%|███████████████████████████████████████████████████| 443757/443757 [00:00<00:00, 946121.94it/s]
100%|███████████████████████████████████████████████████| 443757/443757 [00:02<00:00, 167279.98it/s]
Get language bias, which is an input of CSS teacher model.
loading dictionary from data/dictionary.pkl
tokenize: 100%|██████████████████████████████████████████| 443757/443757 [00:04<00:00, 97819.23it/s]
tensorize: 100%|████████████████████████████████████████| 443757/443757 [00:04<00:00, 109012.99it/s]
Load model from: ./logs/lmh_css/model.pth
Traceback (most recent call last):
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/KDDAug/", line 171, in <module>
  File "/home/kartik/.conda/envs/BLIP_env/lib/python3.10/site-packages/torch/nn/modules/", line 1667, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for BaseModel:
        size mismatch for classifier.main.3.bias: copying a param with shape torch.Size([2274]) from checkpoint, the shape in current model is torch.Size([2410]).
        size mismatch for classifier.main.3.weight_v: copying a param with shape torch.Size([2274, 2048]) from checkpoint, the shape in current model is torch.Size([2410, 2048]).

How can I get rid of this error?

Thank you

ItemZheng commented 1 year ago

For the first error, can you provide more of the error log? For the second error, you were missing the argument --teacher_path; the full command is CUDA_VISIBLE_DEVICES=0 python --dataset [cpv2/v2] --name number --split high --teacher_path [] mentioned in

hegdekartik commented 1 year ago

For the second error, --teacher_path is an optional argument, so we placed model.pth in the default location mentioned in the, which is './logs/lmh_css/model.pth'.

Could you please provide the correct link to the right model.pth for this step?
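
To narrow down a size mismatch like the one reported above, it can help to list every parameter whose shape differs between the checkpoint and the current model. The following is a minimal sketch, not code from this repository: the helper name `find_shape_mismatches` is hypothetical, and the shapes are copied by hand from the RuntimeError in this thread so that the snippet runs without PyTorch.

```python
from typing import Dict, List, Tuple

Shape = Tuple[int, ...]


def find_shape_mismatches(
    checkpoint_shapes: Dict[str, Shape],
    model_shapes: Dict[str, Shape],
) -> List[Tuple[str, Shape, Shape]]:
    """Return (name, checkpoint_shape, model_shape) for each shared
    parameter whose shapes differ, sorted by parameter name."""
    mismatches = []
    # Only parameters present on both sides can have a *size* mismatch.
    for name in sorted(checkpoint_shapes.keys() & model_shapes.keys()):
        if checkpoint_shapes[name] != model_shapes[name]:
            mismatches.append((name, checkpoint_shapes[name], model_shapes[name]))
    return mismatches


# Shapes copied from the error message in this thread (hypothetical helper input).
checkpoint_shapes = {
    "classifier.main.3.bias": (2274,),
    "classifier.main.3.weight_v": (2274, 2048),
}
model_shapes = {
    "classifier.main.3.bias": (2410,),
    "classifier.main.3.weight_v": (2410, 2048),
}

mismatches = find_shape_mismatches(checkpoint_shapes, model_shapes)
for name, ckpt_shape, model_shape in mismatches:
    print(f"{name}: checkpoint {ckpt_shape} vs model {model_shape}")
```

With PyTorch available, the two dictionaries could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}` from the loaded checkpoint and from `model.state_dict()`, assuming the checkpoint file stores a plain state dict. Here only the final classifier layer differs (2274 vs. 2410 output rows), which would be consistent with the checkpoint and the current model having been built with different numbers of candidate answers, for example because of a different --dataset setting. That is a guess from the shapes alone, not something confirmed in this thread.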

Error logs for the first error :

Building train dataset...
caching-features: 100%|████████████████████████████████████| 443757/443757 [38:56<00:00, 189.96it/s]
tokenize: 100%|█████████████████████████████████████████| 443757/443757 [00:03<00:00, 119740.31it/s]
tensorize: 100%|████████████████████████████████████████| 443757/443757 [00:04<00:00, 106497.75it/s]
Building test dataset...
caching-features: 100%|████████████████████████████████████| 214354/214354 [18:59<00:00, 188.16it/s]
tokenize: 100%|██████████████████████████████████████████| 214354/214354 [00:04<00:00, 48356.19it/s]
tensorize: 100%|████████████████████████████████████████| 214354/214354 [00:01<00:00, 109298.11it/s]
Starting training...
Epoch 1:   0%|                                                              | 0/867 [00:00<?, ?it/s]/home/kartik/.conda/envs/BLIP_env/lib/python3.10/site-packages/torch/nn/ UserWarning: nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.
  warnings.warn("nn.functional.sigmoid is deprecated. Use torch.sigmoid instead.")
Epoch 1:   0%|                                                              | 0/867 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/KDDAug/", line 178, in <module>
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/KDDAug/", line 175, in main
    train(model, train_loader, eval_loader, args,qid2type)
  File "/mnt/44b643af-38ed-4d24-abcc-00e81b36025c/kartik/KDDAug/", line 280, in train
    visual_grad = torch.autograd.grad((pred * (a > 0).float()).sum(), v, create_graph=True)[0]
  File "/home/kartik/.conda/envs/BLIP_env/lib/python3.10/site-packages/torch/autograd/", line 300, in grad
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512, 2048]], which is output 0 of ReluBackward0, is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

hegdekartik commented 1 year ago


I am still having this issue. Can you please check and help me resolve this issue? Thank you.