MILVLG / openvqa

A lightweight, scalable, and general framework for visual question answering research
Apache License 2.0

Invalid gradient error when training on GQA with mcan_small #74

Closed jeasinema closed 2 years ago

jeasinema commented 3 years ago

Hi,

Thanks for this great project and the detailed doc. I really appreciate it.

I was trying to run some experiments on GQA with the default mcan_small model. I followed the instructions to prepare the GQA data and everything seemed to be working. However, when I launched training with the following command:

python3 run.py --RUN='train' --MODEL='mcan_small' --DATASET='gqa'

I got the following error:

[early log is omitted]
Loading validation set for per-epoch evaluation........
 ========== Dataset size: 12578
 ========== Question token vocab size: 2933
Max token length: 29 Trimmed to: 29
 ========== Answer token vocab size: 1843
Finished!

Initializing log file........
Finished!

[W pthreadpool-cpp.cc:90] Warning: Leaking Caffe2 thread-pool after fork. (function pthreadpool)
[the warning above is repeated 8 times]
Traceback (most recent call last):
  File "run.py", line 160, in <module>
    execution.run(__C.RUN_MODE)
  File "/local/xiaojianm/workspace/openvqa/utils/exec.py", line 33, in run
    train_engine(self.__C, self.dataset, self.dataset_eval)
  File "/local/xiaojianm/workspace/openvqa/utils/train_engine.py", line 192, in train_engine
    loss.backward()
  File "/local/xiaojianm/anaconda3/envs/default/lib/python3.8/site-packages/torch/_tensor.py", line 255, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/local/xiaojianm/anaconda3/envs/default/lib/python3.8/site-packages/torch/autograd/__init__.py", line 147, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Function MmBackward returned an invalid gradient at index 1 - got [6, 2048] but expected shape compatible with [5, 2048]

I'm wondering how this issue can happen. Do you have any suggestions for debugging it?

I'm using torch==1.9.0 with spacy==2.3.7. Please feel free to ping me directly in this thread if more information is needed.
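
For context on what the engine is complaining about: during backward, every grad_fn must return gradients whose shapes match the input shapes recorded at forward time, and autograd raises exactly this kind of RuntimeError when they don't. Below is a toy, self-contained snippet (purely illustrative, not openvqa code; the BadMatmul name is made up) that provokes the same class of message by returning a deliberately mis-shaped gradient:

import torch

class BadMatmul(torch.autograd.Function):
    # A deliberately broken matmul: the gradient for w (input index 1)
    # gets one extra row, so autograd's shape validation rejects it.
    @staticmethod
    def forward(ctx, x, w):
        ctx.save_for_backward(x, w)
        return x.mm(w)

    @staticmethod
    def backward(ctx, grad_out):
        x, w = ctx.saved_tensors
        grad_x = grad_out.mm(w.t())   # correct: same shape as x
        grad_w = x.t().mm(grad_out)   # correct: same shape as w, i.e. [5, 2048]
        bad_grad_w = torch.cat([grad_w, grad_w[:1]], dim=0)  # wrong: [6, 2048]
        return grad_x, bad_grad_w

x = torch.randn(4, 5)
w = torch.randn(5, 2048, requires_grad=True)
BadMatmul.apply(x, w).sum().backward()
# RuntimeError: Function BadMatmulBackward returned an invalid gradient at
# index 1 - got [6, 2048] but expected shape compatible with [5, 2048]

In my real run the failing function is the built-in MmBackward, which should not return a wrong shape by itself, so my guess is that some tensor feeding the matmul changes shape between forward and backward (e.g. via an in-place op).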

jeasinema commented 3 years ago

[Update]

I got it running by going back to commit 1180a59 with spacy==2.1.0.
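
For anyone who wants to reproduce that workaround, the concrete steps were (assuming a pip-managed environment):

git checkout 1180a59
pip install spacy==2.1.0

followed by the same training command as in the original post.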

jeasinema commented 3 years ago

[Update 2]

I found that the issue seems to originate from here. The GQA experiment runs fine after commenting that line out.
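
For anyone trying to track down a similar backward failure to a single line: PyTorch's anomaly detection re-raises the error together with the forward-pass traceback of the operation whose backward failed, which is how an offending line like this can be localized. A minimal sketch (model and batch are placeholders for the usual training step, not actual names from the repo):

import torch

# Wrap one training step; if backward fails, PyTorch additionally
# prints the traceback of the forward op that produced the bad gradient.
with torch.autograd.detect_anomaly():
    loss = model(batch)    # placeholder forward pass
    loss.backward()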

MIL-VLG commented 2 years ago

Thanks! We have fixed this error.