facebookresearch / vilbert-multi-task

Multi Task Vision and Language
MIT License

Bug in ForwardModelsVal #69

Open zongshenmu opened 3 years ago

zongshenmu commented 3 years ago

Why do the input and target sizes passed to the cross-entropy loss not match?

Traceback (most recent call last):
  File "train_tasks.py", line 679, in <module>
    main()
  File "train_tasks.py", line 604, in main
    tbLogger,
  File "train_tasks.py", line 662, in evaluate
    args, task_cfg, device, task_id, batch, model, task_losses
  File "/data1/mzs/Code/vilbert-multi-task/vilbert/task_utils.py", line 155, in ForwardModelsVal
    loss = task_losses[task_id](vil_binary_prediction, target)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
    reduction=self.reduction)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([6, 2])) must be the same as input size (torch.Size([12, 2]))
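
For reference, the error itself is just PyTorch's shape check: binary_cross_entropy_with_logits requires input and target to have identical shapes. A minimal standalone snippet (independent of ViLBERT) reproduces the same ValueError:

import torch
import torch.nn.functional as F

# binary_cross_entropy_with_logits requires input and target to have the
# same shape; a mismatched batch dimension raises the ValueError above.
logits = torch.randn(12, 2)   # stands in for vil_binary_prediction
target = torch.rand(6, 2)     # stands in for the NLVR2 targets

F.binary_cross_entropy_with_logits(logits, target)
# ValueError: Target size (torch.Size([6, 2])) must be the same as input size (torch.Size([12, 2]))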
vedanuj commented 3 years ago

Which task are you running on? Also please share your training command.

zongshenmu commented 3 years ago

I pulled your code from the repository and only changed the training batch size in vilbert_tasks.yml. I ran Multi-task Training as described in the README.md, but it fails on the NLVR dataset. After debugging, the problem shows up during evaluation: it always fails on the last iteration of the validation loop. When that batch is fed to the model, the batch sizes of the target and of vil_binary_prediction do not match, so the task loss, binary_cross_entropy_with_logits, cannot be computed.

vil_binary_prediction torch.Size([12, 2]) target torch.Size([6, 2])
Traceback (most recent call last):
  File "train_tasks.py", line 682, in <module>
    main()
  File "train_tasks.py", line 605, in main
    tbLogger,
  File "train_tasks.py", line 665, in evaluate
    args, task_cfg, device, task_id, batch, model, task_losses
  File "/data1/mzs/Code/vilbert-multi-task/vilbert/task_utils.py", line 157, in ForwardModelsVal
    loss = task_losses[task_id](vil_binary_prediction, target)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/modules/loss.py", line 601, in forward
    reduction=self.reduction)
  File "/data0/mzs/anaconda3/envs/vilbert-mt/lib/python3.6/site-packages/torch/nn/functional.py", line 2124, in binary_cross_entropy_with_logits
    raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
ValueError: Target size (torch.Size([6, 2])) must be the same as input size (torch.Size([12, 2]))

I also suspect that lines 1688-1691 of vilbert.py do not handle this case: when the pooled_output batch size is odd, the branch below is never taken.

if pooled_output.size(0) % 2 == 0:
    vil_binary_prediction = self.vil_binary_prediction(
        pooled_output.view(-1, pooled_output.size(1) * 2)
    )
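
One possible workaround (only a sketch, not the repository's actual loader code, which is constructed in vilbert/task_utils.py) is to drop the incomplete final validation batch so the NLVR2 pairing logic always sees an even pooled_output batch:

from torch.utils.data import DataLoader

# Hypothetical sketch: nlvr2_val_dataset and eval_batch_size stand in for
# whatever the repository actually builds. drop_last=True discards the
# final partial batch instead of feeding it to the model, so the pairing
# in vilbert.py (which assumes an even batch) is never violated.
val_loader = DataLoader(
    nlvr2_val_dataset,
    batch_size=eval_batch_size,
    shuffle=False,
    drop_last=True,
)

The trade-off is that a handful of validation examples are skipped each epoch.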
zongshenmu commented 3 years ago

Your code in eval_tasks.py for retrieval_datasets.py is also wrong: the dataloader cannot infer the correct batch size and number of options:

    elif task_cfg[task_id]["process"] in ["retrieval"]:
        max_num_bbox = features.size(1)
        num_options = question.size(1)

        features = features.view(-1, features.size(2), features.size(3))
        spatials = spatials.view(-1, spatials.size(2), spatials.size(3))
        image_mask = image_mask.view(-1, image_mask.size(2))
        question = question.view(-1, question.size(2))
        input_mask = input_mask.view(-1, input_mask.size(2))
        segment_ids = segment_ids.view(-1, segment_ids.size(2))
        co_attention_mask = co_attention_mask.view(
            -1, co_attention_mask.size(2), co_attention_mask.size(3)
        )

Can you provide your retrieval evaluation metric?
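
For context, the flatten-and-restore pattern that the retrieval branch appears to intend looks roughly like this; a minimal sketch assuming inputs shaped [batch, num_options, ...] and a model that returns one score per flattened row (the names here are illustrative, not the repository's API):

import torch

def score_retrieval_options(model, features, question, num_options):
    # Hypothetical sketch: flatten the options dimension, score every
    # candidate, then restore [batch, num_options] so candidates can be
    # ranked per example (e.g. for recall@k).
    batch_size = features.size(0)

    # [batch, num_options, boxes, dim] -> [batch * num_options, boxes, dim]
    features = features.view(-1, features.size(2), features.size(3))
    # [batch, num_options, seq_len] -> [batch * num_options, seq_len]
    question = question.view(-1, question.size(2))

    scores = model(features, question)             # [batch * num_options, 1]
    scores = scores.view(batch_size, num_options)  # one score row per example

    ranked = scores.argsort(dim=1, descending=True)
    return scores, ranked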

ZhiyuanChen commented 3 years ago

The code quality of this repo is surprisingly low. I can't believe this is Facebook engineering. It would take much less effort to rewrite it than to debug it.

chen398936790 commented 3 years ago

Dear all,

I ran into the same problem when I tried to run multi-task training with the command in the README.md.

The training had already run for hours, and I could even see a validation on refcocog complete at iter 513, but the code eventually returned the same ValueError at iter 661.

Has anyone found a solution to this, or any ideas for fixing it?

Thank you!

enaserianhanzaei commented 3 years ago

I have been trying to run this code for three weeks; it's unbelievable how full of bugs it is. I'm starting to wonder whether the results they reported are actually true.

enaserianhanzaei commented 3 years ago

@zongshenmu @vedanuj @chen398936790 @ZhiyuanChen

I wrote a step-by-step tutorial on how to set up the environment and train and test this model. I also added a section on extracting visiolinguistic embeddings from image-text data: https://naserian-elahe.medium.com/vilbert-a-model-for-learning-joint-representations-of-image-content-and-natural-language-47f56a313a79 I would very much appreciate any comments or suggestions.