Do you mean the result with VQA pretraining? Yeah, for that number you have to train on VQA first, that config file is here: https://github.com/facebookresearch/mmf/blob/master/projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml
Also, make sure you actually run your final model through evaluation and run the result file through a VQA eval. Because of batching, the eval at the end of training is not necessarily the correct number, since it can cut off the end of the eval set.
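For concreteness, a rough sketch of those two steps might look like the following. The pre-training command is confirmed later in this thread; `run_type=val` and `checkpoint.resume_file` are standard MMF options, but the final-model config, dataset key, and checkpoint path below are placeholders for your own setup, not values from the authors:

```bash
# 1) Pre-train on VQA2 with the config linked above
#    (adjust the path to match your checkout).
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp

# 2) After training the final model, re-run evaluation from its saved checkpoint
#    so the full eval set is scored (the in-training eval can cut off the tail of
#    the set when batching). Config, dataset, and checkpoint paths are placeholders.
mmf_run config=<final_model_config>.yaml dataset=<final_dataset> model=krisp \
    run_type=val checkpoint.resume_file=./save/<your_experiment>/best.ckpt
```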
Thank you for the reply. Yes, I am referring to the pre-training of the MMBERT on VQA 2.0. If I am understanding it correctly, I first need to use the config you mentioned to train the MMBERT, and then, after this training is completed, I use the config that is provided in the README?
Right
Thank you very much. I appreciate it. I will come back and close the issue once I get the claimed numbers.
BTW, does the pre-training of MMBERT use the knowledge graph? It looks like it does in the config file. Just want to confirm, thanks.
Just to confirm, the command to run the pre-training is "mmf_run config=mmf/projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml dataset=vqa2 model=krisp"? It gets stuck at "Loading Model" for half an hour and then outputs the following error:
```
Traceback (most recent call last):
  File "/home/ubuntu/krisp/venv/bin/mmf_run", line 33, in

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ubuntu/krisp/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/ubuntu/krisp/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/home/ubuntu/krisp/mmf/mmf/trainers/mmf_trainer.py", line 142, in train
    self.training_loop()
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 33, in training_loop
    self.run_training_epoch()
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 91, in run_training_epoch
    report = self.run_training_batch(batch, num_batches_for_this_update)
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 166, in run_training_batch
    report = self._forward(batch)
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 178, in _forward
    model_output = self.model(prepared_batch)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ubuntu/krisp/mmf/mmf/models/base_model.py", line 273, in __call__
    model_output = super().__call__(sample_list, *args, **kwargs)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/krisp/mmf/mmf/models/krisp.py", line 183, in forward
    graph_output = self.graph_module(sample_list)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/krisp/mmf/projects/krisp/graphnetwork_module.py", line 956, in forward
    qids = sample_list["id"]
KeyError: 'id'
```
Do you happen to have any idea what might be causing it? Thanks.
Apologies if I'm late on answering you, I'm actually defending my thesis this week.
Yes, so the first thing, if you aren't doing this already, is to run with a batch size of one on CPU only, put a breakpoint right before the error, and try to investigate it.
I'm not totally sure why you're getting that particular error either; maybe that field should be called "question_id" rather than "id" but I did run this without errors before, so not sure why it's bugging out here.
Multi-processing does sometimes make the errors unhelpful or nonsensical, so I would run it the way I said above, and that should hopefully make things clearer.
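A minimal sketch of what such a debugging run could look like, assuming training.batch_size and training.num_workers are the standard MMF overrides and adjusting paths to your checkout:

```bash
# Hide all GPUs so the run stays single-process on CPU (no multiprocessing spawn).
export CUDA_VISIBLE_DEVICES=""

# Assumed overrides: training.batch_size=1 for a single-sample batch,
# training.num_workers=0 to keep data loading in the main process.
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp \
    training.batch_size=1 training.num_workers=0

# Then drop "import pdb; pdb.set_trace()" just above the failing line
# (qids = sample_list["id"] in projects/krisp/graphnetwork_module.py)
# and inspect sample_list.keys().
```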
Generally, this can happen for a number of reasons related to memory. I would suggest running on a single GPU by doing export CUDA_VISIBLE_DEVICES=0 before running the job, and adding training.num_workers=0 at the end of the command, to debug the error.
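Put together, that suggestion would look roughly like this (config path assumed to match the command earlier in the thread):

```bash
# Restrict the job to a single GPU and disable dataloader workers while debugging.
export CUDA_VISIBLE_DEVICES=0
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp training.num_workers=0
```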
@apsdehal, I've tried your suggestion. However, the same bug appears.
Can you provide more debug information? Did you set a pdb set_trace and look at the contents of that dictionary like I suggested above?
I changed "id" to "question_id" on line 956, and it works now. But the estimated training time is 500h. Is that correct? One more question: could you upload the trained parameters for the full model? Thanks.
I think that's because evaluation is set to run too frequently. Try training.evaluation_interval=88000. That should end up at a reasonable training time.
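For reference, that override just gets appended to the pre-training command from earlier in the thread (config path assumed to match your checkout):

```bash
# Evaluate only every 88,000 updates instead of the config default,
# which avoids spending most of the wall-clock time on validation runs.
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp \
    training.evaluation_interval=88000
```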
I'm also running the visual_bert with VQA 2.0 pretraining, and the ETA is very long. Will the checkpoint of the pretrained visual_bert with VQA-2.0 be released?
Try the evaluation_interval trick from above
I am working on releasing checkpoints, I just have quite a lot on my plate this week
Thanks @KMarino. The training time is now down to 90h. Looking forward to the checkpoints from you.
Will do my best to get those out as soon as I can.
Maybe @apsdehal has more suggestions about reducing your training time. Mine was FAR lower than that. Make sure you are using all your CPU compute. Also, hopefully you removed training.num_workers=0.
Also, the estimated time is sometimes wrong for the first few iterations, so it might actually be something more reasonable.
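As a sketch, once you are past debugging, you could restore parallel data loading with something like the following; the worker count is only an illustrative value, not a setting from the authors:

```bash
# Restore parallel data loading; pick num_workers to roughly match your CPU core count.
# The value 8 below is just an example.
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp \
    training.evaluation_interval=88000 training.num_workers=8
```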
@KMarino Thanks for the reply. May I ask what command you used to pre-train? I tried setting training.num_workers to a larger number, but it didn't help; it is still very slow, and CPU usage is basically the same as with the default parameters.
@KMarino, thank you for your replies. I've gone through the training and got a number close to the one reported in the paper. I am closing this issue.
❓ Questions and Help
Dear authors,
Thank you for sharing the code; I really enjoy it. I tried to run the KRISP code following all the provided instructions, but I found that the performance does not match the number claimed in the paper. The accuracy I got is around 31.73%, and it seems to be the performance of KRISP without the MMBERT pre-training.
Would you mind sharing how to incorporate the MMBERT pre-training into the entire model? I really appreciate your help.