facebookresearch/mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

Unexpected Performance of KRISP #988

Closed Fen9 closed 3 years ago

Fen9 commented 3 years ago

❓ Questions and Help

Dear authors,

Thank you for sharing the code; I really enjoy it. I tried to run the KRISP code following all the provided instructions, but the performance does not match the number claimed in the paper. The accuracy I got is around 31.73%, which seems to be the performance of KRISP without the MMBERT pre-training.

Could you share how to incorporate the MMBERT pre-training into the full model? I really appreciate your help.

KMarino commented 3 years ago

Do you mean the result with VQA pretraining? Yes, for that number you have to train on VQA first; that config file is here: https://github.com/facebookresearch/mmf/blob/master/projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml
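
For reference, a minimal sketch of that pretraining launch, assuming MMF is installed from a local clone, the config path is given relative to the checkout, and `env.save_dir` is where you want the pretrained checkpoint to land (all paths here are illustrative):

```bash
# Sketch: VQA2 pretraining stage for KRISP (paths are assumptions; adjust to your setup).
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 \
    model=krisp \
    env.save_dir=./save/krisp_vqa2_pretrain  # assumed output directory for the checkpoint
```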

KMarino commented 3 years ago

Also, make sure you actually run your final model through evaluation and run the resulting predictions file through a VQA eval. Because of batching, the eval reported at the end of training is not necessarily the correct number, since it can cut off the end of the eval set.
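
As a rough sketch of that evaluation step, assuming `mmf_predict` is available in your MMF install; the config path, dataset key, and checkpoint path below are illustrative placeholders, not taken from the KRISP README:

```bash
# Sketch: dump predictions from the final checkpoint, then score them with the
# dataset's official evaluation instead of trusting the in-training eval number.
mmf_predict config=projects/krisp/configs/krisp/okvqa/krisp.yaml \
    dataset=okvqa \
    model=krisp \
    run_type=test \
    checkpoint.resume_file=./save/krisp_okvqa/best.ckpt   # placeholder checkpoint path
# Feed the resulting report file to the official VQA/OK-VQA evaluation script.
```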

Fen9 commented 3 years ago

Thank you for the reply. Yes, I am referring to the pre-training of MMBERT on VQA 2.0. If I understand correctly, I first need to use the config you mentioned to train MMBERT, and then, after that training is completed, use the config provided in the README?

KMarino commented 3 years ago

Right
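
Putting the two stages together, a hedged sketch of the full pipeline might look like the following; the fine-tuning config path is illustrative (use the one from the KRISP README), and the `checkpoint.*` key names are assumed from MMF's checkpoint options:

```bash
# Stage 1 (sketch): VQA2 pretraining, as in the command above, saving to ./save/krisp_vqa2_pretrain.

# Stage 2 (sketch): fine-tune with the README config, initializing from the pretrained weights.
mmf_run config=projects/krisp/configs/krisp/okvqa/krisp.yaml \
    dataset=okvqa \
    model=krisp \
    checkpoint.resume_file=./save/krisp_vqa2_pretrain/best.ckpt \
    checkpoint.resume_pretrained=True \
    env.save_dir=./save/krisp_okvqa
```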

Fen9 commented 3 years ago

Thank you very much, I appreciate it. I will come back and close the issue once I get the claimed numbers.

Fen9 commented 3 years ago

By the way, does the pre-training of MMBERT use the knowledge graph? It looks like it does in the config file. Just want to confirm, thanks.

Fen9 commented 3 years ago

Just to confirm, the command to run the pre-training is `mmf_run config=mmf/projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml dataset=vqa2 model=krisp`? It gets stuck at "Loading Model" for half an hour and then outputs the following error:

```
Traceback (most recent call last):
  File "/home/ubuntu/krisp/venv/bin/mmf_run", line 33, in <module>
    sys.exit(load_entry_point('mmf', 'console_scripts', 'mmf_run')())
  File "/home/ubuntu/krisp/mmf/mmf_cli/run.py", line 129, in run
    nprocs=config.distributed.world_size,
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
    return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
    while not context.join():
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 150, in join
    raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
    fn(i, *args)
  File "/home/ubuntu/krisp/mmf/mmf_cli/run.py", line 66, in distributed_main
    main(configuration, init_distributed=True, predict=predict)
  File "/home/ubuntu/krisp/mmf/mmf_cli/run.py", line 56, in main
    trainer.train()
  File "/home/ubuntu/krisp/mmf/mmf/trainers/mmf_trainer.py", line 142, in train
    self.training_loop()
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 33, in training_loop
    self.run_training_epoch()
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 91, in run_training_epoch
    report = self.run_training_batch(batch, num_batches_for_this_update)
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 166, in run_training_batch
    report = self._forward(batch)
  File "/home/ubuntu/krisp/mmf/mmf/trainers/core/training_loop.py", line 178, in _forward
    model_output = self.model(prepared_batch)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 799, in forward
    output = self.module(*inputs[0], **kwargs[0])
  File "/home/ubuntu/krisp/mmf/mmf/models/base_model.py", line 273, in __call__
    model_output = super().__call__(sample_list, *args, **kwargs)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/krisp/mmf/mmf/models/krisp.py", line 183, in forward
    graph_output = self.graph_module(sample_list)
  File "/home/ubuntu/krisp/venv/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/ubuntu/krisp/mmf/projects/krisp/graphnetwork_module.py", line 956, in forward
    qids = sample_list["id"]
KeyError: 'id'

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "/usr/lib/python3.6/multiprocessing/spawn.py", line 115, in _main
    self = reduction.pickle.load(from_parent)
EOFError: Ran out of input
```

Do you happen to have any insight into this? Thanks.

KMarino commented 3 years ago

Apologies if I'm late in answering you; I'm actually defending my thesis this week.

Yes, so the first thing, if you aren't doing this already, is to run with a batch size of one on CPU only, put a breakpoint right before the error, and investigate it.

I'm not totally sure why you're getting that particular error either; maybe that field should be called "question_id" rather than "id", but I did run this without errors before, so I'm not sure why it's breaking here.

Multiprocessing does sometimes make the errors unhelpful or nonsensical, so I would run it the way I described above; that should hopefully make things clearer.
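
A hedged sketch of that debugging setup: hiding all GPUs should force a single-process CPU run (so the real traceback isn't wrapped by torch.multiprocessing), and a `pdb.set_trace()` placed just before the failing line lets you inspect `sample_list`. The config path is the same assumption as above, and the overrides are standard MMF training options:

```bash
# Sketch: single-process CPU run for debugging (assumes MMF falls back to CPU when no GPU is visible).
export CUDA_VISIBLE_DEVICES=""          # hide all GPUs so no distributed spawn happens
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp \
    training.batch_size=1 \
    training.num_workers=0              # keep data loading in the main process so pdb works
# Then add `import pdb; pdb.set_trace()` right before line 956 of
# projects/krisp/graphnetwork_module.py and inspect sample_list.keys().
```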

apsdehal commented 3 years ago

Generally, this can happen for a lot of reasons related to memory. I would suggest running on a single GPU by doing `export CUDA_VISIBLE_DEVICES=0` before running the job and adding `training.num_workers=0` at the end of the command to debug the error.
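
In command form, that suggestion looks roughly like this (same assumed config path as above):

```bash
# Sketch: restrict the job to one GPU and disable dataloader workers to surface the real error.
export CUDA_VISIBLE_DEVICES=0
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp \
    training.num_workers=0
```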

Fen9 commented 3 years ago

@apsdehal, I've tried your suggestion; however, the same error appears.

KMarino commented 3 years ago

Can you provide more debug information? Did you set a `pdb.set_trace()` and look at the contents of that dictionary as I suggested above?

Fen9 commented 3 years ago

I changed "id" to "question_id" on line 956 (i.e., `qids = sample_list["question_id"]`), and it works now. But the estimated training time is 500 hours. Is that correct? One more question: could you upload the trained parameters for the full model? Thanks.

KMarino commented 3 years ago

I think that's because the evaluation interval is set too low, so evaluation runs too frequently. Try `training.evaluation_interval=88000`; that should end up at a reasonable training time.
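
Concretely, that is just one more override on the training command (the rest of the command is the same sketch as before):

```bash
# Sketch: evaluate only every 88000 updates so validation passes stop dominating wall-clock time.
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp \
    training.evaluation_interval=88000
```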

ChanningPing commented 3 years ago

I'm also running visual_bert with VQA 2.0 pretraining, and the ETA is very long. Will the checkpoint of visual_bert pretrained on VQA 2.0 be released?

KMarino commented 3 years ago

Try the evaluation_interval trick from above.

I am working on releasing checkpoints; I just have quite a lot on my plate this week.

Fen9 commented 3 years ago

Thanks @KMarino. The training time is now down to 90h. Looking forward to the checkpoints from you.

KMarino commented 3 years ago

Will do my best to get those out as soon as I can.

Maybe @apsdehal has more suggestions about reducing your training time. Mine was FAR lower than that. Make sure you are using all your CPU compute. Also, hopefully you removed training.num_workers=0.
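
For the full training run, a hedged sketch with the debug-only overrides dropped again (the worker count is an illustrative value; tune it to the machine's CPU cores):

```bash
# Sketch: full run with dataloader workers re-enabled and the less frequent eval interval.
unset CUDA_VISIBLE_DEVICES                # use all available GPUs again
mmf_run config=projects/krisp/configs/krisp/vqa2/krisp_pretrain.yaml \
    dataset=vqa2 model=krisp \
    training.num_workers=4 \
    training.evaluation_interval=88000
```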

KMarino commented 3 years ago

Also, the estimated time is sometimes wrong for the first few iterations, so it might actually be something more reasonable.

Fen9 commented 3 years ago

@KMarino Thanks for the reply. May I ask what command you used to pre-train? I tried setting training.num_workers to a larger value, but it didn't help; it is still very slow, and the CPU usage is basically the same as with the default parameters.

Fen9 commented 3 years ago

@KMarino, thank you for your replies. I've gone through the training and got a number close to the one reported in the paper. I am closing this issue.