facebookresearch / mmf

A modular framework for vision & language multimodal research from Facebook AI Research (FAIR)
https://mmf.sh/

[Hateful memes challenge] ValueError when training mmbt model #857

Closed dinhanhx closed 3 years ago

dinhanhx commented 3 years ago

❓ Questions and Help

Hello everyone. I am running this on Google Colab (as of April 2021), where the Python version is 3.7.10.

I set up things as follows:

!pip install git+https://github.com/facebookresearch/mmf.git
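
As a quick sanity check that the install worked before moving on (a minimal sketch; the torch version is simply whatever pip resolved during the install above):

import torch
import mmf  # just confirming the package imports after the pip install

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())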

Then I downloaded and converted the dataset as follows:

!curl -o "/content/hm.zip" "$url" -H 'Referer: https://www.drivendata.org/competitions/64/hateful-memes/data/' --compressed
!mmf_convert_hm --zip_file "/content/hm.zip" --password $password --bypass_checksum=1
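
To double-check that the conversion put the data where MMF expects it, I listed the dataset directory (a sketch under the assumption that mmf_convert_hm extracts into MMF's default data root, ~/.cache/torch/mmf/data, unless MMF_DATA_DIR points elsewhere):

import os

# Assumed default MMF data root; adjust if MMF_DATA_DIR is set to something else.
data_root = os.environ.get("MMF_DATA_DIR", os.path.expanduser("~/.cache/torch/mmf/data"))
hm_dir = os.path.join(data_root, "datasets", "hateful_memes")

# Print the first couple of directory levels and how many files each contains.
for root, dirs, files in os.walk(hm_dir):
    depth = root[len(hm_dir):].count(os.sep)
    if depth <= 2:
        print(root, "-", len(files), "files")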

Then I tried to train a model

!mmf_run config=projects/hateful_memes/configs/mmbt/defaults.yaml model=mmbt dataset=hateful_memes run_type=train_val

Starting from update 500/22000, I noticed that cross_entropy is nan (the running average already turns nan at 300/22000):

2021-04-05T09:59:09 | mmf.utils.general: Total Parameters: 169793346. Trained Parameters: 169793346
2021-04-05T09:59:09 | mmf.trainers.core.training_loop: Starting training...
2021-04-05T10:00:41 | mmf.trainers.callbacks.logistics: progress: 100/22000, train/hateful_memes/cross_entropy: 0.6800, train/hateful_memes/cross_entropy/avg: 0.6800, train/total_loss: 0.6800, train/total_loss/avg: 0.6800, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 100, iterations: 100, max_updates: 22000, lr: 0., ups: 1.09, time: 01m 32s 478ms, time_since_start: 01m 32s 545ms, eta: 05h 43m 17s 017ms
2021-04-05T10:02:11 | mmf.trainers.callbacks.logistics: progress: 200/22000, train/hateful_memes/cross_entropy: 0.6616, train/hateful_memes/cross_entropy/avg: 0.6708, train/total_loss: 0.6616, train/total_loss/avg: 0.6708, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 200, iterations: 200, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 941ms, time_since_start: 03m 02s 486ms, eta: 05h 32m 20s 526ms
2021-04-05T10:03:43 | mmf.trainers.callbacks.logistics: progress: 300/22000, train/hateful_memes/cross_entropy: 0.6800, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: 0.6800, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 300, iterations: 300, max_updates: 22000, lr: 0., ups: 1.10, time: 01m 31s 964ms, time_since_start: 04m 34s 451ms, eta: 05h 38m 15s 542ms
2021-04-05T10:05:13 | mmf.trainers.callbacks.logistics: progress: 400/22000, train/hateful_memes/cross_entropy: 0.6800, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: 0.6800, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 400, iterations: 400, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 758ms, time_since_start: 06m 04s 209ms, eta: 05h 28m 37s 422ms
2021-04-05T10:06:43 | mmf.trainers.callbacks.logistics: progress: 500/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 500, iterations: 500, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 806ms, time_since_start: 07m 34s 016ms, eta: 05h 27m 16s 697ms
2021-04-05T10:08:15 | mmf.trainers.callbacks.logistics: progress: 600/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 600, iterations: 600, max_updates: 22000, lr: 0., ups: 1.10, time: 01m 31s 754ms, time_since_start: 09m 05s 770ms, eta: 05h 32m 49s 189ms
2021-04-05T10:09:44 | mmf.trainers.callbacks.logistics: progress: 700/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 700, iterations: 700, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 800ms, time_since_start: 10m 35s 571ms, eta: 05h 24m 12s 705ms
2021-04-05T10:11:16 | mmf.trainers.callbacks.logistics: progress: 800/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 800, iterations: 800, max_updates: 22000, lr: 0., ups: 1.10, time: 01m 31s 414ms, time_since_start: 12m 06s 986ms, eta: 05h 28m 29s 372ms
2021-04-05T10:12:46 | mmf.trainers.callbacks.logistics: progress: 900/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 900, iterations: 900, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 923ms, time_since_start: 13m 36s 909ms, eta: 05h 21m 36s 350ms
2021-04-05T10:14:16 | mmf.trainers.callbacks.checkpoint: Checkpoint time. Saving a checkpoint.
2021-04-05T10:14:16 | mmf.utils.checkpoint: Checkpoint save operation started!
WARNING 2021-04-05T10:14:16 | py.warnings: /usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)

WARNING 2021-04-05T10:14:16 | py.warnings: /usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler.
  warnings.warn(SAVE_STATE_WARNING, UserWarning)

2021-04-05T10:15:30 | mmf.utils.checkpoint: Saving current checkpoint
2021-04-05T10:16:20 | mmf.utils.checkpoint: Checkpoint save operation finished!
2021-04-05T10:16:20 | mmf.trainers.callbacks.logistics: progress: 1000/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 1000, iterations: 1000, max_updates: 22000, lr: 0.00001, ups: 0.47, time: 03m 34s 323ms, time_since_start: 17m 11s 232ms, eta: 12h 42m 53s 051ms
2021-04-05T10:16:20 | mmf.trainers.core.training_loop: Evaluation time. Running on full validation set...

When it reached the "mmf.common.test_reporter: Predicting for hateful_memes" step, it raised this error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
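
If I read this right, the error is raised by sklearn's input validation once the model's NaN predictions reach the validation metrics (I assume the ROC-AUC metric in the hateful_memes config wraps sklearn.metrics.roc_auc_score; the snippet below is only a minimal reproduction of the same message, not MMF's actual code path):

import numpy as np
from sklearn.metrics import roc_auc_score

labels = np.array([0, 1, 0, 1])
# float32 scores containing a NaN, like the nan losses/logits in the log above
scores = np.array([0.2, np.nan, 0.7, 0.9], dtype=np.float32)

# Raises ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
roc_auc_score(labels, scores)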

How can I fix this?

Please help me. Thank you in advance.

vedanuj commented 3 years ago

I can reproduce the issue. We are checking it. Thanks for reporting.

vedanuj commented 3 years ago

PR #855 should fix this issue.