❓ Questions and Help
Closed. dinhanhx closed this issue 3 years ago.
Hello everyone. On Google Colab (as of April 2021), the Python version is 3.7.10.
I set up things as follows
!pip install git+https://github.com/facebookresearch/mmf.git
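For reference, a small cell like the one below records the exact runtime versions when filing a report like this. It is only a sketch and assumes PyTorch is already present in the Colab image (it was at the time):

```python
# Log the runtime details so the report can state exact versions.
import sys
import torch

print("Python:", sys.version)                      # e.g. 3.7.10 on Colab in April 2021
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```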
Then I downloaded and converted the dataset as follows
!curl -o "/content/hm.zip" "$url" -H 'Referer: https://www.drivendata.org/competitions/64/hateful-memes/data/' --compressed
!mmf_convert_hm --zip_file "/content/hm.zip" --password $password --bypass_checksum=1
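To double-check the conversion, a cell like the following lists what ended up in MMF's data folder. It assumes MMF's default data root of ~/.cache/torch/mmf; the exact layout below the dataset folder may differ between MMF versions or if the data directory has been overridden:

```python
# Quick check that mmf_convert_hm produced files where MMF expects them.
# Assumes the default data root of ~/.cache/torch/mmf (an assumption, not
# something stated in the thread).
import os

hm_dir = os.path.expanduser("~/.cache/torch/mmf/data/datasets/hateful_memes")

if not os.path.isdir(hm_dir):
    print("Dataset folder not found at", hm_dir)
else:
    for root, dirs, files in os.walk(hm_dir):
        depth = root[len(hm_dir):].count(os.sep)
        if depth <= 2:  # shallow listing so the image folders don't flood the output
            print(root, "->", len(files), "files")
```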
Then I tried to train a model
!mmf_run config=projects/hateful_memes/configs/mmbt/defaults.yaml model=mmbt dataset=hateful_memes run_type=train_val
Starting from 500/22000, I noticed that cross_entropy becomes nan.
2021-04-05T09:59:09 | mmf.utils.general: Total Parameters: 169793346. Trained Parameters: 169793346
2021-04-05T09:59:09 | mmf.trainers.core.training_loop: Starting training...
2021-04-05T10:00:41 | mmf.trainers.callbacks.logistics: progress: 100/22000, train/hateful_memes/cross_entropy: 0.6800, train/hateful_memes/cross_entropy/avg: 0.6800, train/total_loss: 0.6800, train/total_loss/avg: 0.6800, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 100, iterations: 100, max_updates: 22000, lr: 0., ups: 1.09, time: 01m 32s 478ms, time_since_start: 01m 32s 545ms, eta: 05h 43m 17s 017ms
2021-04-05T10:02:11 | mmf.trainers.callbacks.logistics: progress: 200/22000, train/hateful_memes/cross_entropy: 0.6616, train/hateful_memes/cross_entropy/avg: 0.6708, train/total_loss: 0.6616, train/total_loss/avg: 0.6708, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 200, iterations: 200, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 941ms, time_since_start: 03m 02s 486ms, eta: 05h 32m 20s 526ms
2021-04-05T10:03:43 | mmf.trainers.callbacks.logistics: progress: 300/22000, train/hateful_memes/cross_entropy: 0.6800, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: 0.6800, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 300, iterations: 300, max_updates: 22000, lr: 0., ups: 1.10, time: 01m 31s 964ms, time_since_start: 04m 34s 451ms, eta: 05h 38m 15s 542ms
2021-04-05T10:05:13 | mmf.trainers.callbacks.logistics: progress: 400/22000, train/hateful_memes/cross_entropy: 0.6800, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: 0.6800, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 400, iterations: 400, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 758ms, time_since_start: 06m 04s 209ms, eta: 05h 28m 37s 422ms
2021-04-05T10:06:43 | mmf.trainers.callbacks.logistics: progress: 500/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 500, iterations: 500, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 806ms, time_since_start: 07m 34s 016ms, eta: 05h 27m 16s 697ms
2021-04-05T10:08:15 | mmf.trainers.callbacks.logistics: progress: 600/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 600, iterations: 600, max_updates: 22000, lr: 0., ups: 1.10, time: 01m 31s 754ms, time_since_start: 09m 05s 770ms, eta: 05h 32m 49s 189ms
2021-04-05T10:09:44 | mmf.trainers.callbacks.logistics: progress: 700/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 700, iterations: 700, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 800ms, time_since_start: 10m 35s 571ms, eta: 05h 24m 12s 705ms
2021-04-05T10:11:16 | mmf.trainers.callbacks.logistics: progress: 800/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 800, iterations: 800, max_updates: 22000, lr: 0., ups: 1.10, time: 01m 31s 414ms, time_since_start: 12m 06s 986ms, eta: 05h 28m 29s 372ms
2021-04-05T10:12:46 | mmf.trainers.callbacks.logistics: progress: 900/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 900, iterations: 900, max_updates: 22000, lr: 0., ups: 1.12, time: 01m 29s 923ms, time_since_start: 13m 36s 909ms, eta: 05h 21m 36s 350ms
2021-04-05T10:14:16 | mmf.trainers.callbacks.checkpoint: Checkpoint time. Saving a checkpoint.
2021-04-05T10:14:16 | mmf.utils.checkpoint: Checkpoint save operation started!
WARNING 2021-04-05T10:14:16 | py.warnings: /usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler. warnings.warn(SAVE_STATE_WARNING, UserWarning)
WARNING 2021-04-05T10:14:16 | py.warnings: /usr/local/lib/python3.7/dist-packages/torch/optim/lr_scheduler.py:200: UserWarning: Please also save or load the state of the optimzer when saving or loading the scheduler. warnings.warn(SAVE_STATE_WARNING, UserWarning)
2021-04-05T10:15:30 | mmf.utils.checkpoint: Saving current checkpoint
2021-04-05T10:16:20 | mmf.utils.checkpoint: Checkpoint save operation finished!
2021-04-05T10:16:20 | mmf.trainers.callbacks.logistics: progress: 1000/22000, train/hateful_memes/cross_entropy: nan, train/hateful_memes/cross_entropy/avg: nan, train/total_loss: nan, train/total_loss/avg: nan, max mem: 11656.0, experiment: run, epoch: 1, num_updates: 1000, iterations: 1000, max_updates: 22000, lr: 0.00001, ups: 0.47, time: 03m 34s 323ms, time_since_start: 17m 11s 232ms, eta: 12h 42m 53s 051ms
2021-04-05T10:16:20 | mmf.trainers.core.training_loop: Evaluation time. Running on full validation set...
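For what it's worth, the point where the loss turns into nan can be caught early with a plain PyTorch check; the snippet below is a generic sketch (not MMF trainer code) of the kind of guard I have in mind:

```python
# Generic guard for a training loop: stop as soon as the loss stops being finite.
# Illustrative only; it is not wired into MMF's trainer.
import torch

def check_finite(loss: torch.Tensor, step: int) -> None:
    if not torch.isfinite(loss).all():
        raise RuntimeError(f"Non-finite loss {loss.item()} at update {step}")

# Example usage inside a hypothetical loop:
loss = torch.tensor(float("nan"))
try:
    check_finite(loss, step=500)
except RuntimeError as err:
    print(err)  # Non-finite loss nan at update 500
```

Alternatively, torch.autograd.set_detect_anomaly(True) can help trace which operation first produced the nan in the backward pass, at the cost of slower training.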
When it reached the following step, it yielded this error
| mmf.common.test_reporter: Predicting for hateful_memes
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
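That ValueError looks like scikit-learn's standard input-validation error, which MMF's AUROC/validation metrics presumably hit once the model starts emitting nan scores. A tiny standalone reproduction (pure scikit-learn, nothing MMF-specific, with hypothetical arrays):

```python
# The same ValueError comes straight from scikit-learn's input validation
# when the predicted scores contain NaN.
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0])
y_score = np.array([0.2, np.nan, 0.7, 0.1], dtype=np.float32)

try:
    roc_auc_score(y_true, y_score)
except ValueError as err:
    # e.g. "Input contains NaN, infinity or a value too large for dtype('float32')."
    # (exact wording depends on the scikit-learn version)
    print(err)
```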
How can I fix this?
Please help me. Thank you in advance.
I can reproduce the issue. We are checking it. Thanks for reporting.
PR #855 should fix this issue.
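For anyone else landing here: once that PR is in main, force-reinstalling MMF should pick up the fix. A hedged example of doing that from a notebook cell (a suggested follow-up, not something stated in the thread):

```python
# Reinstall MMF from the current main branch to pick up the merged fix,
# then restart the Colab runtime before training again.
import subprocess
import sys

subprocess.check_call([
    sys.executable, "-m", "pip", "install", "--upgrade", "--force-reinstall",
    "git+https://github.com/facebookresearch/mmf.git",
])
```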