cocoon2wong / E-Vertical

Official implementation of the paper "Another Vertical View: A Hierarchical Network for Heterogeneous Trajectory Prediction via Spectrums"
https://cocoon2wong.github.io/E-Vertical/
GNU General Public License v3.0

ValueError: Find `nan` values in the loss dictionary, stop training... #3

Open PrafulSinghal-19 opened 6 months ago

PrafulSinghal-19 commented 6 months ago

Hi,

I hope this message finds you well. Thanks a lot for your wonderful work. We came across an error while working with the code, which we describe below.

We were training your model on the SDD dataset, but during training it raised this error: `ValueError: Find nan values in the loss dictionary, stop training... Best metrics obtained from the last epoch: {'avgKey(Metrics)': 23.03153, 'FDE(Metrics)': 31.826933}`. We also tried using `grads, _ = tf.clip_by_global_norm(grads, 5.0)`, but that didn't help either.
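For context, `tf.clip_by_global_norm` rescales the whole gradient list by one shared factor so the combined L2 norm stays under a threshold. A NumPy sketch of that behavior (not code from this repo):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """NumPy sketch of tf.clip_by_global_norm: scale all gradients
    jointly so their combined L2 norm is at most clip_norm."""
    global_norm = float(np.sqrt(sum(np.sum(np.square(g)) for g in grads)))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm
```

Note that clipping only bounds finite gradients; if a loss term is already `nan` in the forward pass (e.g., a log of zero), clipping cannot recover it, which may be why it had no effect here.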


Looking forward to your reply.

Thanks, Praful

cocoon2wong commented 6 months ago

Hi @PrafulSinghal-19, and sorry for the late reply!

Can you provide us with the TRAINING command you are using? I will check what is causing this issue.

PrafulSinghal-19 commented 6 months ago

Hi @cocoon2wong thanks for your reply.

I am running the following command for training: `python main.py --model eva --key_points 3_7_11 --T fft --split sdd --model_name gpu_test`.

Also, there is one more concern. When I run the code on my CPU, I get the error `ValueError: Find nan values in the loss dictionary, stop training...`, but when training on the GPU I get an additional error: `F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0) Thread tf_ creation via pthread_create() failed`.
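As a side note on the GPU error: the `pthread_create()` failure is a system-level error (return code 11 is `EAGAIN` on Linux), meaning the process could not spawn another thread, typically because a thread or memory limit was hit. One hypothetical mitigation (not verified against this repo) is to cap TensorFlow's thread pools via environment variables before launching training:

```shell
# Hypothetical mitigation: cap the number of threads TensorFlow spawns.
# These variables are read by TensorFlow / OpenMP at startup; the exact
# values are guesses and may need tuning for your machine.
export OMP_NUM_THREADS=4
export TF_NUM_INTRAOP_THREADS=4
export TF_NUM_INTEROP_THREADS=2
```

Then run the training command as before. Checking `ulimit -u` (max user processes/threads) is also worth doing, since hitting that limit produces exactly this error code.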

cocoon2wong commented 6 months ago

@PrafulSinghal-19 Okay, I'll try this command to check what went wrong. This may take some time and I will get back to you once I have a definitive conclusion.

From the command you provided, I guess it might be because of the default learning rate and batch size settings (0.001 and 5000), as these defaults are not designed for E-V$^2$-Net. In the weights we provide (https://github.com/cocoon2wong/E-Vertical/releases), the settings for the SDD dataset (EV_co_DFT_sdd/args.json) are

--batch_size 2000 --lr 0.0004

My guess is that the excessive learning rate and batch size in the default settings make training difficult. I'm starting some training runs to verify this idea, and if that is indeed the case, we'll modify the default values to reduce confusion for our readers. You can also try training with the settings above (2000 and 0.0004).
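For readers who prefer pinning these values in a config rather than on the command line, the relevant fields in the released `EV_co_DFT_sdd/args.json` would look something like this (other keys omitted; the exact schema is an assumption, only the two values come from the thread):

```json
{
  "batch_size": 2000,
  "lr": 0.0004
}
```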

I will get back to you later, and I hope these might help you!

PrafulSinghal-19 commented 6 months ago

@cocoon2wong I tried running the code with the following command: `python main.py --model eva --key_points 3_7_11 --T fft --split sdd --model_name gpu_test --gpu 2 --batch_size 2000 --lr 0.0004`, but the same error popped up: `ValueError: Find nan values in the loss dictionary, stop training... Best metrics obtained from the last epoch: {'avgKey(Metrics)': 14.08198, 'FDE(Metrics)': 16.021269}`.

cocoon2wong commented 6 months ago

Update: Since we have now moved to PyTorch, we tried to rebuild the TensorFlow environment, but we ran into a serious problem: the strict version correspondence between TensorFlow and CUDA. We cannot freely change the CUDA version on the server, so TensorFlow can currently train only very slowly on the CPU, and it may take us longer to troubleshoot the error.

Here are two ways that might help:

This is a valuable question, and we will continue troubleshooting to determine whether it is a TensorFlow issue. I'll message you once we have new updates. We hope this helps!
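While troubleshooting, one cheap diagnostic (a generic sketch, not code from this repo) is to report which loss terms go non-finite first, rather than only raising the aggregate "Find `nan` values" error:

```python
import math

def find_bad_terms(loss_dict):
    """Return the names of loss terms that are NaN or infinite, so the
    offending branch of the network can be traced back step by step."""
    return sorted(k for k, v in loss_dict.items()
                  if not math.isfinite(float(v)))
```

For example, `find_bad_terms({'ADE': 1.2, 'FDE': float('nan')})` returns `['FDE']`, which narrows down where in the forward pass the NaN originates.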

(cc @northocean @livepoolq)