cocoon2wong / E-Vertical

Official implementation of the paper "Another Vertical View: A Hierarchical Network for Heterogeneous Trajectory Prediction via Spectrums"
https://cocoon2wong.github.io/E-Vertical/
GNU General Public License v3.0

ValueError: Find `nan` values in the loss dictionary, stop training... #3

Open PrafulSinghal-19 opened 6 months ago

PrafulSinghal-19 commented 6 months ago

Hi,

I hope this message finds you well. Thanks a lot for your wonderful work. We came across an error while working with the code, which we describe below.

We were training your model on the SDD dataset, but during training it raised this error: `ValueError: Find nan values in the loss dictionary, stop training... Best metrics obtained from the last epoch: {'avgKey(Metrics)': 23.03153, 'FDE(Metrics)': 31.826933}`. We also tried using `grads, _ = tf.clip_by_global_norm(grads, 5.0)`, but that didn't help either.
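For context, `tf.clip_by_global_norm` rescales the whole gradient list by one shared factor so the combined L2 norm stays under a threshold. A NumPy sketch of that behavior (not code from this repo):

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """NumPy sketch of tf.clip_by_global_norm: scale all gradients
    jointly so their combined L2 norm is at most clip_norm."""
    global_norm = float(np.sqrt(sum(np.sum(np.square(g)) for g in grads)))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm
```

Note that clipping only bounds finite gradients; if a loss term is already `nan` in the forward pass (e.g., a log of zero), clipping cannot recover it, which may be why it had no effect here.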


Looking forward to your reply.

Thanks, Praful

cocoon2wong commented 6 months ago

Hi @PrafulSinghal-19, and sorry for the late reply!

Can you provide us with the TRAINING command you are using? I will check what is causing this issue.

PrafulSinghal-19 commented 6 months ago

Hi @cocoon2wong thanks for your reply.

I am running the following command for training: `python main.py --model eva --key_points 3_7_11 --T fft --split sdd --model_name gpu_test`.

Also, there is one more concern. When I run the code on my CPU, I get the error `ValueError: Find nan values in the loss dictionary, stop training...`, but when training on the GPU I get an additional error: `F external/local_tsl/tsl/platform/default/env.cc:74] Check failed: ret == 0 (11 vs. 0) Thread tf_ creation via pthread_create() failed`.
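As a side note on the GPU error: the `pthread_create()` failure is a system-level error (return code 11 is `EAGAIN` on Linux), meaning the process could not spawn another thread, typically because a thread or memory limit was hit. One hypothetical mitigation (not verified against this repo) is to cap TensorFlow's thread pools via environment variables before launching training:

```shell
# Hypothetical mitigation: cap the number of threads TensorFlow spawns.
# These variables are read by TensorFlow / OpenMP at startup; the exact
# values are guesses and may need tuning for your machine.
export OMP_NUM_THREADS=4
export TF_NUM_INTRAOP_THREADS=4
export TF_NUM_INTEROP_THREADS=2
```

Then run the training command as before. Checking `ulimit -u` (max user processes/threads) is also worth doing, since hitting that limit produces exactly this error code.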

cocoon2wong commented 6 months ago

@PrafulSinghal-19 Okay, I'll try this command to check what went wrong. This may take some time and I will get back to you once I have a definitive conclusion.

From the command you provided, I guess it might be because of the default learning rate and batch size settings (0.001 and 5000), as these defaults are not designed for E-V$^2$-Net. In the weights we provide (https://github.com/cocoon2wong/E-Vertical/releases), the settings for the SDD dataset (EV_co_DFT_sdd/args.json) are

--batch_size 2000 --lr 0.0004

My guess is that the excessive learning rate and batch size in the default settings make training difficult. I'm starting some training runs to verify this idea, and if that is indeed the case, we'll modify the default values to reduce confusion for our readers. You can also try training with the settings above (2000 and 0.0004).
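For readers who prefer pinning these values in a config rather than on the command line, the relevant fields in the released `EV_co_DFT_sdd/args.json` would look something like this (other keys omitted; the exact schema is an assumption, only the two values come from the thread):

```json
{
  "batch_size": 2000,
  "lr": 0.0004
}
```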

I will get back to you later, and I hope these might help you!

PrafulSinghal-19 commented 6 months ago

@cocoon2wong I tried running the code with the following command: `python main.py --model eva --key_points 3_7_11 --T fft --split sdd --model_name gpu_test --gpu 2 --batch_size 2000 --lr 0.0004`, but the same error popped up: `ValueError: Find nan values in the loss dictionary, stop training... Best metrics obtained from the last epoch: {'avgKey(Metrics)': 14.08198, 'FDE(Metrics)': 16.021269}`.

cocoon2wong commented 6 months ago

Update: Since we have now moved to PyTorch, we tried to rebuild the TensorFlow environment, but we ran into a serious problem: the strict version correspondence between TensorFlow and CUDA. We cannot freely change the CUDA version on the server, so TensorFlow can currently train only very slowly on the CPU, and it may take us longer to troubleshoot the error.

Here are two ways that might help:

This is a valuable question, and we will continue troubleshooting to determine whether it is a TensorFlow issue. I'll message you once we have new updates. We hope this helps!
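While troubleshooting, one cheap diagnostic (a generic sketch, not code from this repo) is to report which loss terms go non-finite first, rather than only raising the aggregate "Find `nan` values" error:

```python
import math

def find_bad_terms(loss_dict):
    """Return the names of loss terms that are NaN or infinite, so the
    offending branch of the network can be traced back step by step."""
    return sorted(k for k, v in loss_dict.items()
                  if not math.isfinite(float(v)))
```

For example, `find_bad_terms({'ADE': 1.2, 'FDE': float('nan')})` returns `['FDE']`, which narrows down where in the forward pass the NaN originates.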

(cc @northocean @livepoolq)