Haiyang-W / DSVT

[CVPR2023] Official Implementation of "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets"
https://arxiv.org/abs/2301.06051
Apache License 2.0

Loss is NaN or Inf #27

Closed · zizhengu closed this issue 1 year ago

zizhengu commented 1 year ago

To reduce the computational cost in my own project, I set `feature_map_stride = 2` (rather than 1, as in your setting) in `TARGET_ASSIGNER_CONFIG`, and the loss became NaN or Inf (and this is not during fp16 training).
I tried three times, and it didn't work. Do you know how to fix this problem? Thank you!
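For reference, the change above lands in an OpenPCDet-style head config; a sketch of what that block typically looks like (the keys other than `FEATURE_MAP_STRIDE` are typical CenterHead values, not copied from this repo's configs):

```yaml
TARGET_ASSIGNER_CONFIG:
    FEATURE_MAP_STRIDE: 2   # changed from 1 here to cut compute
    NUM_MAX_OBJS: 500
    GAUSSIAN_OVERLAP: 0.1
    MIN_RADIUS: 2
```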

Haiyang-W commented 1 year ago

Please check the head code carefully. Some of the parameters the head inherits may not match what you changed, such as the voxel size and so on.
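As a rough illustration of the kind of mismatch meant here: the grid the head assigns targets on must match the BEV grid implied by the point cloud range, voxel size, and stride, or the center targets land on the wrong cells. A sanity-check sketch (all values and names below are illustrative, not code from this repo):

```python
# Does the head's target map match the BEV grid the backbone produces?
import numpy as np

point_cloud_range = np.array([-74.88, -74.88, -2.0, 74.88, 74.88, 4.0])
voxel_size = np.array([0.32, 0.32, 6.0])
feature_map_stride = 2  # the value changed in this issue

# Full-resolution BEV grid (x, y cells).
grid_size = np.round(
    (point_cloud_range[3:5] - point_cloud_range[0:2]) / voxel_size[0:2]
).astype(int)

# Resolution the head assigns center targets on.
target_map_size = grid_size // feature_map_stride
print(grid_size, target_map_size)  # e.g. [468 468] -> [234 234]

# The backbone's output feature map must actually have this spatial size;
# if it is still at stride 1, predictions and targets are misaligned.
```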

You can try that first. I am busy with some deadlines and will check it in a few days.

It's not a hard problem; we've run into it before, and it's easy to solve if you dig into DSVT.

Haiyang

zizhengu commented 1 year ago

Got it, I'll check the head again. Thanks again for your patience!

Haiyang-W commented 1 year ago

To save CUDA memory, you can try torch checkpointing, which can cut GPU memory consumption by roughly 50%.
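For example, a minimal sketch of activation checkpointing with `torch.utils.checkpoint` on a reasonably recent PyTorch (the block structure is a stand-in, not the actual DSVT module):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class CheckpointedBackbone(nn.Module):
    """Runs each block under checkpointing during training."""

    def __init__(self, blocks):
        super().__init__()
        self.blocks = nn.ModuleList(blocks)

    def forward(self, x):
        for block in self.blocks:
            if self.training and x.requires_grad:
                # Activations inside `block` are recomputed during the
                # backward pass instead of being stored, trading extra
                # compute for lower peak memory.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x


# Usage sketch: wrap a stack of expensive blocks.
blocks = [nn.Sequential(nn.Linear(64, 64), nn.ReLU()) for _ in range(4)]
model = CheckpointedBackbone(blocks)
out = model(torch.randn(2, 64, requires_grad=True))
```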

zizhengu commented 1 year ago

I tried different learning rates, and 0.002 was suitable for my own dataset and config, which solved the NaN loss problem. Thanks for your solid work!

Haiyang-W commented 1 year ago

I'm not entirely certain that adjusting the learning rate resolves the underlying NaN loss issue. I still suspect that some of the hyperparameters are misaligned, so I would strongly recommend double-checking them; you might even gain some performance. I will revisit this problem when I have time.

Congratulations on your promotion! I wish you all the best. :) If you have any questions or concerns, please don't hesitate to bring them up.

Haiyang