ZhangAoCanada / RADDet

Range-Azimuth-Doppler Based Radar Object Detection

nan values for loss when running on GPU #37

Open · Timotheevin opened this issue 1 year ago

Timotheevin commented 1 year ago

Hello, I managed to train the model on the CPU and it worked fine, but when running on the GPU the loss prints nan from training step 2 onward. I tried several versions of the CUDA toolkit, cuDNN, and TensorFlow, but none of them solved the issue. I also tried reducing the batch size and the learning rates (all of them), but that doesn't fix it either. I figured out that it is the backbone that outputs infinite values, but I haven't dug deeper yet.

Would you have any ideas on how to fix this?
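One way to pinpoint the first op that emits non-finite values, rather than bisecting by hand, is TensorFlow's built-in numeric checking; a minimal sketch (model construction and the training loop are assumed to exist around it, this is not the actual RADDet script):

    import tensorflow as tf

    # Inserts a check after every op; execution aborts with the name and
    # stack trace of the first op that produces a NaN or Inf tensor.
    tf.debugging.enable_check_numerics()

    # ... build the model and run one training step as usual ...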

dongyu-du commented 9 months ago

Have you solved this? I ran into the same problem.

Timotheevin commented 9 months ago

Not really. The only workaround I found is to force the operations that output nan to run on the CPU, using:

with tf.device('/CPU:0'):
   # instructions

This way it is a bit faster than training entirely on the CPU, but still far from optimal.
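Concretely, pinning just the suspect part of the graph looks roughly like this; backbone, head, and x are hypothetical placeholder names, not the actual RADDet symbols:

    import tensorflow as tf

    def forward(backbone, head, x):
        # Run only the numerically unstable part on the CPU ...
        with tf.device('/CPU:0'):
            features = backbone(x)
        # ... and keep the rest of the graph on the GPU.
        return head(features)

Everything inside the with block runs on the CPU and its outputs are copied back to the GPU for the rest of the graph, which is where the remaining slowdown comes from.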

dongyu-du commented 9 months ago

I can train the model on an A100 GPU now. My environment is as follows for your reference:

tensorflow == 2.5.0
cuda == 11.1
cudnn == 8.2.0
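To check whether a given install actually matches that CUDA/cuDNN pair, the build info baked into the TensorFlow binary can be printed (a small sketch; key names may differ slightly between TF releases):

    import tensorflow as tf

    # CUDA/cuDNN versions this TensorFlow binary was built against.
    info = tf.sysconfig.get_build_info()
    print("CUDA:", info.get("cuda_version"), "cuDNN:", info.get("cudnn_version"))
    # Confirm the GPU is actually visible to TensorFlow.
    print("GPUs:", tf.config.list_physical_devices("GPU"))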

Timotheevin commented 9 months ago

Thanks, I'll try that.

Gabo181 commented 7 months ago

Did you figure out why this happens? Same problem with tensorflow==2.10.

Training works fine with tensorflow v2.15.0 and the DirectML plugin. Training on the CPU also works as intended. Only the older versions combined with GPU training result in nan values.
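Until the root cause is clear, a guard that fails fast on a non-finite loss at least avoids wasting a full run; a minimal sketch of such a check in a custom training step (a generic stand-in, not RADDet's actual loop):

    import tensorflow as tf

    @tf.function
    def train_step(model, optimizer, loss_fn, x, y):
        with tf.GradientTape() as tape:
            loss = loss_fn(y, model(x, training=True))
        # Abort immediately with a clear error instead of training on garbage.
        tf.debugging.assert_all_finite(loss, "loss became NaN/Inf")
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
        return loss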

Gabo181 commented 7 months ago

This workflow (via Conda) did it for me:

conda create -n tensorflow_23 python=3.8
conda activate tensorflow_23
conda install -c anaconda cudatoolkit=10.1.243
conda install -c anaconda cudnn=7.6.5

pip install tensorflow==2.3 opencv-python==4.1.2.30 numpy==1.18.5 matplotlib==3.3.1 scikit-learn==0.23.2 tqdm==4.50.2 scikit-image==0.17.2

huiwenXie commented 3 months ago

Hello, have you solved this problem? I ran into it these days. Could I get some help or advice from you? Thanks a lot!

dongyu-du commented 3 months ago

I just used the CPU to train.

huiwenXie commented 3 months ago

> I just used the CPU to train.

Thanks

huiwenXie commented 3 months ago

> This workflow (via Conda) did it for me:
>
> conda create -n tensorflow_23 python=3.8
> conda activate tensorflow_23
> conda install -c anaconda cudatoolkit=10.1.243
> conda install -c anaconda cudnn=7.6.5
>
> pip install tensorflow==2.3 opencv-python==4.1.2.30 numpy==1.18.5 matplotlib==3.3.1 scikit-learn==0.23.2 tqdm==4.50.2 scikit-image==0.17.2

Hello, I don't quite understand what you mean. Do you mean that it works well on the GPU using this Conda workflow you posted? After following it, do the nan values disappear?

dongyu-du commented 3 months ago

This is not my workflow. I remember trying it as well, but it didn't work.