Timotheevin opened this issue 1 year ago
Have you solved this? I ran into the same problem.
Not really. The only thing I found is to force the operations that output nan to run on the CPU using:

```python
with tf.device('/CPU:0'):
    # instructions
```

This way it is a bit faster, but not optimal at all.
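For reference, a minimal sketch of that workaround, assuming the problematic part is a Keras model called `backbone` (the name and architecture here are just placeholders, not from the original code):

```python
import tensorflow as tf

# Hypothetical stand-in for whatever layer/model produces nan on the GPU.
backbone = tf.keras.applications.ResNet50(include_top=False, weights=None)

def forward(images):
    # Pin only the problematic computation to the CPU;
    # everything else can stay on the GPU.
    with tf.device('/CPU:0'):
        features = backbone(images, training=True)
    return features
```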
I can train the model on an A100 GPU now. My environment is as follows for your reference: tensorflow == 2.5.0, cuda == 11.1, cudnn == 8.2.0
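If it helps, a quick way to check which CUDA/cuDNN versions your installed TensorFlow wheel was built against (available in recent TF 2.x releases) is something like:

```python
import tensorflow as tf

# Print the CUDA/cuDNN versions this TensorFlow build expects
# (keys may be absent on CPU-only builds, hence .get()).
build_info = tf.sysconfig.get_build_info()
print(build_info.get('cuda_version'))
print(build_info.get('cudnn_version'))
```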
Thanks, I'll try that.
Did you figure out why this happens? Same problem with tensorflow==2.10.
Training works fine with tensorflow v2.15.0 and the DirectML plugin. Training on the CPU also works as intended. Only older versions with GPU training result in nan values.
This workflow (via Conda) did it for me:

```
conda create -n tensorflow_23 python=3.8
conda activate tensorflow_23
conda install -c anaconda cudatoolkit=10.1.243
conda install -c anaconda cudnn=7.6.5
pip install tensorflow==2.3 opencv-python==4.1.2.30 numpy==1.18.5 matplotlib==3.3.1 scikit-learn==0.23.2 tqdm==4.50.2 scikit-image==0.17.2
```
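After setting the environment up, a quick sanity check (just a sketch) that TensorFlow actually picks up the GPU:

```python
import tensorflow as tf

# Should list at least one PhysicalDevice if CUDA/cuDNN are found correctly.
print(tf.config.list_physical_devices('GPU'))
print(tf.test.is_built_with_cuda())
```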
Hello, have you solved this problem? I ran into it recently. Can I get some help or advice from you? Thanks a lot!
I just used CPU to train
> I just used CPU to train

Thanks
> This workflow (via Conda) did it for me:
>
> ```
> conda create -n tensorflow_23 python=3.8
> conda activate tensorflow_23
> conda install -c anaconda cudatoolkit=10.1.243
> conda install -c anaconda cudnn=7.6.5
> pip install tensorflow==2.3 opencv-python==4.1.2.30 numpy==1.18.5 matplotlib==3.3.1 scikit-learn==0.23.2 tqdm==4.50.2 scikit-image==0.17.2
> ```
Hello, I cannot understand what you mean. Do you mean that it works well on GPU when using this Conda workflow you posted? After using it, will the 'nan' values disappear?
This is not my workflow. I remember I also tried it, but it didn't work for me.
Hello, I managed to train the model on the CPU and it worked fine, but when running on the GPU, starting from training step 2, the loss prints nan. I tried several versions of the CUDA toolkit, cuDNN and TensorFlow, but none of them solved the issue. I also tried reducing the batch size and the learning rates (all of them), but that doesn't fix it either. I figured out that it is the backbone that outputs infinite values, but I haven't gone deeper yet.
Would you have any ideas to fix this?
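In case it helps others debug this, here is a minimal sketch of how the non-finite values can be traced; the `backbone` and `images` below are hypothetical stand-ins, not the actual model from this issue:

```python
import tensorflow as tf

# Raise an error at the first op that produces inf/nan, with a trace
# pointing at the offending operation.
tf.debugging.enable_check_numerics()

# Hypothetical stand-ins for the real backbone and a training batch.
backbone = tf.keras.applications.ResNet50(include_top=False, weights=None)
images = tf.random.uniform((2, 224, 224, 3))

features = backbone(images, training=True)
# Explicit check on a single tensor, in case the global check is too noisy.
tf.debugging.check_numerics(features, message="backbone output is not finite")
```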