ZhangAoCanada / RADDet

Range-Azimuth-Doppler Based Radar Object Detection
MIT License

Using Tensorflow DirectML Plugin & distributed learning #45

Open Gabo181 opened 7 months ago

Gabo181 commented 7 months ago

Hi!

All versions under 2.10 result in NaN values for the loss, etc. (as mentioned by another user). Therefore I am using v2.10-cpu with the DirectML plugin, since native Windows GPU support isn't available in newer TensorFlow releases anymore.
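
For context, this is roughly how I check that the plugin actually picks up the GPUs; a minimal sketch, assuming the `tensorflow-cpu` 2.10 wheel plus the `tensorflow-directml-plugin` package from PyPI:

```python
# Assumed install (package names as published on PyPI):
#   pip install tensorflow-cpu==2.10.0 tensorflow-directml-plugin
import tensorflow as tf

print(tf.__version__)
# With the DirectML plugin loaded, the adapters should be listed as GPU devices.
print(tf.config.list_physical_devices("GPU"))
```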

I noticed that my GPU (RTX 4090) is only running at about 10% utilization and only using about 4 GB of memory. Is there a way to

  1. enable distributed training (I have two RTX 4090s), and
  2. force the model to use 100% of the GPU memory? (A rough sketch of what I mean follows this list.)
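
For illustration only, here is a minimal sketch using stock TensorFlow APIs; I am not sure whether `tf.distribute.MirroredStrategy` works under the DirectML plugin, and the model below is a placeholder rather than RADDet's actual network:

```python
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
print("Visible GPUs:", gpus)

# Memory growth allocates GPU memory on demand instead of reserving it up
# front; actual usage is then driven mostly by the batch size.
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)

# Data-parallel training across both cards (the stock TF approach).
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

with strategy.scope():
    # Placeholder model; RADDet's model and losses would be built here instead.
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    model.compile(optimizer="adam", loss="mse")
```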

Thanks in advance!

IqbalBan commented 4 months ago

Hi,

I am also using an RTX 4090 and have been struggling with this NaN loss problem. How has the DirectML plugin fared for you? Is the speed good, and would you recommend it, or have you switched to a different GPU?

IqbalBan commented 4 months ago

I set up the DirectML plugin and the training speed has been good, though I am not sure whether it would have been faster with CUDA.

As for your question, I believe the low GPU memory use is due to the original batch size of 3. I was able to use more memory with a batch size of 16 (32 would not run). However, I do not want to change the initial conditions used by the original researchers too much, so I am currently training with a batch size of 4.
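
In case it helps, a small sketch of how I bump the batch size; the config path and key names here are assumptions about the repo's config file, not its actual schema, so adjust them to whatever the loader expects:

```python
import json

def set_batch_size(config_path: str, batch_size: int) -> None:
    """Rewrite the training batch size in a JSON config (hypothetical keys)."""
    with open(config_path, "r") as f:
        config = json.load(f)
    # "TRAIN"/"batch_size" are illustrative key names; check the real config.
    config["TRAIN"]["batch_size"] = batch_size
    with open(config_path, "w") as f:
        json.dump(config, f, indent=4)

if __name__ == "__main__":
    # A batch size of 16 fit in 24 GB on my RTX 4090; 32 did not.
    set_batch_size("config.json", 4)
```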