cms-ml / cmsml

Python Package of the CMS Machine Learning Group
https://cmsml.readthedocs.io
BSD 3-Clause "New" or "Revised" License
19 stars 6 forks source link

Add first keras callback. #4

Closed riga closed 3 years ago

riga commented 3 years ago

Based on the discussion in #3.

Things to do:

Closes #3.

riga commented 3 years ago

Can you run the tests on a GPU as well, @pfackeldey ?

pfackeldey commented 3 years ago

Hi @riga ,

I just tested it and it works like a charm :) I have a 3 small comments:

  1. I think the measured memory of the GPU device is in MiB units instead of MB (same as nvidia-smi), the difference is not significant though...
  2. Maybe one want to make this clear in a __doc__-string: this callback logs the full usage of the GPUs, if you work with other people on the same machine and they also use GPUs the logging does not represent what your process/training alone is consuming.
  3. Logging the usage of all GPUs might lead to "spam" or "unnecessary information". It is not so common to have cross-GPU-device trainings. It is more common to train on a machine with multiple GPUs, but just using a single GPU.

Fyi training loop logs look like this now 🎉 :

Epoch 1/60
1336/1336 - 5s - loss: 2749.4241 - resolution: 50.5707 - GPU 0 usage [%]: 13.8466 - GPU 0 vRAM [%]: 95.8586 - GPU 0 vRAM [MB]: 7782.9375 - val_loss: 2297.2878 - val_resolution: 47.0054 - lr: 0.0030
Epoch 2/60
1336/1336 - 4s - loss: 2315.0986 - resolution: 47.1101 - GPU 0 usage [%]: 14.6774 - GPU 0 vRAM [%]: 95.8586 - GPU 0 vRAM [MB]: 7782.9375 - val_loss: 2255.6926 - val_resolution: 46.6172 - lr: 0.0030
Epoch 3/60
1336/1336 - 4s - loss: 2280.3914 - resolution: 46.7554 - GPU 0 usage [%]: 14.9843 - GPU 0 vRAM [%]: 95.8586 - GPU 0 vRAM [MB]: 7782.9375 - val_loss: 2254.2732 - val_resolution: 46.5894 - lr: 0.0030

What do you think? Best, Peter

riga commented 3 years ago

👍 I made the stats configurable and I added the required docs.