Open · auphofBSF opened 5 years ago
> Hey! As far as I know, Tensorboard doesn't provide any information with respect to resource usage. However, I'm not an expert, so I'll read about it later to see what happens. In my experience, resources can be tracked with independent tools; I usually use htop for CPU & memory, and watch -n 0.5 nvidia-smi for GPU usage. I know it's not the best option, but at least it's something. It might not be that easy, since it'll depend on your OS and hardware (especially for GPU).
I have found some useful Python bindings to the NVIDIA Management Library (NVML), and there is the Python psutil package; together they can provide memory, GPU, CPU, per-process and other metrics, and they appear to support all platforms. Here is a notebook where I tested and documented my research: https://colab.research.google.com/drive/1x2f6Lt6aX_WNdLz5uIqf6EaoBUG1gzYo

A simple or extensible set of metrics could then be easily retrieved if an NVML binding and psutil were added into imageai and exposed by a set of objects, and then published by something like prometheus_client for simple external monitoring of resources. The more detailed deep learning analytics would still be done through Tensorboard.
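To make the idea concrete, here is a minimal sketch (my own illustration, not code from the notebook) of sampling basic metrics with psutil and the pynvml bindings; the function name and the choice of GPU index 0 are assumptions:

```python
import psutil
import pynvml

def sample_resource_metrics():
    # CPU and system memory via psutil (cross-platform).
    metrics = {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "ram_used_mb": psutil.virtual_memory().used / 1024 ** 2,
    }
    # GPU utilisation and memory via the NVML bindings, if available.
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU only (assumption)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        metrics["gpu_percent"] = util.gpu
        metrics["gpu_mem_used_mb"] = mem.used / 1024 ** 2
        pynvml.nvmlShutdown()
    except pynvml.NVMLError:
        pass  # no NVIDIA GPU or driver: fall back to CPU-only metrics
    return metrics

print(sample_resource_metrics())
```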
> With respect to metrics on training, I think Tensorboard does a good job. However, there are still two points left:
> - it is added on training for custom objects; we should add it on the other modules (image classification training)
> - I didn't update the docs. To read what is currently being logged to Tensorboard, you need to run tensorboard --logdir=data_directory/logs and then open it in your browser
Thanks, I will try this on my current object detection training runs.
> - I'm not sure how useful it is while evaluating the model once it's trained. However, the evolution of the error on the evaluation set is currently being tracked while training; I don't see it being useful while evaluating.

Agreed, but this is where the psutil and NVML metrics are useful.
In my notebook above I also list an article, https://towardsdatascience.com/measuring-actual-gpu-usage-for-deep-learning-training-e2bf3654bcfd. I think WANDB covers a bit of what Tensorboard does, and it does require some deep embedding in the code, so I don't think it is appropriate now, but it has pointed me in the right direction for finding resource metrics that can easily be embedded.
Hey! Sorry for taking so many days to answer.
I read your notebook, and this looks promising. However, it may add more restrictions to the library. Consider the following scenarios: what if no GPU is available in the environment that runs the training? What if one is available but it's not NVIDIA? That's why I believe it is better to detach the resource-measurement phase from the training one.

One more thing: if we want to see the resource usage of the process that is running, it may be hard, because the same process that is training would need to measure its own resource usage; the moment it goes to see what it is using, it changes its own behaviour (now it is measuring resources instead of training).
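One way to detach the two, sketched below under my own assumptions (the script name, PID argument, and 0.5 s interval are illustrative), is to run a separate monitoring process that observes the training process from outside with psutil, so the training loop itself is not perturbed:

```python
import sys
import time
import psutil

def monitor(pid, interval=0.5):
    # Observe the training process from a separate process so that
    # measurement does not alter what the training process is doing.
    proc = psutil.Process(pid)
    while proc.is_running():
        with proc.oneshot():  # batch the underlying syscalls per sample
            cpu = proc.cpu_percent(interval=None)
            rss_mb = proc.memory_info().rss / 1024 ** 2
        print(f"cpu={cpu:.1f}% rss={rss_mb:.1f}MB")
        time.sleep(interval)

if __name__ == "__main__":
    monitor(int(sys.argv[1]))  # usage: python monitor.py <training_pid>
```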
With respect to the training progress output, I don't know why the fit method is being called with verbose=2. This is why the actual training progress is not being shown on screen, and only the number of epochs run is shown. Changing it to verbose=1 (or removing it, because it's the default value) shows a live per-batch progress bar while training.

I'm not sure why it is currently set to verbose=2; this is an easy change that helps a lot. I would like to hear @OlafenwaMoses's thoughts on this.
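For illustration, here is a tiny self-contained Keras example (a stand-in model and random data, not ImageAI code) showing what the two verbose settings print:

```python
import numpy as np
from tensorflow import keras

# Stand-in model and data, just to demonstrate the flag.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")
x, y = np.random.rand(64, 4), np.random.rand(64, 1)

model.fit(x, y, epochs=2, verbose=2)  # one summary line per epoch
model.fit(x, y, epochs=2, verbose=1)  # live per-batch progress bar (the default)
```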
@auphofBSF, did you try to follow the training progress on Tensorboard?
With respect to the verbose flag, from what I can see on master, it's already merged; it was added in this commit by @OlafenwaMoses.

Great news for me :smile: great work :muscle:

We should add the same on training for the other modules, just to keep them coherent.
Wow! This is one incredible discussion here, @rola93 and @auphofBSF. I really appreciate all the effort, tinkering and insights you are putting in to make this library better. Allow me to make my humble comments on the matters raised.

On the verbose=2 value during training: my apologies for that bug there. I appreciate you calling my attention to it, @rola93.
For custom object detection training, other model training, and model evaluation functions, provide more detailed, realtime progress, performance, and resource-usage metrics.
@rola93 has provided great access to the final evaluation metrics (specifically mAP) of a custom object detection model in #302.
Ideally, some new enhancements will further allow control over the granularity and sampling interval of a set of realtime basic performance and resource-usage metrics for any training or evaluation run.
I extensively use the prometheus_client module (import prometheus_client) for monitoring long-running Python applications; it's lightweight, easy to add, and just works, so it could work for this enhancement proposal. Alternatively, @rola93 has incorporated Tensorboard support in PR #291, which is a terrific step for detailed model-training analysis, but I don't know how to use it for basic progress metrics.
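As a sketch of how that could look (the metric names, port, and placement in a training loop are my assumptions, not an ImageAI API):

```python
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; a Prometheus server would scrape
# http://localhost:8000/metrics while the training process runs.
epoch_gauge = Gauge("imageai_epoch", "Current training epoch")
loss_gauge = Gauge("imageai_loss", "Most recent training loss")

start_http_server(8000)  # serves /metrics from a background thread

# Updating the gauges from inside a training loop:
for epoch in range(5):
    loss = 1.0 / (epoch + 1)  # placeholder for the real loss value
    epoch_gauge.set(epoch)
    loss_gauge.set(loss)
```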
I look forward to other suggestions and to possibly putting this on the enhancements list.
Originally posted by @auphofBSF in https://github.com/OlafenwaMoses/ImageAI/pull/302#issuecomment-523647338