Open · auphofBSF opened 5 years ago
> Hey! As far as I know, Tensorboard doesn't provide any information with respect to resource usage. However, I'm not an expert, so I'll read about it later to see what happens. In my experience, resources can be tracked with independent tools; I usually use htop for CPU & memory, and watch -n 0.5 nvidia-smi for GPU usage. I know it's not the best option, but at least it's something. It might not be that easy, since it'll depend on your OS and hardware (especially for GPU).
I have found some useful Python bindings to the NVIDIA Management Library (NVML), and there is the Python psutil package; together they can provide memory, GPU, CPU, per-process and other metrics, and they appear to support all platforms. Here is a notebook where I tested and documented my research: https://colab.research.google.com/drive/1x2f6Lt6aX_WNdLz5uIqf6EaoBUG1gzYo

A simple or extensible set of metrics could then be easily retrieved if an NVML binding and psutil were added into imageai and exposed by a set of objects, and then published by something like prometheus_client for simple external monitoring of resources. The more detailed deep learning analytics would still be done through Tensorboard.
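To make the idea concrete, here is a minimal sketch (my own illustration, not code from the notebook) of sampling basic metrics with psutil and the pynvml bindings; the function name and the choice of GPU index 0 are assumptions:

```python
import psutil
import pynvml

def sample_resource_metrics():
    # CPU and system memory via psutil (cross-platform).
    metrics = {
        "cpu_percent": psutil.cpu_percent(interval=None),
        "ram_used_mb": psutil.virtual_memory().used / 1024 ** 2,
    }
    # GPU utilisation and memory via the NVML bindings, if available.
    try:
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU only (assumption)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        metrics["gpu_percent"] = util.gpu
        metrics["gpu_mem_used_mb"] = mem.used / 1024 ** 2
        pynvml.nvmlShutdown()
    except pynvml.NVMLError:
        pass  # no NVIDIA GPU or driver: fall back to CPU-only metrics
    return metrics

print(sample_resource_metrics())
```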
> With respect to metrics on training, I think Tensorboard does a good job. However, there are still two points left:
> - it is added on training for custom objects; we should add it on the other modules (image classification training)
> - I didn't update the docs. To read what is currently being logged to Tensorboard, you need to run tensorboard --logdir=data_directory/logs and then open it in your browser
Thanks, I will try this on my current object detection training runs.
> - I'm not sure how useful it is while evaluating the model once it's trained. However, the evolution of the error on the evaluation set is currently being tracked while training; I don't see it being useful while evaluating.

Agreed, but this is where the psutil and NVML metrics are useful.
In my notebook above I also list an article, https://towardsdatascience.com/measuring-actual-gpu-usage-for-deep-learning-training-e2bf3654bcfd. I think WANDB covers a bit of what Tensorboard does, and it does require some deep embedding in the code, so I don't think it is appropriate now, but it has pointed me in the right direction for finding resource metrics that can easily be embedded.
Hey! Sorry for taking so many days to answer.
I read your notebook, and this looks promising. However, it may add more restrictions to the library. Consider the following scenarios: what if no GPU is available in the environment that runs the training? What if one is available but it's not NVIDIA? That's why I believe it is better to detach the resource-measurement phase from the training one.

One more thing: if we want to see the resource usage of the process that is running, it may be hard, because the same process that is training would need to measure its own resource usage; the moment it goes to see what it is using, it changes its own behaviour (now it is measuring resources instead of training).
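One way to detach the two, sketched below under my own assumptions (the script name, PID argument, and 0.5 s interval are illustrative), is to run a separate monitoring process that observes the training process from outside with psutil, so the training loop itself is not perturbed:

```python
import sys
import time
import psutil

def monitor(pid, interval=0.5):
    # Observe the training process from a separate process so that
    # measurement does not alter what the training process is doing.
    proc = psutil.Process(pid)
    while proc.is_running():
        with proc.oneshot():  # batch the underlying syscalls per sample
            cpu = proc.cpu_percent(interval=None)
            rss_mb = proc.memory_info().rss / 1024 ** 2
        print(f"cpu={cpu:.1f}% rss={rss_mb:.1f}MB")
        time.sleep(interval)

if __name__ == "__main__":
    monitor(int(sys.argv[1]))  # usage: python monitor.py <training_pid>
```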
With respect to the training progress output, I don't know why the fit method is being called with verbose=2. This is why the actual training progress is not being shown on screen, and only the number of epochs run is shown. Changing it to verbose=1 (or removing it, because it's the default value) shows a live per-batch progress bar while training.

I'm not sure why it is currently set to verbose=2; this is an easy change that helps a lot. I would like to hear @OlafenwaMoses's thoughts on this.
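For illustration, here is a tiny self-contained Keras example (a stand-in model and random data, not ImageAI code) showing what the two verbose settings print:

```python
import numpy as np
from tensorflow import keras

# Stand-in model and data, just to demonstrate the flag.
model = keras.Sequential([keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="sgd", loss="mse")
x, y = np.random.rand(64, 4), np.random.rand(64, 1)

model.fit(x, y, epochs=2, verbose=2)  # one summary line per epoch
model.fit(x, y, epochs=2, verbose=1)  # live per-batch progress bar (the default)
```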
@auphofBSF, did you try to follow the training progress on Tensorboard?
With respect to the verbose flag, from what I can see on master, it's already merged; it was added in this commit by @OlafenwaMoses.

Great news for me :smile: great work :muscle:

We should add the same on training for the other modules, just to keep them coherent.
Wow! This is one incredible discussion here, @rola93 and @auphofBSF. I really appreciate all the effort, tinkering and insights you are putting in to make this library better. Allow me to make my humble comments on the matters raised.

On the verbose=2 value during training: my apologies for that bug there. I appreciate you calling my attention to it, @rola93.
For custom object detection training, other model training, and model evaluation functions, provide more detailed, realtime progress, performance, and resource-usage metrics.
@rola93 has provided great access to the final evaluation metrics (specifically mAP) of a custom object detection model in #302.
Ideally, some new enhancements will further allow control over the granularity and sampling interval of a set of realtime basic performance and resource-usage metrics for any training or evaluation run.
I extensively use the prometheus_client module (import prometheus_client) for monitoring long-running Python applications; it's lightweight, easy to add, and just works, so it could work for this enhancement proposal. Alternatively, @rola93 has incorporated Tensorboard support in PR #291, which is a terrific step for detailed model-training analysis, but I don't know how to use it for basic progress metrics.
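As a sketch of how that could look (the metric names, port, and placement in a training loop are my assumptions, not an ImageAI API):

```python
from prometheus_client import Gauge, start_http_server

# Hypothetical metric names; a Prometheus server would scrape
# http://localhost:8000/metrics while the training process runs.
epoch_gauge = Gauge("imageai_epoch", "Current training epoch")
loss_gauge = Gauge("imageai_loss", "Most recent training loss")

start_http_server(8000)  # serves /metrics from a background thread

# Updating the gauges from inside a training loop:
for epoch in range(5):
    loss = 1.0 / (epoch + 1)  # placeholder for the real loss value
    epoch_gauge.set(epoch)
    loss_gauge.set(loss)
```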
I look forward to other suggestions and to possibly putting this on the enhancements list.
Originally posted by @auphofBSF in https://github.com/OlafenwaMoses/ImageAI/pull/302#issuecomment-523647338