TunnRL / TunnRL_TBM_maintenance

Working repository for the code of the TunnRL TBM project
MIT License

Increase training speed by testing different techniques #17

Open tfha opened 2 years ago

tfha commented 2 years ago
  1. Profile the code with standard profiling techniques (cProfile etc.) to find the parts worth optimizing (see the sketch after this list): https://machinelearningmastery.com/profiling-python-code/
  2. MKL
  3. Running on the NGI Odin machine
  4. Running on Azure (if it is available for use in our NGI Azure cloud) or AWS/Google Cloud etc.
  5. Sharing a database and running Optuna optimization from many machines against the same database files
  6. Making it possible to automatically kick off as many processes as there are CPUs on a machine, i.e. parallelization
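
A minimal profiling sketch for point 1, assuming a callable `train_agent()` exists (its real signature in this repo may differ): run one training pass under cProfile and print the functions with the highest cumulative time.

```python
import cProfile
import pstats


def train_agent() -> None:
    """Placeholder standing in for the project's real training routine."""
    ...


if __name__ == "__main__":
    # Collect timing statistics for one training run.
    with cProfile.Profile() as profiler:
        train_agent()

    # Show the 20 functions with the highest cumulative time; these are
    # usually the parts worth optimizing.
    stats = pstats.Stats(profiler)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(20)
```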
tfha commented 2 years ago

Numbers 3, 5 and 6 are now tested and OK.
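
A hedged sketch of how points 5 and 6 can work together with Optuna: every machine opens the same study in a shared storage, and each machine runs one worker per CPU. The study name, storage URL and objective below are illustrative assumptions, not the project's actual values.

```python
import os

import optuna


def objective(trial: optuna.Trial) -> float:
    # Placeholder objective; the real one would train and evaluate an agent.
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    return lr  # dummy value


if __name__ == "__main__":
    study = optuna.create_study(
        study_name="tbm_maintenance",          # hypothetical study name
        storage="sqlite:///optuna_study.db",   # same file/DB for every machine
        load_if_exists=True,                   # join an existing study (point 5)
        direction="maximize",
    )
    # n_jobs=os.cpu_count() kicks off one worker per CPU on this machine (point 6).
    study.optimize(objective, n_trials=100, n_jobs=os.cpu_count() or 1)
```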

tfha commented 2 years ago

Point 2 is already implemented in the existing libraries.
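
Since MKL typically enters through the numerical libraries themselves (e.g. a NumPy build linked against MKL), a quick check of the build configuration is usually all that is needed; a small sketch:

```python
import numpy as np

# Prints the BLAS/LAPACK backends this NumPy build links against;
# an MKL-linked build lists MKL here.
np.show_config()
```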

tfha commented 2 years ago

A report from profiling the train_agent method. This highlights a number of processes to handle.

Some thoughts on significant processes that take time:

tfha commented 2 years ago

I have tested running on Azure, but it did not work properly. I will try once more with some new advice.