Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

Add Multiprocessing with Multi-GPU option for data parallelism & training to speed up training #62

Closed developeralgo8888 closed 5 years ago

developeralgo8888 commented 6 years ago

@Kismuz, could you please add a GPU option so that multiprocessing works with both GPUs and CPUs? Right now it only works with CPUs. Data-parallel multi-GPU training could do the heavy lifting on the GPUs, with the policy gradient updates applied on the CPU. That would speed things up considerably.
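For reference, a minimal sketch of the placement being requested, in plain TensorFlow 2.x (illustrative only: `/GPU:0` assumes one visible GPU, and none of this is existing btgym code):

```python
import tensorflow as tf

# Hypothetical sketch of the split described above: the heavy
# forward/backward math placed on the GPU, with variable storage and
# the gradient update applied on the CPU. Not btgym code.
with tf.device("/CPU:0"):
    w = tf.Variable(tf.random.normal([10, 1]))       # policy parameters live on CPU
    optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

x = tf.random.normal([32, 10])                       # dummy batch of observations
y = tf.random.normal([32, 1])                        # dummy targets

with tf.device("/GPU:0"):                            # assumes one visible GPU
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) - y))
    grads = tape.gradient(loss, [w])                 # heavy lifting on the GPU

with tf.device("/CPU:0"):
    optimizer.apply_gradients(zip(grads, [w]))       # the update itself stays on CPU
```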

Kismuz commented 6 years ago

@developeralgo8888 , see: #26, #48

Apart from the algorithms side, the biggest speed bottleneck is actually btgym/backtrader itself, since environment iteration is pure Python and therefore quite slow. To achieve a significant speedup, one would have to reimplement the backtrader engine and the btgym shell from scratch in a lower-level language like C.
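To make that concrete, here is a toy measurement of the overhead in question (illustrative only; timings vary by machine): the same arithmetic once as a bar-by-bar Python loop, as in an event-driven backtester, and once as a single vectorized pass through C via NumPy.

```python
import timeit
import numpy as np

prices = np.random.rand(1_000_000)  # a million synthetic "bars"

def python_loop():
    # One interpreter dispatch per bar -- this is where the time goes.
    total = 0.0
    for p in prices:
        total += p * 1.0001
    return total

def vectorized():
    # The same arithmetic in one C-level pass.
    return float(np.sum(prices * 1.0001))

print("python loop:", timeit.timeit(python_loop, number=1))
print("vectorized :", timeit.timeit(vectorized, number=1))
```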

developeralgo8888 commented 5 years ago

@Kismuz, are you planning to reimplement it in pure C? I'm not sure how easy or difficult that would be, and it would probably be very time-consuming. There is a C variant of Python: Cython, a compiled language that generates CPython extension modules. Cython is a superset of Python, designed to give C-like performance with code that is written mostly in Python.

I believe backtrader uses optimized Cython to generate its executables. Of course that will not be quite as fast as pure C, but close to it.
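For a flavour of what that looks like, here is a toy `.pyx` function with C-level static types (a hypothetical example, not code from backtrader or btgym); the `cdef` declarations let Cython emit a plain C loop instead of per-element Python object operations:

```cython
# example.pyx -- hypothetical toy kernel in Cython's typed syntax.
def dot(double[:] a, double[:] b):
    cdef double total = 0.0
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        total += a[i] * b[i]   # compiles to a plain C loop
    return total
```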

Since TensorFlow, backtrader, and OpenAI Gym all use some form of C-like language in the backend, it's OK for now. But the heavy lifting (training the model) needs to go on GPUs; the CPUs can't handle it even with multiprocessing.


developeralgo8888 commented 5 years ago

@Kismuz, once the Python code is written you can simply cythonize it, and it will be nearly as fast as pure C.

Cython code, unlike Python, must be compiled. This happens in three stages (a minimal build script is sketched after this list):

  1. ---- The Python .py files are adapted into Cython .pyx source files (often little more than a rename, plus optional static type annotations).
  2. ---- The .pyx files are then compiled by Cython into C-level .c files.
  3. ---- The .c files are compiled by a C compiler into .so files (or .pyd files on Windows).
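As promised above, here is a minimal build script driving stages 2 and 3 (a hypothetical sketch; `example.pyx` is the illustrative file name from the earlier comment, not a file in the btgym repository):

```python
# setup.py -- minimal, hypothetical Cython build script.
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="example",
    # cythonize() runs the .pyx -> .c stage; setuptools then invokes
    # the C compiler to produce the extension module.
    ext_modules=cythonize("example.pyx", language_level=3),
)
```

Running `python setup.py build_ext --inplace` then leaves an importable `example.so` (or `example.pyd` on Windows) next to the source.
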
Kismuz commented 5 years ago

@developeralgo8888 ,

Are you planning to reimplement it in Pure C

Actually I don't, at least not in the near future. I think of BTgym as a research-driven project, and that kind of optimisation is beyond my scope until some good core solutions are found; I believe the current performance limitations stem from the algorithmic and math side, not from low-level iteration speed.

Heavy Lifting ( training the model ) needs to go on GPU

No one underestimates GPU power; it would be nice, and the current BTgym algorithms framework could be adapted to the synchronous version known as A2C, see here: https://blog.openai.com/baselines-acktr-a2c/. Still, I don't plan any GPU optimisation simply because I don't have access to a decent GPU to run tests, debug, etc. All btgym code has been written and tested on an old i7 iMac, and I like it that way because it forces me to optimise the math instead of the threads :)
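For context, the synchronous A2C update mentioned above combines three standard terms per batch of rollouts; a schematic NumPy sketch (illustrative only, not btgym's actual A3C framework code):

```python
import numpy as np

def a2c_loss(log_probs, values, returns, entropy,
             value_coef=0.5, entropy_coef=0.01):
    """One synchronous A2C update target for a batch of rollouts.

    log_probs -- log pi(a_t | s_t) for the actions taken, shape (T,)
    values    -- critic estimates V(s_t), shape (T,)
    returns   -- discounted empirical returns R_t, shape (T,)
    entropy   -- mean policy entropy (encourages exploration)
    """
    advantages = returns - values                    # A_t = R_t - V(s_t)
    policy_loss = -np.mean(log_probs * advantages)   # actor: ascend the advantage
    value_loss = np.mean(advantages ** 2)            # critic: regress V to R
    return policy_loss + value_coef * value_loss - entropy_coef * entropy
```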