Kismuz / btgym

Scalable, event-driven, deep-learning-friendly backtesting library
https://kismuz.github.io/btgym/
GNU Lesser General Public License v3.0

How to configure the code to run with GPU? #26

Closed vincetom1980 closed 5 years ago

vincetom1980 commented 6 years ago

Hi Andrew, thanks for your great work! I wonder if I can run this code in my GPU environment. I've tried changing 'cpu:0' to 'gpu:0' in the code but got an error saying that the resource is not available.

Could you please tell me the correct way to do this?

Tom

tmorgan4 commented 6 years ago

By default, TensorFlow will grab all available memory on the GPU when the first process is created, and all subsequent processes will fail since no memory remains. This behavior can be changed with the 'allow_growth' option, which allows the memory for each process to expand dynamically as needed. This is covered in detail here: https://www.tensorflow.org/tutorials/using_gpu
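For reference, a minimal sketch of that option with the TF 1.x API (not btgym-specific, just a generic session config):

import tensorflow as tf

# Let each process allocate GPU memory on demand instead of grabbing it all up front.
config = tf.ConfigProto()
config.gpu_options.allow_growth = True

with tf.Session(config=config) as sess:
    pass  # build and run the graph as usual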

With that said, these asynchronous algorithms are not optimized for GPU and, in my experience, perform much worse when forced to run on a GPU. The A3C algorithm was released first, and a GPU-friendly version called A2C was released some time later. Something similar was done with PPO, as OpenAI has released a PPO2 algorithm optimized for GPU. These GPU-optimized algorithms trade async behavior for batching, which is where GPUs really shine.

It would be great to compare performance between different systems, but I've noticed the global_step/sec parameter Andrew has implemented in the TensorBoard monitor is greatly affected by many settings, making it difficult to compare. The best I have seen to date is around 1800 global_steps/sec using A3C with close-to-default settings on a dual Xeon 2669 workstation with 18 workers.

vincetom1980 commented 6 years ago

tmorgan4,

Thanks for your comments!

Tom

Kismuz commented 6 years ago

@vincetom1980, yes indeed, @tmorgan4's comment is right to the point: A3C is good for those who don't have access to cheap GPU resources (like me :). As for performance, here is a post from the A2C developers: https://blog.openai.com/baselines-acktr-a2c/ which can be summarised as 'it's better to run A2C on GPU than A3C on CPU'.

the global_step/sec parameter Andrew has implemented in the TensorBoard monitor is greatly affected by many settings, making it difficult to compare.

Yes, it's not a good metric, since here global_step is defined not as 'number of algorithm training iterations' but as 'number of environment steps made so far by all workers', and would better be named sampling step count or so. I found it more convenient for this particular task. Anyway, a train_global_step is easy to insert.

BTW, I have included an option to run several environments for each worker as a batch, like this:

cluster_config = dict(
    host='127.0.0.1',
    port=12230,
    num_workers=4,   # number of parallel worker processes
    num_ps=1,        # number of parameter server processes
    num_envs=4,      # number of environments run by each worker
    log_dir=os.path.expanduser('~/tmp/test_4_8'),
)
vincetom1980 commented 6 years ago

Andrew, thank you for your answer!

Tom

JaCoderX commented 5 years ago

The A3C algorithm was designed to work with 'workers' that run on CPU, so running the whole framework on GPU doesn't make a lot of sense.

But what about running specific parts on GPU?

I'm currently experimenting with conv_1d_casual_encoder using a large time_dim=4096. My problem is that, because it adds a lot more parameters to the model, every step's computation takes considerably more time.

So I was thinking maybe I can wrap only the encoder with tf.device('/gpu:0'): and force the encoder block to run on the GPU. This way everything would run on CPU except the convolution part, which is known to work very well on GPU.
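A minimal, hypothetical sketch of what I mean (TF 1.x API; the layer names and shapes are illustrative, not the actual btgym encoder code):

import tensorflow as tf

def encoder_on_gpu(x):
    # Pin only the convolution-heavy part of the graph to the GPU.
    with tf.device('/gpu:0'):
        h = tf.layers.conv1d(x, filters=32, kernel_size=2, dilation_rate=1,
                             padding='same', activation=tf.nn.relu)
        h = tf.layers.conv1d(h, filters=32, kernel_size=2, dilation_rate=2,
                             padding='same', activation=tf.nn.relu)
    # Everything else (LSTM, policy/value heads, losses) stays under the
    # default CPU placement; allow_soft_placement=True in the session config
    # helps when some ops have no GPU kernel.
    return h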

I've made the following changes to the code:

I ran a test using only 1 worker but I couldn't get it to work (error: CUDA is out of memory). I even get this error if I use with tf.device('/gpu:0'): on a simple operation inside the encoder.

The log TensorFlow generates shows there is an active GPU, and 'tensorflow-gpu' is the only version installed (and it works properly).

I'm having a hard time understanding the source of the problem.

Hopefully there is a solution, as CPU power alone is not enough to experiment with a large time_dim efficiently.

Kismuz commented 5 years ago

@JacobHanouna,

The A3C algorithm was designed to work with 'workers' that run on CPU, so running the whole framework on GPU doesn't make a lot of sense.

class BTgymMonitor()

...is deprecated and not related at all; do not use it. For the proper place to configure distributed TF device placement, see:

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/worker.py#L195

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/aac.py#L439

https://github.com/Kismuz/btgym/blob/master/btgym/algorithms/launcher/base.py#L264

Some explanations:

https://www.tensorflow.org/deploy/distributed

JaCoderX commented 5 years ago

Thanks for the guidance, @Kismuz, I will try to give it a go :)

JaCoderX commented 5 years ago

I've been reading both the code and the TensorFlow distributed docs for a couple of hours; not the easiest topic to follow.

This is what I understand so far:

I'm not sure how to modify the code so that I have another worker that is bound to the GPU and is not part of the A3C training.

I'm not looking for something pretty, just a way to use something like with tf.device("/job:worker/task:{}/gpu:0".format(task)): over the encoder block.

Kismuz commented 5 years ago

@JacobHanouna, that is correct, except it's essential to understand that it is the TensorFlow graph (or even a specific part of it) that gets assigned to a specific device, not a Python object or process (an instance of a worker, etc.);

In a nutshell, there is a replica of the graph assigned to each worker process and one replica held by the parameter server process; the latter receives trainable parameter updates from the workers' graphs (to be exact, it gets the computed gradients and applies them to its own variables following the optimiser rule); then each worker copies the updated variables into its own graph to work with.

That's a big topic indeed, with a lot of pitfalls, and I do recommend digging through GitHub for some well-written distributed code from the big players; there is no guarantee that, even if one correctly assigns the computation-heavy part of the graph ops to the GPU device, there will be no lock-ups due to worker concurrency; that's why A2C is more efficient here: it forces each worker to produce its own batch in a synchronous manner, concatenates everything batch-wise and sends it to the GPU in a single pass.
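For illustration only, a rough sketch of how such placement usually looks with the raw distributed TF 1.x API (not btgym's actual code; task is an assumed worker index): variables go to the parameter server, ops default to the worker, and a sub-block is pinned to that worker's GPU.

import tensorflow as tf

task = 0  # assumed worker index, normally supplied by the launcher

worker_device = "/job:worker/task:{}".format(task)

# Variables are placed on the parameter server; other ops default to this worker.
with tf.device(tf.train.replica_device_setter(
        ps_tasks=1,
        worker_device=worker_device)):

    x = tf.placeholder(tf.float32, [None, 4096, 1])

    # Pin only the convolution-heavy encoder to this worker's GPU.
    with tf.device(worker_device + "/gpu:0"):
        h = tf.layers.conv1d(x, filters=32, kernel_size=2, padding='same',
                             activation=tf.nn.relu)

    # Policy/value heads and gradient computation stay on the worker's CPU;
    # the applied gradients update the variables held by the parameter server.
    logits = tf.layers.dense(tf.layers.flatten(h), units=4)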

tmorgan4 commented 5 years ago

@JacobHanouna Making this work on GPU will require a fair amount of rework. You are most likely getting the 'cuda is out of memory' error because TensorFlow by default will grab all available memory on the device in the first session, so all other workers don't see any available memory.

You actually posted the solution above (from BTgymMonitor, which is not being used), where you need to specify 'config.gpu_options.allow_growth = True'. This will tell TensorFlow to allocate a small amount of memory to start and expand as needed. You can also specify a fraction of memory for each process to allocate, if that is more convenient. It's all covered in detail under 'Allowing GPU memory growth':
https://www.tensorflow.org/guide/using_gpu
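A minimal sketch of the memory-fraction alternative (TF 1.x API; the 0.2 value is just an assumed example, not a recommendation):

import tensorflow as tf

# Cap each process at a fixed share of GPU memory so several workers can
# coexist on one device (an alternative to allow_growth).
config = tf.ConfigProto()
config.gpu_options.per_process_gpu_memory_fraction = 0.2

sess = tf.Session(config=config)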

As an aside, I have been digging through @Kismuz's code for a long time (a year?) and am just finally understanding how certain parts work together. Andrew has done an extraordinary job, especially considering he's done nearly all of it himself.

JaCoderX commented 5 years ago

@tmorgan4, @Kismuz Thank you both for your replies.

I think for now I'll stick to CPU :)