CamDavidsonPilon / Probabilistic-Programming-and-Bayesian-Methods-for-Hackers

aka "Bayesian Methods for Hackers": An introduction to Bayesian methods + probabilistic programming with a computation/understanding-first, mathematics-second point of view. All in pure Python ;)
http://camdavidsonpilon.github.io/Probabilistic-Programming-and-Bayesian-Methods-for-Hackers/
MIT License

Chapter 1 | sampling from the chain | TFP version #430

Pindar777 opened this issue 5 years ago

Pindar777 commented 5 years ago

Hi there,

on my system with tf-gpu, this code causes trouble:

```python
[
    lambda_1_samples,
    lambda_2_samples,
    posterior_tau,
], kernel_results = tfp.mcmc.sample_chain(
    num_results=1000,
    num_burnin_steps=100,
    current_state=initial_chain_state,
    kernel=tfp.mcmc.TransformedTransitionKernel(
        inner_kernel=tfp.mcmc.HamiltonianMonteCarlo(
            target_log_prob_fn=unnormalized_log_posterior,
            num_leapfrog_steps=2,
            step_size=step_size,
            step_size_update_fn=tfp.mcmc.make_simple_step_size_update_policy(),
            state_gradients_are_stopped=True),
        bijector=unconstraining_bijectors))
```

It seems TFP 0.5 does not declare GPU support in its OpKernel registrations. Hence tf-gpu starts working on the GPU and then copies the data back to the CPU. As a result, running the code as-is on tf-gpu is about 5× slower than running it solely on the CPU (with eager execution enabled).

My fix is to wrap the sampling code in the following device context beforehand:

```python
with tf.device('/cpu:0'):
```
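A minimal sketch of what that pinning looks like in practice (TF 2.x eager spelling shown for brevity; in the TF 1.x notebooks you would wrap the graph-building code, including the `tfp.mcmc.sample_chain` call, in the same context manager):

```python
import tensorflow as tf

# Pin op placement to the CPU: everything created inside this context
# runs on /cpu:0, avoiding GPU<->CPU copies for ops without GPU kernels.
with tf.device('/cpu:0'):
    x = tf.random.normal([1000])
    total = tf.reduce_sum(x)

print(total.device)  # e.g. '/job:localhost/replica:0/task:0/device:CPU:0'
```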

CamDavidsonPilon commented 5 years ago

cc @matthew-mcateer

matthew-mcateer commented 5 years ago

@Pindar777 Hmmm, thanks for letting me know. I'll take a look at this. Is there any kind of stack trace you get with this line?

Pindar777 commented 5 years ago

@matthew-mcateer Actually, there was no message at all, apart from the error when using `with tf.device('/gpu:0'):`. Let me know if there is a way to produce a stack trace or other output to help you debug the issue.

I just tried the "A/B testing" example in Chp 2, and `tfp.mcmc.sample_chain` runs 4.2 times slower on my system compared to assigning the CPU device. The other way around would be great :-)

Pindar777 commented 5 years ago

@matthew-mcateer FYI

There is a significant difference in speed when running TFP in session (graph) mode vs. eager mode. The example is the "A/B testing" example from Chp 2 on my machine (Windows 10, TF 1.12, TFP 0.5, ...). I used the `process_time()` function (it seems to me that the time elapsed in eager mode is actually even less than reported). As far as I understood the eager-execution feature, it should not be considerably slower than using graphs.

Time spans for "A/B testing":

| obs | with eager | without eager |
|---|---|---|
| 1000 | 0.2 min | 0.1 min |
| 10000 | 2.5 min | 0.4 min |
| 25000 | 6.1 min | 1.1 min |
| 50000 | >6.1 min ;-) | 2.2 min |
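The timing methodology above can be reproduced with the standard library alone; the sampler below is a hypothetical pure-Python stand-in (a toy 1-D random-walk Metropolis step, not the chapter's TFP chain), and only the `process_time()` bookkeeping mirrors what was measured:

```python
import math
import random
import time

def toy_metropolis(n_steps, log_prob):
    """Stand-in sampler (hypothetical): 1-D random-walk Metropolis."""
    x = 0.0
    lp = log_prob(x)
    samples = []
    for _ in range(n_steps):
        proposal = x + random.gauss(0.0, 1.0)
        lp_prop = log_prob(proposal)
        # Accept with probability min(1, exp(lp_prop - lp)).
        if lp_prop >= lp or random.random() < math.exp(lp_prop - lp):
            x, lp = proposal, lp_prop
        samples.append(x)
    return samples

start = time.process_time()  # CPU time, as used for the table above
samples = toy_metropolis(10_000, lambda x: -0.5 * x * x)  # standard-normal target
elapsed = time.process_time() - start
print(f"{len(samples)} draws in {elapsed:.3f} s of CPU time")
```

`process_time()` counts CPU time only, so it can under-report wall-clock time when work is offloaded to the GPU, which may explain the "even less than reported" caveat above.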

Pindar777 commented 5 years ago

Some more descriptive statistics on resource usage:

  1. with tf.device('cpu:0'): 80-90% CPU usage and GPU usage around 30%
  2. without declaring devices: 50% CPU and the same GPU usage around 30%
Pindar777 commented 5 years ago

Hi @matthew-mcateer, do you already have an idea for "CPU-GPU-issue"? Cheers

Pindar777 commented 5 years ago

Hi @matthew-mcateer, is there news? Thanks in advance

Pindar777 commented 5 years ago

@matthew-mcateer I just updated to TF 1.13.1 with CUDA 10, cuDNN 7.5, and tf-probability 0.6. Unfortunately, it is still true that GPU-mode sampling is much slower than CPU mode (increasingly so with sample size).

The GPU-usage stats from `tfp.mcmc.sample_chain` and from a plain `matmul`, in GPU and CPU mode, are completely different, indicating the GPU is not used properly by tf-probability.

I guess the reason is documented in these lines:

```
InvalidArgumentError (see above for traceback): Cannot assign a device for operation mcmc_sample_chain/scan/TensorArray_9:
Could not satisfy explicit device specification '' because the node
  node mcmc_sample_chain/scan/TensorArray_9
  (defined at ...\Python\Python36\site-packages\tensorflow_probability\python\mcmc\sample.py:245)
  having device No device assignments were active during op 'mcmc_sample_chain/scan/TensorArray_9' creation.
was colocated with a group of nodes that required incompatible device '/device:GPU:0'
Colocation Debug Info:
Colocation group had the following types and devices:
Range: GPU CPU
TensorArrayV3: CPU
Enter: CPU
Exit: GPU CPU
TensorArraySizeV3: CPU
TensorArrayWriteV3: CPU
Const: GPU CPU
TensorArrayGatherV3: CPU
```

MCMC sampling is only fast when specifying CPU usage.
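One standard workaround for colocation errors like the one above is to let TF silently fall back to the CPU for ops that have no GPU kernel, via `allow_soft_placement`. A sketch (using the `tf.compat.v1` spelling so it also runs under TF 2.x; on TF 1.12 it is plain `tf.ConfigProto` / `tf.Session`):

```python
import tensorflow as tf

# allow_soft_placement lets ops without a GPU kernel (TensorArrayV3 etc.)
# fall back to the CPU instead of raising InvalidArgumentError;
# log_device_placement=True prints where each op actually lands.
config = tf.compat.v1.ConfigProto(
    allow_soft_placement=True,
    log_device_placement=False,  # flip to True to inspect placement
)
sess = tf.compat.v1.Session(config=config)
sess.close()
```

Note this only avoids the error; it does not make the CPU-bound `TensorArray` ops any faster.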

Do you have a working example where tf-probability-gpu usage is faster?

Pindar777 commented 5 years ago

@matthew-mcateer An update on the issue: after switching to TF 2.0 nightly and TFP nightly, the behavior changed:

1500 iterations:

- without device specification: 3.1 min process time
- GPU: 1.7 min process time
- CPU: 1.1 min process time

bluesky314 commented 5 years ago

I am using Google Colab. How do you activate GPU usage for MCMC sampling? I am finding it very slow.

And yes, building graphs is supposed to be faster than eager execution. How did you convert the code of the Ch2 example to graph mode? Can you please share your code?
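For reference, in TF 1.x graph mode is simply the default: the chapter notebooks run eagerly only because they call `tf.enable_eager_execution()` at the top, so omitting that call leaves you in graph mode. A minimal sketch of the graph-then-session pattern (the `tf.compat.v1` spelling is used so it also runs under TF 2.x; the tiny computation here is a placeholder for the chapter's sampling code):

```python
import tensorflow as tf

# Build the computation as a graph first...
graph = tf.Graph()
with graph.as_default():
    x = tf.compat.v1.placeholder(tf.float32, shape=[None])
    total = tf.reduce_sum(x * x)

# ...then execute it in a session, feeding in concrete values.
with tf.compat.v1.Session(graph=graph) as sess:
    result = sess.run(total, feed_dict={x: [1.0, 2.0, 3.0]})

print(result)  # 14.0
```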