Tensorflow probability very slow on GPU (even with XLA) for some models

roblem commented 4 years ago

This is an update to issue #893 which was closed because XLA compile was failing making apples to apples comparisons impossible. XLA compile was fixed with #908 and I include updated benchmarks here. Therefore this issue supersedes #893 and all benchmarks there should be ignored.

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes (links below)
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04 using Docker for tf 2.2 release candidate 2 (rocm/tensorflow:rocm3.3-tf2.2-rc2-dev)
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
TensorFlow installed from (source or binary):
TensorFlow version (use command below): In [2]: '2.2.0-rc2': 'v2.2.0-rc2-1-ga7e18b56c6'
Python version: 3.6.9
Bazel version (if compiling from source): N/A
GCC/Compiler version (if compiling from source): N/A
CUDA/cuDNN version: ROCM
GPU model and memory: Radeon VII
All benchmarks are run using rocm docker images using the upstream linux drivers.

Describe the current behavior I have run some benchmarks under both ROCM (on Radeon VII) and Cuda (on P100 and V100) stacks. For the most part, ROCM is very much on par with Cuda but in one example is failing fairly spectacularly in terms of runtimes (but does eventually return a reasonable result).

Describe the expected behavior I would expect similar runtimes across all models considered here.

Standalone code to reproduce the issue Provide a reproducible test case that is the bare minimum necessary to generate the problem. If possible, please share a link to Colab/Jupyter/any notebook.

Other info / logs Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached. rocm.log The biggest difference I see in comparing the CUDA and ROCM logs is the repeated warning:

2020-04-22 09:51:02.885876: W ./tensorflow/compiler/xla/service/hlo_pass_fix.h:49] Unexpectedly high number of iterations in HLO passes, exiting fixed point loop.

I don't see this warning on the CUDA stack. I see this warning in all cases including where ROCM compares favorably to CUDA in terms of timing. Also, the code runs without modification on both CUDA and ROCM.

Timing Benchmarks

Linear Regression

I have run two benchmarks using tensorflow probability's monte carlo markov chain library. The first uses simple linear regression with large numbers of parameters and I time a few hundred samples using both random walk metropolis hastings (MHRW) and the no-turn sampler (NUTS) for generating proposals. Nuts is a gradient-based sampler whereas MHRW is not. I also include timings for the likelihood function calculation. Columns are the same code run over different software stacks/hardware. Here are timings:

ID	Description	ROCM Radeon VII	CUDA P100	Cuda V100
1	Function (CPU) [ms]	4.90	4.79	4.37
2	Function (GPU) [ms]	.45	9.71	9.61
3	MHRW Samples (CPU)	13.05	12.23	16.40
4	MHRW Samples (GPU)	2.52	.65	2.42
5	NUTS Samples (CPU)	431.02	249.73	463.42
6	NUTS Samples (GPU)	18.78	21.31	13.29

These timings are single runs using differences in time.time(). We see that the Radeon VII holds its own here against CUDA and in some cases is much faster. The one most relevant for most researchers would be the final row (NUTS Samples (GPU)). This shows that ROCM on Radeon VII falls somewhere between a P100 and V100.

The code for these is here for MHRW and here for NUTS.

Two other models

The next set of benchmarks are for two models having many more parameters. Both are implemented using custom likelihood functions. Both models are similar to softmax with the difference being Model 2 has many many more parameters. We see that ROCM on GPU performs very well for Model 1 (faster than either CUDA stack), but is spectacularly slower in Model 2. In Model 2, the CUDA stack shows an approximate 6x (3.5x) speedup on GPU relative to CPU on the V100(P100), but the ROCM stack is substantially slower on GPU to the point that you are better off running on CPU. For each model, I run a function calculation over the Log-Likelihood which is based solely on tensorflow ops. These functions are used when sampling.

ID	Description	ROCM Radeon VII	CUDA P100	Cuda V100
7	Function Model 1(CPU) [ms]	.553	.922	.664
8	Function Model 1 (GPU) [ms]	.465	1.340	2.718
9	Model 1 NUTS Samples (CPU)	3.69	8.25	6.72
10	Model 1 NUTS Samples (GPU)	2.68	3.87	2.94
11	Function Model 2(CPU) [ms]	3.54	3.80
12	Function Model 2 (GPU) [ms]	1.15	1.81
13	Model 2 NUTS Samples (CPU)	72.08	37.05	46.71
14	Model 2 NUTS Samples (GPU)	1321.29	9.59	8.57
15	Model 2 MHRW Samples (CPU)	2.42	31.69
16	Model 2 MHRW Samples (GPU)	1.88	8.17

I am hesitant to publicly share this code for this example since it is from active ongoing research that isn't published yet. I would be willing to share privately or could follow tips to debug why CUDA is 150x faster than ROCM for Model 2 on GPU. For some of these, I've been unable to run on both the P100 and V100.

jerryyin commented 4 years ago

@roblem Appreciate the efforts in coming up with benchmarks and share us with detailed results. This is to confirm that I can re-produce the Unexpectedly high number of iterations in HLO passes, exiting fixed point loop.

I can try to root cause it, but without the source code I can't guarantee to pin-point to fix the exact issues in your model 2. It is much easier for me to start the triage once you have isolated the issue to a narrower scope. In order to understand what has caused this warning message to show up, we would typically do an exclude on each op being used. For example, on your shared code. I can see the following operator being used:

tf.linalg.matvec
tf.math.log
tf.math.reduce_sum
tfp.mcmc.random_walk_normal_fn(); Of which used tf.random.normal
tfp.mcmc.sample_chain(); Of which used tf.cast, tf.while_loop

I wonder if you could do try to swap/remove any of the ops above, on a case by case manner just to look at how the runtime changes (Do worrying about the correctness temporarily). You will likely to be able to observe the biggest runtime drop after removing one ops of the above list. Let me know if this works/ you can isolate it to the specific op. Looking at the list above, I think it is less likely to be related with tf.linalg.matvec. I would prioritize apply the technique to tf.math.reduce_sum and tfp.mcmc.sample_chain().

roblem commented 4 years ago

The warning Unexpectedly high number of iterations in HLO passes, exiting fixed point loop. is likely irrelevant as every benchmark I ran under ROCM had this warning, even for cases where ROCM outperformed CUDA. Perhaps it has more of an impact speed-wise for some combinations of tensor flow ops.

To start triaging, here is a list of functions for each benchmark result submitted above, I have relabelled each row in the table to make this easier to compare with times above. Each row in the timings above are now in columns. Note that tfp ops are from tensorflow probability.

Op	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16
`tf.linalg.matvec`	x	x	x	x	x	x	x	x	x	x	x	x	x	x	x	x
`tf.math.log`	x	x	x	x	x	x	x	x	x	x	x	x	x	x	x	x
`tf.math.reduce_sum`	x	x	x	x	x	x	x	x	x	x	x	x	x	x	x	x
`tf.reshape`			x	x	x	x	x	x	x	x	x	x
`tf.math.exp`							x	x	x	x	x	x	x	x	x	x
`tf.multiply`			x	x	x	x	x	x	x	x	x	x
`tf.divide`			x	x	x	x	x	x	x	x	x	x
`tf.math.bincount`				x	x	x	x	x	x	x	x	x	x
`tf.math.reduce_max`						x	x	x	x	x	x
`tf.math.linalg.norm`						x	x	x	x	x	x
`tf.cond`						x	x	x	x	x	x
`tf.greater_equal`						x	x	x	x	x	x
`tf.equal`						x	x	x	x	x	x
`tf.transpose`						x	x	x	x	x	x
`tf.scatter_nd`						x	x	x	x	x	x
`tf.gather_nd`						x	x	x	x	x	x
`tfp.distributions.Normal().log_prob()`					x	x	x	x	x	x
`tfp.mcmc.random_walk_normal_fn`			x	x				x	x
`tfp.mcmc.RandomWalkMetropolis`			x	x				x	x
`tfp.mcmc.NoUTurnSampler`					x	x	x	x	x	x
`tfp.mcmc.DualAveragingStepSizeAdaptation`					x	x	x	x	x	x
`tfp.mcmc.sample_chain`			x	x	x	x	x	x	x	x	x	x

Some thoughts on where to start triaging:

Tensorflow ops appearing in all benchmarks are probably not the problem.
The tensorflow probability functions in and of themselves are not the problem since there are examples of NUTS and MHRW sampling that can be run quickly on the ROCM GPU (for example see results sets 4, 6, and 10)
The tensorflow ops used in Model 2 (see columns 11 and 12 for benchmark results) for the custom function used by the tensorflow probability sampling routines are not the problem since the function calculation in Column 12 is faster on the Radeon VII than the P100
The issue occurs when the function in Model 2 and tf.mcmc.NoUTurnSampler are used together. Perhaps this is due to some type of optimization in CUDA for difficult sampling problems since Model 2 has thousands of parameters whereas Model 1 only has a few.
I am going to try Models 1 and 2 with the tfp.mcmc.RandomWalkMetropolis sampling method. This has a much simpler sampling mechanism and add it to the benchmarks. If this runs fast on ROCM, then it would seem tf.mcmc.NoUTurnSampler is the culprit.

roblem commented 4 years ago

I have added two additional rows to the benchmarks table (15 and 16) and shown the functions used in the previous comment. Model 2 with Random Walk Metropolis runs very fast on the Radeon VII. This is consistent with the fast run times of the tensorflow function execution time (12). Judging by this, the problem seems to be in either tfp.mcmc.NoUTurnSampler or tfp.mcmc.DualAveragingStepSizeAdaptation.

roblem commented 4 years ago

Focusing on Model 2 here is a summary of times with different step methods (these are all based on timings using GPU only (I include ID for matching to things above where there is a match). All sampling and adaptive step methods are tensorflow probability ops (e.g. tfp.mcmc.NoUTurnSampler).

ID	Sampling Method	Adaptive Step Method	ROCM Radeon VII	CUDA P100
	`NoUTurnSampler`	None	37.39	6.62
	`NoUTurnSampler`	`SimpleStepSizeAdaptation`	87.81	6.70
14	`NoUTurnSampler`	`DualAveragingStepSizeAdaptation`	1476.71	8.35
16	`RandomWalkMetropolis`	None	1.88	2.97

Conclusions:

NoUTurnSampler without any adaptive step method leads to around a 6x slowdown compared to CUDA for same model
NoUTurnSampler with either dual averaging or simple step size adaptation shows considerably more slowdown compared to CUDA. With CUDA, these add at most around a 25% slowdown whereas with ROCM the slowdown is very large when both NUTS and Dual Averaging is used.
ROCM is not inherently slower under tensorflow probability since Random-Walk Metropolis with no adaptive step method is around 1.5x slower under CUDA than ROCM.

It is important to note that ALL of these timings use the same tensorflow ops and that function calculations using these ops are faster under ROCM than CUDA- as was demonstrated by comparing runtimes from prior benchmarks (ID 12 ).

jerryyin commented 4 years ago

Thanks for the effort. We discussed offline and confirmed that non-XLA mode also suffers in Model2.

I rerun the test_nuts.py provided and did quick sanity check on the XLA warning as well as device placement but nothing stand out. I'm afraid two things are going on here:

Need to understand why ROCm XLA produce the warning "Unexpectedly high number of iterations...". Under the hood this seems to be indicating the pass always changes the hlo code and never converges. I have recorded that HloPassFix has been running on following passes: simplification, algsimp, fusion. Some improper implementation should have affected the ROCm platoform. However, this should only affect compilation behavior, not runtime behavior. (Maybe functional behavior if the optimization is done wrong.)
The performance issue, which according to @roblem narrowed down to NoUTurnSampler. By running with test_nuts.py I can confirm that the GPU (non-XLA) is running slower than CPU too. That gives me a baseline script to test with. Further work needs to be done to isolate to the specific kernel that has dragged down the performance. roctracer may be a good candidate for doing this.

roblem commented 4 years ago

Hi,

I just reran test_nuts.py on a standard tensorflow install (no rocm or cuda) and ran on cpu. It is also showing ./tensorflow/compiler/xla/service/hlo_pass_fix.h:49] Unexpectedly high number of iterations in HLO passes, exiting fixed point loop warnings. So I'm not sure the warning is necessarily the problem as it seems to always appear when running this code.

jerryyin commented 4 years ago

Thanks for your comments. Yes, I also tend to think the performance issue is orthogonal to the warning message. More triage needs to be done, and I will keep the issue open in the mean time.

ROCm / tensorflow-upstream