ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Conv3D: gemm consuming large amount of memory causing TF to crash #619

Open kk325ic opened 5 years ago

kk325ic commented 5 years ago


System information


Describe the current behavior
When executing Conv3D, it crashes.

Describe the expected behavior

Code to reproduce the issue

Other info / logs

2 root error(s) found.
  (0) Not found: Failed to find conv algorithm!
     [[{{node conv3d_21/convolution}}]]
     [[metrics_1/categorical_accuracy/Mean/_3219]]
  (1) Not found: Failed to find conv algorithm!
     [[{{node conv3d_21/convolution}}]]
0 successful operations. 0 derived errors ignored.

Traceback (most recent call last):
  File "", line 54, in <module>
  File "/usr/local/lib/python3.6/dist-packages/keras/legacy/interfaces.py", line 91, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1418, in fit_generator
    initial_epoch=initial_epoch)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training_generator.py", line 217, in fit_generator
    class_weight=class_weight)
  File "/usr/local/lib/python3.6/dist-packages/keras/engine/training.py", line 1217, in train_on_batch
    outputs = self.train_function(ins)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2715, in __call__
    return self._call(inputs)
  File "/usr/local/lib/python3.6/dist-packages/keras/backend/tensorflow_backend.py", line 2675, in _call
    fetched = self._callable_fn(*array_vals)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/client/session.py", line 1458, in __call__
    run_metadata_ptr)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Failed to find conv algorithm!
     [[{{node conv3d_21/convolution}}]]
     [[metrics_1/categorical_accuracy/Mean/_3219]]
  (1) Not found: Failed to find conv algorithm!
     [[{{node conv3d_21/convolution}}]]
0 successful operations. 0 derived errors ignored.

ekuznetsov139 commented 5 years ago

What code was used to produce the problem? Are there any other warnings or error messages in the output, particularly anything mentioning "miopen"?

whchung commented 5 years ago

@kk325ic could you provide a minimally reproducible script?

sunway513 commented 5 years ago

@kk325ic , can you provide the logs with the following environment variable? export MIOPEN_ENABLE_LOGGING_CMD=1
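For reference, a minimal sketch of setting that variable from inside the Python script instead of the shell (an assumption on my part that this is equivalent; setting it before importing TensorFlow is the safe option so the library already sees it when it loads):

```python
import os

# Ask MIOpen to log the equivalent MIOpenDriver command for each convolution call.
os.environ["MIOPEN_ENABLE_LOGGING_CMD"] = "1"

import tensorflow as tf  # import only after the variable is set
```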

kk325ic commented 5 years ago

@kk325ic , can you provide the logs with the following environment variable? export MIOPEN_ENABLE_LOGGING_CMD=1

Please see the attached log file.

a.log

jerryyin commented 5 years ago

I have verified that the config works using MIOpenDriver. Judging from the log, this is one of those typical failures (#496 is about the same thing) where the conv3d gemm consumes a huge amount of memory and causes TF to crash. The log shows two convs with the same config, each consuming 3.88 GB. The workaround is to run the same model on a GPU with more memory. @kk325ic

kk325ic commented 5 years ago

Unfortunately, I don't have any AMD card with more than 8 GB of memory. Otherwise, I would be happy to test your workaround. We are happy to see AMD catching up with Nvidia in ML. Hopefully you guys can finish the last few touches to make it a dependable solution.

ekuznetsov139 commented 5 years ago

Actually I don't think that conv3d gemm is the culprit here. Its memory requirements are fairly modest. Running that MIOpenDriver command standalone with API tracing enabled, I see it allocating only about 20 MB of GPU memory.

It is possible that the tensorflow graph itself needs more memory than is available on the system. The 3.88 GB allocation is unrelated to conv3d; it just happens to occur shortly after the initialization of miopen.

It could also be an interference problem: normally tensorflow is in charge of memory allocation and assumes that a certain amount of GPU memory is available, and then miopen goes and takes some of that memory out from under it.

A minimal script illustrating the problem would be necessary to diagnose the root cause.

@kk325ic , you could try to delete ~/.config/miopen and try to rerun, it probably won't help but there's a chance that it will. Also, if you're using dynamic memory allocation (by setting tf.ConfigProto().gpu_options.allow_growth to true), try to turn it off.
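For reference, a minimal sketch of the second suggestion with standalone Keras on the TF 1.x backend (the exact session setup will depend on how the model is being driven):

```python
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
# Leave allow_growth at its default (False) so TF pre-allocates the GPU memory
# pool up front instead of growing it on demand.
config.gpu_options.allow_growth = False
K.set_session(tf.Session(config=config))
```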

kk325ic commented 5 years ago

@kk325ic , you could try to delete ~/.config/miopen and try to rerun, it probably won't help but there's a chance that it will. Also, if you're using dynamic memory allocation (by setting tf.ConfigProto().gpu_options.allow_growth to true), try to turn it off.

@ekuznetsov139 Here is a simplified piece of code to show the problem. Unfortunately, I cannot provide any data for you to test. test.txt

jerryyin commented 5 years ago

@ekuznetsov139

I see it allocating only about 20 MB of GPU memory.

Which stage are you talking about? How much workspace does it need for the find call, and how much does it allocate in the actual computation? I suspect the 20 MB is what MIOpenDriver allocates, not the actual algorithm.

It is possible that the tensorflow graph itself needs more memory than available in the system.

This statement does not make sense to me. I can hardly imagine a pure computation graph taking 3.88 GB after being serialized.

ekuznetsov139 commented 5 years ago

Which stage are you talking about? How much workspace does it need for the find call, and how much does it allocate in the actual computation? I suspect the 20 MB is what MIOpenDriver allocates, not the actual algorithm.

20 MB is the sum total of all hipMalloc calls made during the execution of "./bin/MIOpenDriver conv -n 1 -c 1 -H 512 -W 512 -k 4 -y 3 -x 3 -p 1 -q 1 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1".

This statement does not make sense to me. I can hardly imagine a pure computation graph, after serialized, takes 3.88 GB.

You probably never tried to use https://github.com/tensorflow/tensor2tensor then. I sometimes get OOMs with it on an 11 GB card.

ekuznetsov139 commented 5 years ago

Okay, looks like I was wrong. This is indeed the case of gemm consuming a large amount of memory (MIOpenDriver call was not representative).

First thing the graph does is call Conv3D on image size (128,288,416), 1 channel, kernel size 3x3x3, 2 filters. MIOpen proposes to use MIOpenGEMM for this convolution, which requires 128*288*416*3*3*3*4 bytes = 1.65 gigabytes of scratch memory. Then a bit later it does the same with image size (128,288,416), 2 channels, and that requires 3.3 gigabytes of scratch memory. And that's where it fails. The graph does use a lot of memory all on its own (running it on an NVIDIA GPU, I see it use 4.5 GB), so the extra 3+ GB of scratch memory requested by MIOpen is the difference between passing and failing.

Using MIOpenGEMM can't be correct or optimal in this case; I suspect we're hitting a fallback path because the appropriate convolution kernel has not been implemented.
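For reference, a small sketch of the scratch-memory estimate above, assuming an im2col-style GEMM lowering (which is what the numbers imply); the helper name is just for illustration:

```python
def im2col_scratch_bytes(d, h, w, in_channels, kd, kh, kw, dtype_bytes=4):
    """Scratch buffer for an im2col-style GEMM convolution (stride 1,
    'same' padding): one column per output voxel, each column holding
    in_channels * kd * kh * kw elements of dtype_bytes each."""
    return d * h * w * in_channels * kd * kh * kw * dtype_bytes

# Matches the figures above: ~1.65 GB with 1 input channel, ~3.3 GB with 2.
print(im2col_scratch_bytes(128, 288, 416, 1, 3, 3, 3) / 1e9)  # ~1.66
print(im2col_scratch_bytes(128, 288, 416, 2, 3, 3, 3) / 1e9)  # ~3.31
```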

jerryyin commented 5 years ago

@ekuznetsov139 Thanks for carrying out the study, I think we are mostly on the same page now. The fallback behavior aligns with Daniel's comments.

I'd like to clarify the graph's memory consumption a bit further. In TF, the memory consumed by the graph itself and the peak memory consumption are two different concepts.

When you mention the graph consuming 4.5 GB, it must be at a certain point of the graph execution that you measured 4.5 GB (I'd be curious how you did the measurement), which is closer to the concept of dynamic/peak consumption.

Why it consumed such a large amount of memory just before the OOM happened is another topic. Usually it is related to TF trying to parallelize the execution of multiple kernels. So in this case the un-released 4.x GB may very well be the result of other convolutions (forward or backward) that were also running and whose memory had not been released yet. The tricky part of all this is that we have no control over TF's dynamic behavior, nor over MIOpen's release schedule, hence my suggestion to switch to a card with more memory.
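For reference, one way to check peak device memory in a TF 1.x build, assuming tf.contrib.memory_stats is available; this is just a sketch, not necessarily how the 4.5 GB figure above was obtained:

```python
import tensorflow as tf
from tensorflow.contrib.memory_stats import MaxBytesInUse

# ... build the graph and run some training steps first ...
with tf.Session() as sess:
    peak = sess.run(MaxBytesInUse())  # peak bytes allocated on the GPU so far
    print("peak device memory: %.2f GB" % (peak / 1e9))
```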

ekuznetsov139 commented 5 years ago

During training, TF typically needs to store all tensors in the pipeline at the same time. If the graph is A->B->C->D->E, the result of A will still be kept in memory when E is being computed, because it will be needed during backpropagation through B and A. So that gives us the lower bound on memory requirements (before the internal overhead of MIOpen, etc.). It obviously wants to keep them all inside the GPU, though I think there may be some ROCm-specific code in there to let it offload tensors to the RAM temporarily (not totally clear on that).

Here we have a graph that starts with 128*288*416*4 = 60 MB per layer and quickly jumps to 120 MB per layer. It all adds up pretty quickly.

Switching to a card with more memory might work in this particular case, but we're two steps from being unable to run on any card. Change that first Conv3D to (256,256,512), kernel (4,4,4), 2 filters, and MIOpen is going to want 17 GB of scratch memory. IMO, falling back to MIOpenGEMM is a crutch, and a pretty bad one at that. If a proper 3D conv kernel is not available, it should be possible to emulate it with 2D convs without incurring these massive scratch memory requirements.
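For illustration only, a minimal NumPy/SciPy sketch of that idea (single input channel, stride 1, 'valid' padding assumed; not how MIOpen would implement it): summing per-depth-slice 2D convolutions keeps the peak scratch at one 2D slice rather than a full 3D im2col buffer.

```python
import numpy as np
from scipy.signal import correlate2d

def conv3d_via_conv2d(vol, kern):
    """Emulate a stride-1, 'valid', single-channel 3D cross-correlation
    (what a conv layer computes) by summing 2D cross-correlations of
    depth slices. vol: (D, H, W), kern: (kd, kh, kw)."""
    D, H, W = vol.shape
    kd, kh, kw = kern.shape
    out = np.zeros((D - kd + 1, H - kh + 1, W - kw + 1), dtype=vol.dtype)
    for d in range(out.shape[0]):
        for i in range(kd):
            # One 2D slice at a time; no 3D im2col buffer is ever materialized.
            out[d] += correlate2d(vol[d + i], kern[i], mode="valid")
    return out
```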

jerryyin commented 5 years ago

During training, TF typically needs to store all tensors in the pipeline at the same time. If the graph is A->B->C->D->E, the result of A will still be kept in memory when E is being computed...Here we have a graph that starts with 128*288*416*4 = 60 MB per layer and quickly jumps to 120 MB per layer.

Agreed with all of this. Tensors are stored until the backward pass is done.

Switching to a card with more memory might work in this particular case, but we're two steps from being unable to run on any card. Change that first Conv3D to (256,256,512), kernel (4,4,4), 2 filters, and MIOpen is going to want 17 GB of scratch memory....If a proper 3D conv kernel is not available, it should be possible to emulate it with 2D convs without incurring these massive scratch memory requirements.

Good point. @daniellowell: Any comments on emulating conv3D with conv2D? Is that an approach you guys are looking into?