ROCm / tensorflow-upstream

TensorFlow ROCm port

Conv3D: gemm consuming large amount of memory causing TF to crash #496

Closed: nuu9323226 closed this issue 4 years ago

nuu9323226 commented 5 years ago

I'm running a 3D convolutional network and hit the following problem. Has anyone else encountered it?

Predict volume shape: (24, 31, 31)
2019-06-06 16:58:04.595653: F tensorflow/stream_executor/rocm/rocm_dnn.cc:494] miopen only supports 4D tensors, dim=5 not allowed

GPU: Radeon VII
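For reference, a minimal sketch (hypothetical shapes, not the actual training script) of the kind of op that hits this path; it assumes a TF 1.x build:

```python
# Minimal sketch (hypothetical shapes, not the original training script) of a
# single 3D convolution; on builds without MIOpen conv3d support it aborts with
# "miopen only supports 4D tensors, dim=5 not allowed". Assumes TF 1.x.
import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=(1, 16, 32, 32, 1))  # NDHWC 5-D input
w = tf.get_variable("w", shape=(3, 3, 3, 1, 64))          # 3x3x3 kernel, 1 -> 64 channels
y = tf.nn.conv3d(x, w, strides=[1, 1, 1, 1, 1], padding="SAME")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    out = sess.run(y, feed_dict={x: np.zeros((1, 16, 32, 32, 1), np.float32)})
    print(out.shape)  # (1, 16, 32, 32, 64)
```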

sunway513 commented 5 years ago

@nuu9323226 Could you provide the steps to reproduce your issue? 3D convolution support was added in the following PR: https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/pull/381
Please use the following docker image to preview the support and let us know whether it works: rocm/tensorflow:rocm2.4-tf1.14-python3-preview

CC @jerryyin

nuu9323226 commented 5 years ago

@sunway513
I installed Docker and ran rocm/tensorflow:rocm2.4-tf1.14-python3-preview, but it still fails:


Using TensorFlow backend.
WARNING: Logging before flag parsing goes to stderr.
W0611 09:44:09.838078 139787293587200 deprecation_wrapper.py:119] From step2_train_nodule_detector.py:27: The name tf.ConfigProto is deprecated. Please use tf.compat.v1.ConfigProto instead.

W0611 09:44:09.838232 139787293587200 deprecation_wrapper.py:119] From step2_train_nodule_detector.py:29: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.

2019-06-11 09:44:09.869179: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2019-06-11 09:44:09.870005: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x74b20d0 executing computations on platform Host. Devices:
2019-06-11 09:44:09.870022: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): ,
2019-06-11 09:44:09.871167: I tensorflow/stream_executor/platform/default/dso_loader.cc:43] Successfully opened dynamic library libhip_hcc.so
2019-06-11 09:44:09.871574: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7514d80 executing computations on platform ROCM. Devices:
2019-06-11 09:44:09.871583: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (0): Device 66af, AMDGPU ISA version: gfx906
2019-06-11 09:44:09.871588: I tensorflow/compiler/xla/service/service.cc:175] StreamExecutor device (1): Device 66af, AMDGPU ISA version: gfx906
2019-06-11 09:44:09.907733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Found device 0 with properties: name: Device 66af AMDGPU ISA: gfx906 memoryClockRate (GHz) 1.802 pciBusID 0000:05:00.0
2019-06-11 09:44:09.907784: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1613] Found device 1 with properties: name: Device 66af AMDGPU ISA: gfx906 memoryClockRate (GHz) 1.802 pciBusID 0000:08:00.0
2019-06-11 09:44:09.907836: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1715] Adding visible gpu devices: 0, 1
2019-06-11 09:44:09.907852: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1145] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-06-11 09:44:09.907860: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1151] 0 1
2019-06-11 09:44:09.907866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1164] 0: N N
2019-06-11 09:44:09.907872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1164] 1: N N
2019-06-11 09:44:09.907969: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1288] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 8184 MB memory) -> physical GPU (device: 0, name: Device 66af, pci bus id: 0000:05:00.0)
2019-06-11 09:44:09.987733: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1288] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 8184 MB memory) -> physical GPU (device: 1, name: Device 66af, pci bus id: 0000:08:00.0)
Get train/holdout files.
Pos samples: 217
Pos samples manual: 31
Ndsb3 samples: 12
0 ndsb3 pos labels train
0 ndsb3 neg labels train
0 ndsb3 pos labels holdout
0 ndsb3 neg labels holdout
Edge samples: 159784
Luna samples: 356660
Falsepos LUNA count: 6233
Pos 248 ndsb2 pos: 0 ndsb2 neg: 0
Pos 50 ndsb2 pos: 0 ndsb2 neg: 0
Train count: 561901 , holdout count: 108454
W0611 09:44:12.604604 139787293587200 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:321: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.

W0611 09:44:12.606200 139787293587200 deprecation.py:506] From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:634: calling RandomUniform.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor
W0611 09:44:12.610019 139787293587200 deprecation.py:506] From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:491: calling Constant.__init__ (from tensorflow.python.ops.init_ops) with dtype is deprecated and will be removed in a future version.
Instructions for updating: Call initializer instance with the dtype argument instead of passing it to the constructor
W0611 09:44:12.666750 139787293587200 deprecation.py:506] From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:1047: calling reduce_prod_v1 (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating: keep_dims is deprecated, use keepdims instead
W0611 09:44:12.684238 139787293587200 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/keras/optimizers.py:658: The name tf.train.Optimizer is deprecated. Please use tf.compat.v1.train.Optimizer instead.

W0611 09:44:12.687177 139787293587200 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:2446: The name tf.log is deprecated. Please use tf.math.log instead.

W0611 09:44:12.688659 139787293587200 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:2448: The name tf.nn.sigmoid_cross_entropy_with_logits is deprecated. Please use tf.nn.sigmoid_cross_entropy_with_logits instead.


Layer (type)                            Output Shape               Param #    Connected to
input_1 (InputLayer)                    (None, 32, 32, 32, 1)      0
averagepooling3d_1 (AveragePooling3D)   (None, 16, 32, 32, 1)      0          input_1[0][0]
conv1 (Convolution3D)                   (None, 16, 32, 32, 64)     1792       averagepooling3d_1[0][0]
pool1 (MaxPooling3D)                    (None, 16, 16, 16, 64)     0          conv1[0][0]
conv2 (Convolution3D)                   (None, 16, 16, 16, 128)    221312     pool1[0][0]
pool2 (MaxPooling3D)                    (None, 8, 8, 8, 128)       0          conv2[0][0]
conv3a (Convolution3D)                  (None, 8, 8, 8, 256)       884992     pool2[0][0]
conv3b (Convolution3D)                  (None, 8, 8, 8, 256)       1769728    conv3a[0][0]
pool3 (MaxPooling3D)                    (None, 4, 4, 4, 256)       0          conv3b[0][0]
conv4a (Convolution3D)                  (None, 4, 4, 4, 512)       3539456    pool3[0][0]
conv4b (Convolution3D)                  (None, 4, 4, 4, 512)       7078400    conv4a[0][0]
pool4 (MaxPooling3D)                    (None, 2, 2, 2, 512)       0          conv4b[0][0]
last_64 (Convolution3D)                 (None, 1, 1, 1, 64)        262208     pool4[0][0]
out_class_last (Convolution3D)          (None, 1, 1, 1, 1)         65         last_64[0][0]
out_malignancy_last (Convolution3D)     (None, 1, 1, 1, 1)         65         last_64[0][0]
out_class (Flatten)                     (None, 1)                  0          out_class_last[0][0]
out_malignancy (Flatten)                (None, 1)                  0          out_malignancy_last[0][0]

Total params: 13,758,018
Trainable params: 13,758,018
Non-trainable params: 0


W0611 09:44:12.950066 139787293587200 deprecation_wrapper.py:119] From /usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py:736: The name tf.assign is deprecated. Please use tf.compat.v1.assign instead.

learnrate: 0.001 epoch: 0
2019-06-11 09:44:13.452114: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 7.99G (8581545984 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452147: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 7.19G (7723390976 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452162: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 6.47G (6951051776 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452176: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 5.83G (6255946240 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452193: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 5.24G (5630351360 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452234: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 4.72G (5067316224 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452255: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 4.25G (4560584704 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452278: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 3.82G (4104526080 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452299: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 3.44G (3694073344 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452320: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 3.10G (3324665856 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452341: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 2.79G (2992199168 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452363: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 2.51G (2692979200 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452383: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 2.26G (2423681280 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452404: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 2.03G (2181313024 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452425: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 1.83G (1963181824 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452445: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 1.65G (1766863616 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452465: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 1.48G (1590177280 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452485: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 1.33G (1431159552 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452505: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 1.20G (1288043520 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452526: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 1.08G (1159239168 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452547: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 994.98M (1043315200 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452568: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 895.48M (938983680 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452588: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 805.94M (845085440 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452608: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 725.34M (760577024 bytes) from device: hipError_t(1002)
2019-06-11 09:44:13.452627: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 652.81M (684519424 bytes) from device: hipError_t(1002)
2019-06-11 09:44:15.019331: E tensorflow/stream_executor/rocm/rocm_driver.cc:493] failed to memset memory: HIP_ERROR_InvalidValue
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1354, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1339, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1427, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to memcopy into scratch buffer for device 0
  [[{{node _SOURCE}}]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "step2_train_nodule_detector.py", line 394, in <module>
    train(train_full_set=True, load_weights_path=None, model_name="luna16_full", fold_count=-1, manual_labels=False)
  File "step2_train_nodule_detector.py", line 387, in train
    model.fit_generator(train_gen, len(train_files) / 1, 12, validation_data=holdout_gen, nb_val_samples=len(holdout_files) / 1, callbacks=[checkpoint, checkpoint_fixed_name, learnrate_scheduler])
  File "/usr/local/lib/python3.5/dist-packages/keras/engine/training.py", line 1516, in fit_generator
    callbacks.on_epoch_begin(epoch)
  File "/usr/local/lib/python3.5/dist-packages/keras/callbacks.py", line 62, in on_epoch_begin
    callback.on_epoch_begin(epoch, logs)
  File "/usr/local/lib/python3.5/dist-packages/keras/callbacks.py", line 534, in on_epoch_begin
    K.set_value(self.model.optimizer.lr, lr)
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 1853, in set_value
    get_session().run(assign_op, feed_dict={assign_placeholder: value})
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 125, in get_session
    _initialize_variables()
  File "/usr/local/lib/python3.5/dist-packages/keras/backend/tensorflow_backend.py", line 282, in _initialize_variables
    sess.run(tf.variables_initializer(uninitialized_variables))
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 948, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1171, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1368, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.FailedPreconditionError: Failed to memcopy into scratch buffer for device 0
  [[{{node _SOURCE}}]]
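As a side note, here is a minimal sketch (assuming the TF 1.x / standalone-Keras setup the script already uses via tf.ConfigProto) of capping TensorFlow's upfront device allocation. This is only a hedge against the allocator reserving the whole GPU at startup; it does not shrink the workspace the GEMM conv3d path ultimately requests:

```python
# Hedged mitigation sketch, not a confirmed fix for this issue: let the TF 1.x
# allocator grow on demand instead of reserving the whole device up front.
# The GEMM workspace needed by conv3d is unchanged by this setting.
import tensorflow as tf
from keras import backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate device memory incrementally
# config.gpu_options.per_process_gpu_memory_fraction = 0.7  # or cap a fraction
K.set_session(tf.Session(config=config))
```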

sunway513 commented 5 years ago

Hi @nuu9323226, the error message is not directly associated with the 3D convolution support. Can you run the HIP unit tests so we can see whether they give us some hints? https://github.com/ROCm-Developer-Tools/HIP/tree/master/tests

jerryyin commented 5 years ago

@sunway513 Thanks for letting me know.

After using the new container, the log confirms it is running with conv3d support, which fixes the original issue of not pulling the correct TF version for 3D convolution. The follow-up error is a separate issue that occurs when running conv3d.

Looking at the error code, it looks like we can do very little in the short term. The error code indicates the user's device ran out of memory: the GEMM-based conv3d path requests workspace allocations (up to 7.99 GB in the log above) that exceed what is available on the GPU.

sunway513 commented 5 years ago

@jerryyin, can you provide the MIOpen config that requires 7.99 GB of device memory? cc @whchung, @daniellowell

jerryyin commented 5 years ago

@sunway513 I have no more information than the developer provided. I can get more details if @nuu9323226 lets us know how to reproduce. According to the log, the line below is printed for each memory request from MIOpen; there are plenty of other requests like it as well.

2019-06-11 09:44:13.452114: E tensorflow/stream_executor/rocm/rocm_driver.cc:630] failed to allocate 7.99G (8581545984 bytes) from device: hipError_t(1002)

According to the earlier log, the config has to be one of the layers below (most likely the first or second, since the 7.99G request happens earliest):

conv1 (Convolution3D) (None, 16, 32, 32, 64)
conv2 (Convolution3D) (None, 16, 16, 16, 128)
conv3a (Convolution3D) (None, 8, 8, 8, 256)
conv3b (Convolution3D) (None, 8, 8, 8, 256)
conv4a (Convolution3D) (None, 4, 4, 4, 512)
conv4b (Convolution3D) (None, 4, 4, 4, 512)
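For a rough sense of scale, here is a back-of-the-envelope sketch (my own estimate, not MIOpen's actual accounting) of the im2col buffer a GEMM-based 3D convolution builds for these layers, assuming float32, 3x3x3 kernels, stride 1, SAME padding, and a hypothetical batch size; the real workspace also depends on the backward-pass buffers and on how much of the batch the GEMM path materializes at once:

```python
# Back-of-the-envelope im2col size for a GEMM-based conv3d: one row per output
# voxel, one column per (kernel_volume * in_channels) tap. This is only a lower
# bound on what a GEMM path might request; MIOpen's real workspace differs.
def im2col_bytes(batch, d, h, w, in_ch, kernel=3, dtype_bytes=4):
    rows = batch * d * h * w    # output voxels (stride 1, SAME padding)
    cols = kernel ** 3 * in_ch  # unrolled receptive field
    return rows * cols * dtype_bytes

# Layer inputs taken from the model summary above, with a hypothetical batch of 16.
layers = {
    "conv1": (16, 16, 32, 32, 1),
    "conv2": (16, 16, 16, 16, 64),
    "conv3a": (16, 8, 8, 8, 128),
}
for name, cfg in sorted(layers.items()):
    print("{}: ~{:.0f} MiB im2col buffer".format(name, im2col_bytes(*cfg) / 2 ** 20))
```

The point of the sketch is that the buffer grows linearly with batch size, spatial volume, kernel volume, and input channels, which is why knowing the actual batch size used in the training run matters for pinning down the 7.99G request.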

sunway513 commented 5 years ago

Hi @nuu9323226, could you provide the steps to reproduce this issue for @jerryyin so we can investigate it further?

jerryyin commented 5 years ago

@sunway513 I'm not sure there's value in continuing the investigation, as there is no short-term fix. I see this as more of a feature request for us. If there are plans to fix it, they should be directed to @daniellowell. To confirm:

@daniellowell, could you confirm that the comment above is accurate? To summarize:

- 3D convolution in MIOpen currently falls back to the GEMM path, which is the source of the large workspace requests seen here.
- Reducing that memory would require 3D support in other (non-GEMM) kernels, which is not something that can be fixed on the TensorFlow side in the short term.

If you can confirm the above, please let us know whether you would prefer to treat this as a feature request and close it, or whether there is a short-term plan to reduce the workspace memory consumption for the 3D conv configs reported in this issue.

daniellowell commented 5 years ago

@jerryyin Immediate mode does support 3D convolutions; however, it again falls back to GEMM, since only GEMM supports 3D. We do have plans to implement 3D convolution in more kernels, but it may take one or two releases.

sunway513 commented 5 years ago

Thanks @jerryyin and @daniellowell for the clarifications. Let's hold this ticket and revisit it after MIOpen support is in place. I believe it is valuable to have a real use case from the community, so we can use it to validate our future work :-)

jerryyin commented 5 years ago

Thanks @daniellowell. @sunway513 Hopefully we know how to reproduce it by then.

@nuu9323226 Could you provide instructions for reproducing the issue?

sunway513 commented 4 years ago

Ping @nuu9323226 , can you provide instructions to repro the issue?

jerryyin commented 4 years ago

This is possibly a duplicate of #619: a series of conv3D out-of-memory issues caused by the large GEMM workspace sizes required.

sunway513 commented 4 years ago

Ping @nuu9323226 , can you provide instructions to repro the issue?

emerth commented 4 years ago

Hi Sunway and nuu9323226, I experienced something very similar: the memory MIOpen required to instantiate the BVLC GoogLeNet model under hipCaffe would grow dramatically, until my Radeon VII ran out of memory.

I solved it by deleting everything under ~/.cache/miopen and restarting the training. I know I had some object files in there generated by a previous mixture of HIP and MIOpen versions. I know this is a script-kiddy solution, but I'm an almost totally unreconstructed C programmer, and in the face of all this newfangled ML stuff I am, in fact, a script kiddy.

sunway513 commented 4 years ago

Hi @emerth, thanks for the comments. Every time you upgrade the ROCm build installed on your system, it's recommended to clean up the MIOpen cache under the ~/.cache folder. For details, please refer to the following doc: https://github.com/ROCmSoftwarePlatform/MIOpen#persistent-program-cache
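A minimal sketch of that cleanup, assuming the default per-user cache location (equivalent to `rm -rf ~/.cache/miopen`):

```python
# Remove the MIOpen kernel cache so it is rebuilt on the next run. Assumes the
# default per-user location; adjust the path if a custom cache dir is configured.
import shutil
from pathlib import Path

cache_dir = Path.home() / ".cache" / "miopen"
shutil.rmtree(str(cache_dir), ignore_errors=True)
```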

emerth commented 4 years ago

Hi Peng,

I just wanted to send thanks to you and the ROCm & MIOpen team. With release 2.9, ROCm now runs Caffe (hipCaffe) well enough that I can run deepdream-type code. In the past I would hit code-generation bugs when running the sort of recursive code deepdream uses. I seem to recall you forwarded an issue about this to the MIOpen or Clang people on my behalf last year, and the fix arrived in 2.9. So, thanks!

jerryyin commented 4 years ago

Closing due to inactivity. Feel free to comment/reopen if you are still seeing issues.