ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Random Extreme RAM Usage and HIP_ERROR_OutOfMemory #1674

Open SamuelBailey opened 2 years ago

SamuelBailey commented 2 years ago

Issue Type

Bug

Source

binary

Tensorflow Version

tf 2.8.0

Custom Code

Yes

OS Platform and Distribution

Linux Ubuntu 20.04

Mobile device

N/A

Python version

3.7.13 (also experienced on 3.8.10)

Bazel version

N/A

GCC/Compiler version

N/A

CUDA/cuDNN version

ROCm 5.1.1 (using tensorflow installed from pip: tensorflow-rocm)

GPU model and memory

Vega 56 with 8 GB VRAM; system has 16 GB DRAM (not compiled from source)

Current Behaviour?

When using TensorFlow on my GPU with ROCm, RAM usage randomly jumps up by about 9 GiB for a few seconds and then drops back down, even with a very small neural network model. This behaviour occurs both when using the rocm/tensorflow Docker image and when running natively, and does not appear to depend on the type of model being run. As soon as model.fit() is called, the RAM spikes begin to occur.

The spikes appear to be random: once model.fit() has been called, regardless of the contents or size of the model, they begin to occur. They become less frequent after the model.fit() call has finished, but often do not stop completely until the system is restarted. They happen particularly often when clicking on another window, as though (at a guess) the GPU needs to transfer a large chunk of memory in order to context switch.

The code submitted (run via a Jupyter notebook) causes the RAM spikes. When I restarted the Python kernel and ran it a second time, the spikes started as soon as model.compile() ran, rather than waiting until model.fit(), and HIP reported HIP_ERROR_OutOfMemory.

While RAM usage is high, the rest of the computer briefly becomes almost completely unusable. I have experienced this issue both with a Jupyter notebook in VS Code and with a Jupyter notebook in a web browser.

I would expect RAM usage to climb very gradually during program execution, not to see 2-3 second periods in which several gigabytes of RAM are allocated and then freed, freezing the system.
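For reference, the log below shows TensorFlow failing an attempt to allocate almost the full 8 GB of VRAM up front. A minimal sketch of enabling memory growth, so the allocator reserves VRAM on demand instead, uses the standard tf.config API; this is only a diagnostic suggestion, not part of the original setup:

import tensorflow as tf

# Ask TensorFlow to grow GPU memory on demand instead of reserving
# (almost) all VRAM at startup, which is the default behaviour.
# This must run before the GPU is first used.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

The same effect can be obtained by setting the environment variable TF_FORCE_GPU_ALLOW_GROWTH=true before starting Python.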

Standalone code to reproduce the issue

import tensorflow as tf
import keras

import numpy as np

# Tiny toy dataset: a single 64-element input and an all-ones target.
model_input = np.arange(64).astype(np.float32)
model_target = np.ones_like(model_input)

# Minimal two-layer dense model; the RAM spikes occur regardless of model size.
model = tf.keras.Sequential()
model.add(tf.keras.layers.Dense(64, input_shape=(64,), activation="relu"))
model.add(tf.keras.layers.Dense(64, activation="relu"))
model.compile(optimizer="adam", loss="mse")
model.summary()

# Overfit the single sample for many epochs to keep the GPU busy.
model.fit(np.array([model_input]), np.array([model_target]), batch_size=1, epochs=10000)
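To correlate the host RAM spikes with training progress, the repro above could be extended with a small monitoring callback. This is a sketch assuming psutil is installed; the class name is hypothetical:

import psutil
import tensorflow as tf

class HostRamLogger(tf.keras.callbacks.Callback):
    """Print host RAM usage at the end of each epoch."""
    def on_epoch_end(self, epoch, logs=None):
        used_gib = psutil.virtual_memory().used / 2**30
        print(f"epoch {epoch}: host RAM used = {used_gib:.2f} GiB")

# e.g. model.fit(..., callbacks=[HostRamLogger()], verbose=0)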

Relevant log output

2022-05-06 20:09:35.897678: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.352517: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.352632: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.354065: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-05-06 20:09:36.354792: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.354970: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.355050: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.355843: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.355939: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.356022: I tensorflow/stream_executor/rocm/rocm_gpu_executor.cc:832] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-05-06 20:09:36.356383: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 7676 MB memory:  -> device: 0, name: Radeon RX Vega, pci bus id: 0000:03:00.0
2022-05-06 20:09:36.724124: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 7.50G (8048869376 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724188: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 6.75G (7243982336 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724218: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 6.07G (6519583744 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724245: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 5.46G (5867624960 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724272: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 4.92G (5280862208 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724298: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 4.43G (4752775680 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724325: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 3.98G (4277498112 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724351: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 3.58G (3849748224 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724377: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 3.23G (3464773376 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724403: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 2.90G (3118296064 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724429: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 2.61G (2806466304 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724456: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 2.35G (2525819648 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724482: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 2.12G (2273237504 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724508: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 1.91G (2045913856 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724534: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 1.71G (1841322496 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724560: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 1.54G (1657190144 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724587: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 1.39G (1491471104 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724614: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 1.25G (1342323968 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724640: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 1.12G (1208091648 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724666: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 1.01G (1087282432 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724692: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 933.22M (978554368 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724719: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 839.90M (880698880 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724745: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 755.91M (792628992 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724771: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 680.32M (713366272 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724797: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 612.29M (642029824 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724823: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 551.06M (577826816 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724849: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 495.95M (520044288 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724876: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 446.36M (468039936 bytes) from device: HIP_ERROR_OutOfMemory
2022-05-06 20:09:36.724902: E tensorflow/stream_executor/rocm/rocm_driver.cc:616] failed to allocate 401.72M (421235968 bytes) from device: HIP_ERROR_OutOfMemory
SamuelBailey commented 2 years ago

The first screenshot shows RAM usage while model.compile() is running, before the notebook had reached the model.fit() call. The notebook had been run previously, but the Python kernel had been reset, so all memory from the previous run should have been freed. The log output in my bug report occurred after this call to model.compile().

Screenshot_20220506_201050

The second screenshot shows random RAM spikes that occur even after closing the Jupyter notebook. On a fresh boot these do not occur until a Jupyter notebook using TensorFlow with ROCm has been run.

Screenshot_20220506_201648
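For the spikes that persist after the notebook has been closed, a minimal standalone sampler left running in a terminal could timestamp them. Again a sketch assuming psutil is available, not something from the original report:

import time
import psutil

# Sample host RAM once per second to catch the transient multi-GiB spikes.
while True:
    used_gib = psutil.virtual_memory().used / 2**30
    print(f"{time.strftime('%H:%M:%S')}  host RAM used: {used_gib:.2f} GiB")
    time.sleep(1)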

dipietrantonio commented 1 year ago

Hi there, I am experiencing the same issue and the same log messages.