neqkir opened this issue 3 years ago
As per the log posted, the model seems to be attempting to allocate 40 GiB(!) of GPU memory:
2021-10-15 10:43:51.203149: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 40.73GiB (rounded to 43731715328)requested by op _EagerConst
Need to figure out what is triggering this memory allocation.
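For context, a whole-corpus array converted to a single eager tensor is a common way to reach an allocation of this size. A rough illustration (the shapes below are made up for scale, not taken from the issue):

n_sequences, seq_len, vocab_size = 3_400, 100, 32_000   # hypothetical shapes
bytes_needed = n_sequences * seq_len * vocab_size * 4   # dense float32 one-hot
print(bytes_needed / 2**30, "GiB")  # ~40.5 GiB before TensorFlow even sees it

Feeding such an array to model.fit (or tf.constant) materializes it as one tensor, which would show up in the log as a single huge _EagerConst allocation like the one above.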
@neqkir, could you rerun with the env var MIOPEN_ENABLE_LOGGING=1 set and post the resulting log file here? (One way to set it is sketched below.)
thanks
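For reference, one way to set that variable from Python before TensorFlow loads the ROCm runtime (setting it in the shell before launching the script works just as well):

import os
os.environ["MIOPEN_ENABLE_LOGGING"] = "1"  # set before the runtime reads it

import tensorflow as tf  # imported after the variable is in place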
It looks to me like the arrays you are working with are built from the entire corpus. When using an accelerator you have to think a bit more about how to handle your data if you have large amounts of it.
Your MI100s have 32 GB of memory onboard, so you are blowing past that. It also looks like you have two, so you'll want to make the most of both of them.
Take a look at a few of these guides to help you refactor your code for GPU acceleration. You will likely want to create a tf.data dataset and use a mirrored strategy to make the best use of your hardware (a minimal sketch follows the links below).
https://www.tensorflow.org/guide/gpu
https://www.tensorflow.org/guide/distributed_training
https://www.tensorflow.org/guide/data
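A minimal sketch of that combination, assuming integer-encoded sequences; the shapes, layer sizes, and batch size are placeholders, not values from the issue:

import numpy as np
import tensorflow as tf

# Placeholder data standing in for the corpus arrays.
x = np.random.randint(0, 1000, size=(10_000, 100))
y = np.random.randint(0, 1000, size=(10_000,))

# Stream batches to the model; expensive per-example transforms (one-hot,
# for instance) can be deferred to .map so they run per batch, not corpus-wide.
dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(10_000)
           .batch(64)
           .prefetch(tf.data.AUTOTUNE))

# MirroredStrategy replicates the model across all visible GPUs.
strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(1000, 64),
        tf.keras.layers.LSTM(128),
        tf.keras.layers.Dense(1000),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

model.fit(dataset, epochs=1)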
I am facing the same issue. @neqkir, were you able to solve it? I would appreciate it if you could post your solution here.
@neqkir @reyhaneh-92 I am facing the same issue; please update here if you were able to solve it.
I am also facing the same issue. Moreover, the same code was working with TensorFlow 2.4 and started throwing this error after I upgraded to TensorFlow 2.10.
@mdtalibahmad I was able to resolve the issue after completely uninstalling CUDA, Python, and all dependencies and reinstalling everything with the correct versions. I even installed the Visual C++ update for the latest CUDA.
I have the same problem. When I run my code locally in a Jupyter notebook it works, but when I move the code to a server with the same environment I get this error. Please help me resolve it; thank you for your valuable time.
Use this code after loading your libraries to let TensorFlow allocate GPU memory incrementally (memory growth) instead of reserving it all up front:
import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Allocate GPU memory on demand instead of reserving it all at startup
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before any GPU has been initialized
        print(e)
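A related option from the same tf.config API, added here as an aside: instead of growth, cap how much memory TensorFlow may claim on a device (the 4096 MB limit is arbitrary):

import tensorflow as tf

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    # Hard-cap TensorFlow at 4 GiB on the first GPU; like memory growth,
    # this must be configured before the GPU is initialized.
    tf.config.set_logical_device_configuration(
        gpus[0],
        [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])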
> @mdtalibahmad I was able to resolve the issue after completely uninstalling CUDA, Python, and all dependencies and reinstalling everything with the correct versions. I even installed the Visual C++ update for the latest CUDA.
Hello, what is the correct version? I'm using Python 3.8, TF 2.10, CUDA 11.2, cuDNN 8.1.0, and still get the same issue. Could you elaborate on which versions worked for you?
> @mdtalibahmad I was able to resolve the issue after completely uninstalling CUDA, Python, and all dependencies and reinstalling everything with the correct versions. I even installed the Visual C++ update for the latest CUDA.
Please list the versions of all dependencies that work correctly for you.
Modifying the batch size fixed the error for me.
> Modifying the batch size fixed the error for me.
Did you reduce or increase the batch size?
> Did you reduce or increase the batch size?
Increased it.
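For anyone trying the same fix, the batch size is usually set either on the tf.data pipeline or directly in fit; a minimal sketch (the value 64 is arbitrary):

import tensorflow as tf

# Either batch the dataset...
ds = tf.data.Dataset.range(1_000).batch(64)

# ...or pass batch_size when fitting NumPy arrays (not both):
# model.fit(x_train, y_train, batch_size=64)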
An RNN that runs well on the CPU gets this apparent "out of memory" error on the GPU.
I run the code here https://github.com/neqkir/bible-like-text-generation/blob/main/word-based/word_rnn_bible_lstm.py
System information