Closed: damiankucharski closed this issue 1 year ago
Hi @damiankucharski, our GPU has 48GB. Do you use the same batch size? You could try reducing the batch size, for example from 10 to 8, or even smaller.
Hi @YixingHuang, I managed to fix the issue without changing the batch size. It seems that TensorFlow by default allocates all the available GPU memory, which can cause issues in some cases like mine. I have submitted a small PR with a piece of code that solved the issue for me. You can consider merging it if you find it useful. https://github.com/YixingHuang/DeepMedicPlus/pull/10
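For anyone hitting the same problem: the exact code in the PR isn't quoted in this thread, but the usual way to stop TensorFlow from pre-allocating the whole GPU is to enable memory growth before any GPU work starts. A minimal sketch, assuming TensorFlow 2.x (this is the standard mechanism, not necessarily the exact lines from the PR):

```python
import tensorflow as tf

# Ask TensorFlow to allocate GPU memory on demand instead of reserving
# the entire device at startup. This must run before the first GPU
# operation, otherwise set_memory_growth raises a RuntimeError.
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
```

On TensorFlow 1.x, the equivalent is passing `tf.ConfigProto(gpu_options=tf.GPUOptions(allow_growth=True))` to the session.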
Good to know. PR has been approved. Thank you.
Thank you @YixingHuang. I think you have to merge it yourself; I do not have permission to do that.
Hi @YixingHuang, I am getting out-of-memory errors when training on slightly larger datasets. Sometimes the training runs do not fail, but often they do; it seems more or less random. I have 225 training subjects, so the dataset is not that large. I am using a 40GB NVIDIA A100 GPU, so memory shouldn't be a problem. Do you think something in the code may be causing poor memory management? I am attaching the log (truncated due to length) of one of the failed training runs.