Closed hyungtaik-oh closed 2 months ago
Can you provide some more information? This situation rarely happens in PyTorch; it looks like a TensorFlow error.
Thank you for the quick reply!
The Python library versions I used are as follows: tensorflow==2.8.0 torch==1.10.0 torchkeras==3.0.2 torchmetrics==0.11.4 torchsummary==1.5.1 torchvision==0.12.0
Can you provide some more information? What's your GPU version and batch size?
My CUDA Version: 12.1, and the batch size is 256.
lol, I mean what type of GPU do you use, and how much GPU RAM does it have?
Ah, I see. I tested it on the RTX 2080 Ti, 12GB, and the RTX 4090, 24GB, but the same error occurs on both devices. 😭
Try this code; I believe it is a TensorFlow memory error:

import tensorflow as tf

gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        # Allocate GPU memory on demand instead of reserving it all up front.
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before the GPUs have been initialized.
        print(e)
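If the leak is on the PyTorch side instead, one way to confirm it is to log the memory PyTorch has allocated after each training step and watch whether it grows. A small diagnostic sketch (the helper name and loop placement are illustrative, not from the repo; `torch.cuda.memory_allocated` is the real API):

```python
import torch

def log_gpu_memory(step):
    # Reports memory currently held by PyTorch tensors on the CUDA device.
    if torch.cuda.is_available():
        mb = torch.cuda.memory_allocated() / 1024**2
        print(f"step {step}: {mb:.1f} MiB allocated")
    else:
        print(f"step {step}: no CUDA device available")
```

Calling this once per iteration makes a leak visible as a steadily climbing number rather than a flat one.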
Thank you for your concern about the above issue!
When using PyTorch, the problem was that GPU memory held by the model outputs was not automatically freed between training iterations. I solved it by passing the model's output directly into the loss computation.
In base_tower.py:
out = model(x)
optim.zero_grad()
loss = loss_func(out[0].squeeze(), y.squeeze(), reduction='sum')
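For context, a minimal sketch of the leak-free loop pattern (the model, data, and `mse_loss` here are stand-ins, not the repo's actual `base_tower.py` setup): feed the output straight into the loss, and detach any scalar you keep for logging with `.item()` so the computation graph can be freed each iteration.

```python
import torch
import torch.nn.functional as F

# Hypothetical tiny model and data, standing in for the real setup.
model = torch.nn.Linear(8, 1)
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(256, 8)
y = torch.randn(256)

running_loss = 0.0
for _ in range(3):  # a few iterations for illustration
    out = model(x)
    optim.zero_grad()
    # Use the model output directly in the loss call.
    loss = F.mse_loss(out.squeeze(), y.squeeze(), reduction='sum')
    loss.backward()
    optim.step()
    # .item() returns a Python float, so no graph-attached tensor is
    # retained across iterations.
    running_loss += loss.item()
```

Accumulating `loss` itself (instead of `loss.item()`) keeps every iteration's graph alive, which is a common cause of GPU memory growing until a CUDA out-of-memory error.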
Thank you for providing a good paper and code.
I used the packages from the requirements.txt you provided, in a Python 3.7 environment. However, I am encountering an issue where GPU memory usage gradually increases during model training until it eventually leads to a CUDA out-of-memory error.
Is there any solution for this?