archersama / IntTower

Source code of the CIKM 2022 and DLP-KDD 2022 workshop Best Paper: "IntTower: The Next Generation of Two-Tower Model for Pre-Ranking System"
Apache License 2.0

GPU Memory Gradually Increases Leading to CUDA Out of Memory Error During Model Training #11

Closed. hyungtaik-oh closed this issue 2 months ago.

hyungtaik-oh commented 2 months ago

Thank you for providing a good paper and code.

I used the packages from the requirements.txt you provided, in a Python 3.7 environment. However, I am encountering an issue where GPU memory usage gradually increases during model training until it eventually leads to a CUDA out of memory error.
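
The growth can be seen with a simple check like the following inside the training loop (a rough sketch, printed once per epoch):

import torch

# Both values should stay roughly flat across epochs if nothing is leaking;
# in my runs they keep climbing until the OOM error.
print(f"allocated = {torch.cuda.memory_allocated() / 2**20:.1f} MiB, "
      f"reserved  = {torch.cuda.memory_reserved() / 2**20:.1f} MiB")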

Is there any solution for this?

archersama commented 2 months ago

Can you provide some more information? This situation rarely happens in PyTorch; it looks like a TensorFlow error.

hyungtaik-oh commented 2 months ago

Thank you for the quick reply!

The Python library versions I used are as follows:

tensorflow==2.8.0
torch==1.10.0
torchkeras==3.0.2
torchmetrics==0.11.4
torchsummary==1.5.1
torchvision==0.12.0

archersama commented 2 months ago

Can you provide some more information? What GPU are you using, and what is your batch size?

hyungtaik-oh commented 2 months ago

My CUDA version is 12.1, and the batch size is 256.

archersama commented 2 months ago

lol, I meant: what GPU model do you use, and how much GPU RAM does it have?

hyungtaik-oh commented 2 months ago

Ah, I see. I tested it on an RTX 2080 Ti (12 GB) and an RTX 4090 (24 GB), but the same error occurs on both devices. 😭

archersama commented 2 months ago

Try this code; I believe it is a TensorFlow memory error:

import tensorflow as tf

# Enable memory growth so TensorFlow allocates GPU memory on demand
# instead of reserving the whole card up front.
gpus = tf.config.experimental.list_physical_devices('GPU')
if gpus:
    try:
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
    except RuntimeError as e:
        # Memory growth must be set before any GPU has been initialized.
        print(e)
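
With memory growth enabled, TensorFlow allocates GPU memory on demand instead of reserving nearly the whole card up front, which matters here because both torch and tensorflow are installed and can end up competing for the same GPU.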

hyungtaik-oh commented 2 months ago

Thank you for your help with the above issue!

When using PyTorch, the problem was that the memory for the model output was not being released between training iterations. I solved it by using the model's output directly when computing the loss.

In base_tower.py

out = model(x)   # forward pass; no extra reference to the output is kept
optim.zero_grad()
loss = loss_func(out[0].squeeze(), y.squeeze(), reduction='sum')  # use the output directly in the loss
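
For anyone hitting the same symptom, the general pattern behind this kind of leak is keeping a reference to a graph-attached tensor across iterations, for example accumulating the raw loss tensor instead of loss.item(). A rough sketch of the difference, with placeholder names (train_one_epoch, loader, etc. are illustrative, not the actual code in base_tower.py):

def train_one_epoch(model, loader, optim, loss_func):
    # Illustrative loop; names are placeholders, not the actual IntTower code.
    total_loss = 0.0
    for x, y in loader:
        out = model(x)
        optim.zero_grad()
        loss = loss_func(out[0].squeeze(), y.squeeze(), reduction='sum')
        loss.backward()
        optim.step()

        # Leaky pattern: `total_loss += loss` keeps every iteration's
        # computation graph alive, so GPU memory grows step by step.
        # Safe pattern: convert to a Python float first.
        total_loss += loss.item()
    return total_loss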