1ytic / warp-rnnt

CUDA-Warp RNN-Transducer

Transducer loss leads to memory leak #21

Open YangHao97 opened 2 years ago

YangHao97 commented 2 years ago

Hi, I'm using rnnt-loss and pytorch-lightning to train my model, but I found that the 4D tensors used to calculate the transducer loss accumulate on the GPU: when I check GPU memory in the training step (before a batch starts), there are many 4D tensors left over from previous batches still sitting in GPU memory, which eventually leads to CUDA out of memory. I don't know what went wrong.

[screenshot] gpu_tracker is used to check the GPU memory.

[screenshot] The loss in the training step comes from this code.

[screenshot] This is the resulting GPU memory usage in the training step.

I tried using `del`, `gc.collect()` and `torch.cuda.empty_cache()` everywhere, but none of them helped.
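(The screenshots themselves are not preserved here. For context, a check of the kind gpu_tracker performs can be written in a few lines of PyTorch; the helper name `count_cuda_tensors` below is a hypothetical illustration, not code from the original post.)

```python
import gc
import torch


def count_cuda_tensors():
    """Count live CUDA tensors and their total size, roughly what a
    GPU-memory tracker reports at the start of each training step."""
    count, total_bytes = 0, 0
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                count += 1
                total_bytes += obj.element_size() * obj.nelement()
        except Exception:
            # Some tracked objects raise on attribute access; skip them.
            continue
    return count, total_bytes


# Call at the start of each training step; if the number of 4D tensors
# keeps growing across batches, their autograd graphs are being kept alive.
n, size = count_cuda_tensors()
print(f"live CUDA tensors: {n}, ~{size / 1024**2:.1f} MiB")
```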

1ytic commented 2 years ago

Hey @YangHao97, maybe the issue is with the variable transducer_loss. Do not store it directly in ret. Try to reuse it and return a numpy array, or something like that.
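A minimal sketch of what this suggestion could look like in a pytorch-lightning `training_step`; the `forward` return values, the commented-out `ret` dict, and the logging call are assumptions reconstructed from the discussion above, not code from this repository.

```python
import torch
from warp_rnnt import rnnt_loss


def training_step(self, batch, batch_idx):
    # Hypothetical shapes: log_probs is the 4D (N, T, U, V) log-softmax
    # output of the joint network mentioned in the issue.
    log_probs, labels, frames_lengths, labels_lengths = self.forward(batch)

    transducer_loss = rnnt_loss(
        log_probs, labels, frames_lengths, labels_lengths,
        average_frames=False, reduction="mean",
    )

    # Leaky pattern: storing the loss tensor itself in a dict that outlives
    # the step keeps its autograd graph (and the 4D log_probs tensor) alive.
    # ret = {"loss": transducer_loss, "log": {"train_loss": transducer_loss}}

    # Safer pattern: return the tensor only where backward() needs it and
    # log a detached copy (or transducer_loss.item()) everywhere else.
    self.log("train_loss", transducer_loss.detach())
    return transducer_loss
```

The tensor returned for backpropagation is freed once the optimizer step is done; anything cached for logging or metrics should be detached from the graph (or converted to a Python float) so it does not pin the intermediate activations.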

YangHao97 commented 2 years ago

Hi @1ytic, thanks for your reply. I changed the code in figure 1 to the code in figure 2, and it works now. I'm not sure whether different versions of PyTorch, Lightning, and so on led to this problem.

[screenshot] figure 1

[screenshot] figure 2

1ytic commented 2 years ago

I still think the issue is with the variable transducer_loss. You have to free it instead of storing it somewhere else. Anyway, if you have found a workaround, I'm happy for you.

YangHao97 commented 2 years ago

Thank you so much!