[Open] YangHao97 opened this issue 2 years ago
Hey @YangHao97, maybe the issue is with the variable `transducer_loss`. Do not store it directly in `ret`. Try to reuse it and return a numpy array, or something like this.
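A minimal sketch of the pattern being suggested, assuming a standard PyTorch Lightning `training_step`; the helper name `compute_transducer_loss` is a placeholder, not code from this thread:

```python
# Sketch only: keep the attached loss solely under the "loss" key for backprop,
# and store anything else as a detached scalar / numpy value so no reference to
# the autograd graph (and the large 4D tensor it saves) survives the step.
def training_step(self, batch, batch_idx):
    transducer_loss = self.compute_transducer_loss(batch)  # hypothetical helper

    ret = {
        "loss": transducer_loss,                                  # used for backward
        "transducer_loss_value": transducer_loss.detach().cpu().numpy(),
        # or simply: transducer_loss.item()
    }
    return ret
```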
Hi @1ytic, thanks for your reply.
I changed the code shown in figure 1 to the code shown in figure 2, and it works now. I'm not sure whether different versions of PyTorch, Lightning, and so on caused this problem.
figure 1
figure 2
I still think the issue is with the variable `transducer_loss`. You have to free it instead of storing it somewhere else. Anyway, if you have found a workaround, I'm happy for you.
Thank you so much!
Hi, I'm using rnnt-loss and pytorch-lightning to train my model, but I found that the 4D tensor used to calculate the transducer loss accumulates on the GPU. When I check the GPU memory in training_step (before a batch starts), there are many 4D tensors (from previous batches) still sitting in GPU memory, which eventually leads to CUDA out of memory. I don't know what went wrong.
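For context, the 4D tensor here is presumably the (N, T, U+1, V) log-probability lattice that the transducer loss consumes. Assuming the warp-rnnt PyTorch binding (argument names may differ between versions), the call looks roughly like this:

```python
import torch
from warp_rnnt import rnnt_loss

N, T, U, V = 8, 200, 30, 500                       # batch, frames, target length, vocab
log_probs = torch.randn(N, T, U + 1, V).log_softmax(dim=-1).cuda()   # the big 4D tensor
labels = torch.randint(1, V, (N, U), dtype=torch.int32).cuda()
frames_lengths = torch.full((N,), T, dtype=torch.int32).cuda()
labels_lengths = torch.full((N,), U, dtype=torch.int32).cuda()

loss = rnnt_loss(log_probs, labels, frames_lengths, labels_lengths,
                 average_frames=False, reduction="mean")
# As long as `loss` (or anything holding its graph) stays referenced, `log_probs`
# and the joint-network activations behind it stay allocated on the GPU.
```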
`gpu_tracker` is used to check the GPU memory.
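The `gpu_tracker` code itself isn't shown in the thread; a rough equivalent that enumerates live CUDA tensors (a common way to spot the leftover 4D tensors) could look like this:

```python
import gc
import torch

def report_cuda_tensors():
    """Print every CUDA tensor that is still referenced somewhere in the process."""
    for obj in gc.get_objects():
        try:
            if torch.is_tensor(obj) and obj.is_cuda:
                print(type(obj).__name__, tuple(obj.shape), obj.dtype)
        except Exception:
            # some objects raise when inspected during the GC traversal; skip them
            pass
```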
The loss in training_step comes from this:
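The screenshot isn't reproduced here. A guess at the kind of pattern that would match the reported behaviour (hypothetical code, not the poster's) is a training_step that keeps a reference to the attached loss tensor in something that outlives the step:

```python
# Hypothetical sketch of the problematic pattern, not the actual code from the thread.
def training_step(self, batch, batch_idx):
    transducer_loss = self.compute_transducer_loss(batch)   # placeholder helper

    # Storing the *attached* tensor in the returned dict keeps a reference to the
    # autograd graph; in older Lightning versions these outputs are retained for
    # epoch-end hooks, so each batch's 4D log-probability tensor can pile up on GPU.
    ret = {"loss": transducer_loss, "transducer_loss": transducer_loss}
    return ret
```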
This is the resulting GPU memory usage in training_step.
I tried using `del`, `gc.collect()` and `torch.cuda.empty_cache()` everywhere, but none of them helped.
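For what it's worth, those calls can only release memory for tensors that have no remaining references; they cannot free a tensor that is still reachable, for example through an output dict that is kept around. A small illustration (requires a CUDA device; the ~100 MB figure is just for this example shape):

```python
import gc
import torch

x = torch.randn(8, 200, 31, 500, device="cuda")    # ~100 MB 4D tensor
kept = {"log_probs": x}                             # a surviving reference, e.g. in ret

del x
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())                # still ~100 MB: `kept` holds the tensor

del kept
gc.collect()
torch.cuda.empty_cache()
print(torch.cuda.memory_allocated())                # near zero once the last reference is gone
```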