facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins
MIT License

Potential source of GPU memory leak in `ESMFold` #543

Closed amorehead closed 1 year ago

amorehead commented 1 year ago

Hello.

I have recently been testing ESMFold's `Attention` module for separate use cases, and I believe I have discovered a potential source of GPU memory leaks. While monitoring the ratio of currently allocated GPU memory to the maximum historically allocated, via `print(f"GPU memory ratio: {torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated()}")`, I noticed that unless I change `q = self.rescale_factor * q` to the in-place `q *= self.rescale_factor`, I hit an out-of-memory error in PyTorch during the backward pass after approximately 500 training steps (in my particular use case). Does anyone have insight into why this might occur in specific use cases, or could this phenomenon affect ESMFold more generally?

https://github.com/facebookresearch/esm/blob/c9c7d4f0fec964ce10c3e11dccec6c16edaa5144/esm/esmfold/v1/misc.py#L188
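For reference, the allocation difference between the two forms is easy to demonstrate. The sketch below uses NumPy (rather than CUDA tensors) purely to illustrate the semantics: the out-of-place form `rescale_factor * q` allocates a fresh buffer on every call, while the in-place `q *= rescale_factor` reuses the existing one. Note that in PyTorch the trade-off is subtler, since in-place ops can overwrite values that autograd needs for the backward pass; the function names here are hypothetical, not part of ESMFold.

```python
import numpy as np

def scale_out_of_place(q, rescale_factor):
    # Allocates a brand-new buffer each call; the old one is only
    # reclaimed once nothing references it (e.g. autograd's saved tensors).
    return rescale_factor * q

def scale_in_place(q, rescale_factor):
    # Mutates q's existing buffer; no new allocation is made.
    q *= rescale_factor
    return q

q1 = np.ones((4, 8), dtype=np.float32)
out1 = scale_out_of_place(q1, 2.0)
print(out1 is q1)  # False: a new array was allocated

q2 = np.ones((4, 8), dtype=np.float32)
out2 = scale_in_place(q2, 2.0)
print(out2 is q2)  # True: the same buffer was reused
```

In a training loop, repeated out-of-place scaling is normally harmless because the old buffers are freed promptly; memory only grows if something (such as the autograd graph) keeps references to them alive across steps.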

amorehead commented 1 year ago

False alarm. I discovered that my out-of-memory issue was most likely caused by an external factor, so I believe this issue is no longer relevant to the ESM repository.