Hello. I have recently been testing ESMFold's `Attention` module for separate use cases, and I believe I have discovered a (potential) source of GPU memory leaks. In my testing, while monitoring the ratio of currently allocated GPU memory to the maximum GPU memory ever allocated via `print(f"GPU memory ratio: {torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated()}")`, I noticed that unless I change `q = self.rescale_factor * q` to `q *= self.rescale_factor`, I hit an out-of-memory error in PyTorch during the backward pass after approximately 500 training steps (in my particular use case). The relevant line is here: https://github.com/facebookresearch/esm/blob/c9c7d4f0fec964ce10c3e11dccec6c16edaa5144/esm/esmfold/v1/misc.py#L188. Would anyone happen to have insight into why this might occur in specific use cases, or could this phenomenon affect ESMFold more generally?
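For concreteness, here is a minimal, self-contained sketch of the kind of change I am describing. `ToyAttention` is a simplified single-head stand-in that I wrote purely for illustration, not the actual ESMFold `Attention` module from `misc.py`; only the rescaling step and the memory-ratio printout mirror what I described above.

```python
import torch
import torch.nn as nn


class ToyAttention(nn.Module):
    """Minimal single-head attention, used only to illustrate the rescaling step.

    This is a simplified stand-in for illustration, not ESMFold's Attention module.
    """

    def __init__(self, embed_dim: int, in_place_rescale: bool = False):
        super().__init__()
        self.q_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.k_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.v_proj = nn.Linear(embed_dim, embed_dim, bias=False)
        self.o_proj = nn.Linear(embed_dim, embed_dim)
        self.rescale_factor = embed_dim ** -0.5
        self.in_place_rescale = in_place_rescale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q = self.q_proj(x)
        k = self.k_proj(x)
        v = self.v_proj(x)

        if self.in_place_rescale:
            # In-place variant: the rescaled queries reuse q's storage,
            # so no additional tensor is allocated at this step.
            q *= self.rescale_factor
        else:
            # Out-of-place variant (analogous to the linked line): a new
            # tensor is allocated to hold the rescaled queries.
            q = self.rescale_factor * q

        attn = torch.softmax(torch.einsum("...qc,...kc->...qk", q, k), dim=-1)
        return self.o_proj(torch.einsum("...qk,...kc->...qc", attn, v))


if __name__ == "__main__" and torch.cuda.is_available():
    # Toggle in_place_rescale to compare the two variants under the same loop.
    model = ToyAttention(embed_dim=64, in_place_rescale=False).cuda()
    x = torch.randn(8, 128, 64, device="cuda")
    for step in range(3):
        model(x).sum().backward()
        model.zero_grad(set_to_none=True)
        print(
            f"GPU memory ratio: "
            f"{torch.cuda.memory_allocated() / torch.cuda.max_memory_allocated():.3f}"
        )
```

In this sketch the in-place multiply is applied directly to a linear-layer output, where autograd tolerates it; I am not claiming the same substitution is safe in every context, only that it is the change I experimented with.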
False alarm. It turns out my out-of-memory error was (most likely) caused by something external to ESM, so I believe this report is no longer relevant (or valid) to the ESM repository.