BSDExabio / PSP

protein structure prediction

Develop workaround for GPU memory limit #4

Closed markcoletti closed 2 years ago

markcoletti commented 3 years ago

Each V100 on Summit has 32 GB of memory, which limits the sizes of proteins that we can tackle. However, we should be able to treat all 6 GPUs as a single, virtual GPU to overcome this memory limitation. Unfortunately, AlphaFold has been difficult to get working with the multi-GPU model.

markcoletti commented 3 years ago

@proutrc I took a stab at this and almost certainly got it wrong. :(

proutrc commented 3 years ago

> Each V100 on Summit has 32 GB of memory, which limits the sizes of proteins that we can tackle. However, we should be able to treat all 6 GPUs as a single, virtual GPU to overcome this memory limitation. Unfortunately, AlphaFold has been difficult to get working with the multi-GPU model.

A few corrections. It is confusing, since the GitHub issue I referenced does make it seem like what you say because of its title.

Here is the Github issue I referenced in our Slack: https://github.com/deepmind/alphafold/issues/30

In reality, we can't group GPUs and parallelize the model across them. This is not supported by JAX, as I understand it (that would likely be pretty complicated). So each GPU gets its own copy of the model and runs it independently. Each standard compute node GPU on Summit has 16 GB of device memory; the high-memory nodes have 32 GB of device memory.

We can try to utilize the NVIDIA Unified Memory capability, though, which lets the GPU address an additional 16 GB of host memory (16 GB of device memory + 16 GB of host memory). We are currently testing this. Further, there is a possibility that the XLA_PYTHON_CLIENT_MEM_FRACTION setting could help us if we have pre-allocation issues.
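For reference, here is a minimal sketch of the settings being discussed. Only XLA_PYTHON_CLIENT_MEM_FRACTION comes from the comment above; TF_FORCE_UNIFIED_MEMORY and the specific fraction value are assumptions based on the usual JAX/XLA knobs and would need tuning on Summit.

```python
# Sketch only: assumed environment settings for unified memory + XLA allocation.
# Both variables must be set before JAX initializes the GPU.
import os

# Assumption: lets XLA back GPU allocations with host memory (NVIDIA Unified
# Memory), so allocations can spill past the 16 GB of device memory.
os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"

# From the discussion above: controls how much memory the XLA client may
# pre-allocate. Values > 1.0 only make sense with unified memory enabled;
# the 2.0 here is an illustrative guess, not a tested value.
os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "2.0"

import jax  # import only after the environment is configured

print(jax.devices())
```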

proutrc commented 3 years ago

It is worth noting we might just be able to run the larger sequences on the high-memory nodes (the nodes with 32 GB of device memory per GPU; there are only 54 of them). So in-depth troubleshooting may not be necessary for this.

markcoletti commented 3 years ago

I've done two runs and, as far as I know, I've not encountered memory errors. The latest run is in /gpfs/alpine/bip198/scratch/mcoletti/runs/issue-7, so I'll be looking through the worker_error* logs to see if anything obvious pops up.
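For what it's worth, a quick sketch of how those logs could be scanned for memory problems. The run directory and worker_error* pattern come from the comment above; the search strings are only guesses at typical CUDA/XLA out-of-memory messages.

```python
# Scan worker_error* logs for memory-related failures (search terms are assumptions).
import glob
import re

LOG_DIR = "/gpfs/alpine/bip198/scratch/mcoletti/runs/issue-7"
PATTERN = re.compile(r"out of memory|RESOURCE_EXHAUSTED|CUDA_ERROR", re.IGNORECASE)

for path in sorted(glob.glob(f"{LOG_DIR}/worker_error*")):
    with open(path, errors="replace") as log:
        for lineno, line in enumerate(log, start=1):
            if PATTERN.search(line):
                print(f"{path}:{lineno}: {line.rstrip()}")
```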

markcoletti commented 3 years ago

The latest run I performed (job 1382936) had no memory errors after applying the JAX flags that @proutrc recommended. However, we are still worried that as-yet-unseen proteins may still yield memory errors, or even novel error types.

markcoletti commented 2 years ago

This run ran OK. It was under issue-8. (I.e., I combined the two issues into the one Summit run.) Closing.