HeliXonProtein / OmegaFold

OmegaFold Release Code
Apache License 2.0

GPU memory usage #8

Open CeciLyu opened 2 years ago

CeciLyu commented 2 years ago

Hi OmegaFold Team,

Congrats on your great work! I am hoping to use OmegaFold to predict the structure of a relatively large protein (992 amino acids).

I have tried setting the subbatch_size to [2, 4, 8, 16, None], but the memory is not sufficient regardless. I have also tried different subbatch_size values on smaller proteins that do fit on my GPU, and the GRAM usage doesn't change no matter what subbatch_size I use.
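
For reference, the runs were invoked roughly like this (the FASTA path here is just a placeholder):

    omegafold --subbatch_size 4 my_protein.fasta output_dir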

I am using NVIDIA A100 40GB.

Could you take a look at why subbatch_size doesn't work?

Thanks, Suyue

RuiWang1998 commented 2 years ago

Hi,

Thanks for your interest and your question!

This is a crude version of subbatching and, as we said in the README, it is going to be updated very soon (a couple of days at most).

The issue here is that we have not yet sharded the computation down to the bare minimum, and there may also be problems somewhere along the line that cause the GRAM requirements to go through the roof.

As of now, we thank you for your patience; rest assured that we are working very hard on this issue. Until then, we will keep this issue open.

nope-sto commented 2 years ago

Same issue here - is there any possibility of using all available GPUs in parallel to work around the memory error? Thank you very much for your great efforts! Congrats

laoshaw commented 2 years ago

Same issue here. I have two A40s (A100 40GB class) and it reports running out of CUDA memory, so I had to use '--device cpu' instead to run OmegaFold.

On the other hand, when '--device cpu' is used, OmegaFold only leverages about a third of the cores I have (I have 128 AMD EPYC cores, but only about 44 are actively used), and of my 512GB of memory only about 120GB is used.

Can you provide a rough rule-of-thumb estimate of how long it runs (hours, days, weeks...) when OmegaFold is used this way?

Thanks!

RuiWang1998 commented 2 years ago

Hi all,

Within 48 hours we are going to update the code for memory efficiency, and within 72 hours we are going to provide an estimate of the RAM usage as well as the runtime w.r.t. sequence length!

laoshaw commented 2 years ago

So far on the A40 box, without using CUDA due to the memory issue mentioned here, with about 44 AMD EPYC cores running under 512GB (using about 120GB), it took 20 hours to finish one "loop", and my simple input has about 66 sequences, so it would take 55 days to finish. I don't think two A40s vs. 44 CPU cores should make such a large difference if "it takes a few minutes on an A40"; I would expect that with CPU as the device this could finish in a few hours?

laoshaw commented 2 years ago

> Hi all,
>
> Within 48 hours we are going to update the code for memory efficiency, and within 72 hours we are going to provide an estimate of the RAM usage as well as the runtime w.r.t. sequence length!

Any update? I have run it for 54 hours with around 44 CPU cores loaded and it's still running...

RuiWang1998 commented 2 years ago

Hi all,

Sorry for the delay. It turned out to be slightly more delicate than we expected. But it now runs very long sequences and is much more sensitive to --subbatch_size. We are planning to look into the issue more sometime down the line, but for now it seems good enough.

laoshaw commented 2 years ago

> Hi all,
>
> Sorry for the delay. It turned out to be slightly more delicate than we expected. But it now runs very long sequences and is much more sensitive to --subbatch_size. We are planning to look into the issue more sometime down the line, but for now it seems good enough.

Great, will try now. Typically, how do I decide the range of subbatch_size? Thanks!

laoshaw commented 2 years ago

It failed to run under a Python 3.10 venv environment:

omegafold brd4.fasta output

Traceback (most recent call last):
  File "/home/shawn/tmp/OmegaFold/.venv/bin/omegafold", line 33, in <module>
    sys.exit(load_entry_point('OmegaFold==0.0.0', 'console_scripts', 'omegafold')())
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/__main__.py", line 74, in main
    output = model(
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/model.py", line 175, in forward
    result, prev_dict = self.omega_fold_cycle(
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/model.py", line 89, in forward
    prev_node, edge_repr, node_repr = self.geoformer(
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/geoformer.py", line 175, in forward
    node_repr, edge_repr = block(
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/geoformer.py", line 122, in forward
    edge_repr += layer(edge_repr, mask[..., 0, :], fwd_cfg=fwd_cfg)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/modules.py", line 676, in forward
    out = self._get_attended(edge_repr, mask, fwd_cfg)
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/modules.py", line 580, in _get_attended
    for s, e, edge_r in self._get_sharded_stacked(
  File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/modules.py", line 609, in _get_sharded_stacked
    start, end = idx * subbatch_size, (idx + 1) * subbatch_size
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

RuiWang1998 commented 2 years ago

Hi @laoshaw,

Sorry for this; it seems we made --subbatch_size a required argument, which means you have to set --subbatch_size explicitly. We are working on a quick fix now.
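
In the meantime, the fix will likely amount to a guard along these lines (just a sketch, not the exact patch): when subbatch_size is not given, fall back to the full sequence length so the sharding loop degenerates to a single full-size pass.

    # Sketch only -- not the actual OmegaFold code.
    def shard_ranges(num_res, subbatch_size=None):
        # Treat a missing subbatch_size as "no sharding": one pass over everything.
        if subbatch_size is None:
            subbatch_size = num_res
        for idx in range((num_res + subbatch_size - 1) // subbatch_size):
            start, end = idx * subbatch_size, (idx + 1) * subbatch_size
            yield start, min(end, num_res)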

Best

laoshaw commented 2 years ago

Not an expert here, but still: how do I decide the range of subbatch_size? The default seems to be 1560 in my test case, which looks like a magic number.

laoshaw commented 2 years ago

Adding subbatch_size makes it run, and it needs less GPU memory now, but still not enough to run successfully.

omegafold --subbatch_size 1024 brd4.fasta output
INFO:root:Loading weights from /home/shawn/.cache/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading brd4.fasta
INFO:root:Predicting 1th chain in brd4.fasta
INFO:root:1560 residues in this chain.
INFO:root:Failed to generate output/sp|Q9NZM4|BICRA_HUMAN BRD4-interacting chromatin-remodeling complex-associated protein OS=Homo sapiens OX=9606 GN=BICRA PE=1 SV=2.pdb due to CUDA out of memory. Tried to allocate 48.75 GiB (GPU 0; 44.37 GiB total capacity; 20.37 GiB already allocated; 22.40 GiB free; 20.85 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!

laoshaw commented 2 years ago

In the previous version it was asking for 110+ GB of GPU memory across two A40s; now it is asking for 48GB from one A40, and the second A40 seems not to be in use at all.

RuiWang1998 commented 2 years ago

I guess just lower it.

For now we do not yet have a rule of thumb, but you can always halve the subbatch size each time you run into a memory bottleneck.

We did not set any default in previous runs, so the model would choose the sequence length as the subbatch_size, which is pretty large.
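
If you want to automate that halving, a small wrapper along these lines should work (just a sketch: it shells out to the omegafold CLI shown above, checks for the "CUDA out of memory" string that appears in the log, and the file names are placeholders):

    # Sketch only: halve --subbatch_size and retry until the run stops hitting OOM.
    import subprocess

    def run_with_halving(fasta="brd4.fasta", outdir="output", subbatch=1024):
        while subbatch >= 1:
            proc = subprocess.run(
                ["omegafold", "--subbatch_size", str(subbatch), fasta, outdir],
                capture_output=True, text=True,
            )
            # OmegaFold logs the OOM and keeps going, so check its output
            # rather than relying on the exit code.
            if "CUDA out of memory" not in proc.stdout + proc.stderr:
                return subbatch
            subbatch //= 2
        return None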

laoshaw commented 2 years ago

Cutting it in half worked. However, still only one GPU is in use; the other one is totally idle. It would be nice if all GPUs could be used in parallel; the previous version seemed to be able to do that? Thanks!

RuiWang1998 commented 2 years ago

We did not write the program to take advantage of two graphics cards, so we are not sure whether this is some special NVIDIA technology or what was happening before. Since we do not have A40s on our machines, we may not be able to reproduce the phenomenon just yet, but we'll look into it.

Full multi-GPU support, on the other hand, is on our roadmap, but it may take a while.

laoshaw commented 2 years ago

I see. At the moment one A40 and one CPU core are fully loaded, while the other A40 and the remaining 127 CPU cores are idle. I'm using a small brd4.fasta input; as a rule of thumb, how long will this run take: minutes, hours, days, or weeks? In the last few days I ran with the same input (all 44 CPU cores) and it never finished.

laoshaw commented 2 years ago

On closer inspection, the one busy CPU core uses about 6GB of memory while the GPU only uses 1% of its 30GB of memory (both the CPU and GPU are 100% loaded). I feel this could be a lengthy run again (e.g. days or weeks).

RuiWang1998 commented 2 years ago

We haven't had the time to really test the runtime with different subbatch sizes yet, but we will update the README soon.

However, the GPU memory usage should not be that low. Could you please clarify the state of your GPU? What does 100% loaded mean, and where does the 1% come from?

laoshaw commented 2 years ago

I use 'nvitop' for the A40s, a top-like Python tool that reports memory usage and such; it reports a 1.1% memory usage ratio. I will upload a screenshot below.

laoshaw commented 2 years ago

(screenshot of nvitop output for the two A40s)

RuiWang1998 commented 2 years ago

Isn't the second card completely idle? By the look of it, we are only using the first card, which looks just fine. The code should only be running on one card.

laoshaw commented 2 years ago

Yes, it's the 1.1% at the bottom that concerns me, but I agree it's inconsistent with what's reported for A40 #0 above.

RuiWang1998 commented 2 years ago

I see. The model should not take much CPU memory though, so I suppose that is expected. It aligns well with what we observe on our machine; that percentage is normal.

RuiWang1998 commented 2 years ago

But if you feel this is too slow, you could increase the subbatch_size a bit to fully utilize the GPU memory as well.

laoshaw commented 2 years ago

I tried other monitoring tools and they report nearly 100% memory usage on GPU 0, so it seems nvitop is at fault here. I will report how long this run takes when it completes.

laoshaw commented 2 years ago

It finished in 72 minutes, so it's a big improvement compared to the last release. Thanks!

sokrypton commented 2 years ago

Are there suggested subbatch sizes depending on the amount of memory available? Is this something that can be automatically configured?

RuiWang1998 commented 2 years ago

For now we just run the model, and if it hits the limit we halve the subbatch size. We are trying to add that functionality, but with PyTorch's GRAM reservation behavior it might be a bit complicated. We are trying to get some measurements ourselves, but it might take a couple of days.
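
To illustrate why the reservation behavior makes auto-configuration tricky, here is a quick probe (standard PyTorch calls, not an OmegaFold feature; requires a reasonably recent PyTorch) showing the different memory numbers an auto-tuner would have to reconcile:

    # Standard PyTorch memory introspection; not part of OmegaFold.
    import torch

    free, total = torch.cuda.mem_get_info()    # driver-level free/total bytes
    reserved = torch.cuda.memory_reserved()    # held by PyTorch's caching allocator
    allocated = torch.cuda.memory_allocated()  # actually backing live tensors
    print(f"free {free / 2**30:.1f} GiB of {total / 2**30:.1f} GiB, "
          f"reserved {reserved / 2**30:.1f} GiB, allocated {allocated / 2**30:.1f} GiB")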

ygmusg commented 1 year ago

One question here: how do I specify the GPU (by default, it always uses GPU 0) for OmegaFold?

wewewexiao2008 commented 1 year ago

> One question here: how do I specify the GPU (by default, it always uses GPU 0) for OmegaFold?

I do this by passing something like --device cuda:1.
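
For example (paths and the subbatch value are placeholders, following the command format used earlier in this thread):

    omegafold --device cuda:1 --subbatch_size 512 brd4.fasta output

Setting CUDA_VISIBLE_DEVICES=1 before the command should also work, since that is standard CUDA behavior rather than anything OmegaFold-specific.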