CeciLyu opened this issue 2 years ago
Hi,
Thanks for your interest and your question!
This is a crude version of subbatching, and as we said in the README, it is going to be updated very soon (a couple of days at most).
The issue here is that we have not yet sharded the computation down to the bare minimum, and somewhere along the line there may also be problems that cause the GRAM requirements to go through the roof.
For now, we thank you for your patience; rest assured that we are working very hard on this issue. Until then, we will keep this issue open.
Same issue here - is there any way to use all available GPUs in parallel to work around the memory error? Thank you very much for your great efforts! Congrats!
Same issue here. I have two A40s (48 GB each), and OmegaFold reported a CUDA out-of-memory error, so I had to run it with '--device cpu' instead.
On the other hand, when '--device cpu' is used, OmegaFold only leverages about a third of the cores I have (I have 128 AMD EPYC cores, but only about 44 are actively used; I also have 512 GB of memory, of which only about 120 GB is used).
Can you provide a rough rule-of-thumb estimate of how long a run takes (hours, days, weeks...)?
Thanks!
Hi all,
Within 48 hours we are going to update the code for memory efficiency, and within 72 hours we will provide an estimate of RAM usage and runtime with respect to sequence length!
So far on the A40 machine (without CUDA, due to the memory issue mentioned here), with about 44 AMD EPYC cores running under 512 GB (using about 120 GB), it took 20 hours to finish one "loop", and my simple input has about 66 sequences, so it would take 55 days to finish. I don't think two A40s vs. 44 CPU cores should make such a large difference if "it takes a few minutes on an A40"; with CPU as the device, I would expect this to finish in a few hours.
Any update? I have been running it for 54 hours with around 44 CPUs loaded and it's still running...
Hi all,
Sorry for the delay. It turned out to be slightly more delicate than we expected. But the code now runs very long sequences, with memory usage much more sensitive to --subbatch_size. We are planning to look into the issue more sometime down the line, but for now it seems good enough.
Great, will try now. Typically, how should one decide the range of subbatch_size? Thanks!
It failed to run under a Python 3.10 venv environment:
omegafold brd4.fasta output
Traceback (most recent call last):
File "/home/shawn/tmp/OmegaFold/.venv/bin/omegafold", line 33, in <module>
sys.exit(load_entry_point('OmegaFold==0.0.0', 'console_scripts', 'omegafold')())
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
return func(*args, **kwargs)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/__main__.py", line 74, in main
output = model(
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/model.py", line 175, in forward
result, prev_dict = self.omega_fold_cycle(
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/model.py", line 89, in forward
prev_node, edge_repr, node_repr = self.geoformer(
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/geoformer.py", line 175, in forward
node_repr, edge_repr = block(
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/geoformer.py", line 122, in forward
edge_repr += layer(edge_repr, mask[..., 0, :], fwd_cfg=fwd_cfg)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1130, in _call_impl
return forward_call(*input, **kwargs)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/modules.py", line 676, in forward
out = self._get_attended(edge_repr, mask, fwd_cfg)
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/modules.py", line 580, in _get_attended
for s, e, edge_r in self._get_sharded_stacked(
File "/home/shawn/tmp/OmegaFold/.venv/lib/python3.10/site-packages/omegafold/modules.py", line 609, in _get_sharded_stacked
start, end = idx * subbatch_size, (idx + 1) * subbatch_size
TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'
Hi @laoshaw,
Sorry for this. We seem to have made --subbatch_size a required argument, which means you have to set --subbatch_size explicitly. We are working on a quick fix now.
Best
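For anyone hitting this before the fix lands: the crash above comes from multiplying an integer index by a `None` subbatch size in the shard-bounds computation. Below is a minimal sketch of the kind of guard such a fix needs; the function and parameter names are illustrative, not OmegaFold's actual internals. It defaults a missing subbatch size to the full sequence length, which matches the old no-sharding behavior.

```python
def shard_bounds(idx: int, subbatch_size, seq_len: int):
    """Compute the [start, end) residue range for shard number `idx`.

    If no subbatch size was given (None), fall back to the full
    sequence length -- one shard covering everything -- instead of
    crashing on `idx * None` as in the traceback above.
    """
    if subbatch_size is None:
        subbatch_size = seq_len
    start = idx * subbatch_size
    end = min((idx + 1) * subbatch_size, seq_len)  # clamp last shard
    return start, end
```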
Not an expert here, but still: how does one decide the range of subbatch_size? The default seems to be 1560, a magic number in my test case.
Adding --subbatch_size makes it run, and it needs less GPU memory now, but still not little enough to run successfully.
omegafold --subbatch_size 1024 brd4.fasta output
INFO:root:Loading weights from /home/shawn/.cache/omegafold_ckpt/model.pt
INFO:root:Constructing OmegaFold
INFO:root:Reading brd4.fasta
INFO:root:Predicting 1th chain in brd4.fasta
INFO:root:1560 residues in this chain.
INFO:root:Failed to generate output/sp|Q9NZM4|BICRA_HUMAN BRD4-interacting chromatin-remodeling complex-associated protein OS=Homo sapiens OX=9606 GN=BICRA PE=1 SV=2.pdb due to CUDA out of memory. Tried to allocate 48.75 GiB (GPU 0; 44.37 GiB total capacity; 20.37 GiB already allocated; 22.40 GiB free; 20.85 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
INFO:root:Skipping...
INFO:root:Done!
In the previous version it was asking for 110+ GB of GPU memory across two A40s; now it is asking for 48 GB on one A40, and the second A40 seems not to be used at all.
Guess just lower it.
For now we do not yet have a rule of thumb, but you can always halve the subbatch size each time it hits a memory bottleneck.
We did not set any default in previous runs, so the model chooses the sequence length as the subbatch_size, which is pretty large.
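The halve-on-failure strategy described above can be sketched as a simple retry loop. This is not OmegaFold's actual code: `run_inference` is a stand-in for the model's forward pass, and in real PyTorch code you would catch the CUDA out-of-memory exception (a `RuntimeError` whose message contains "out of memory" in older releases) rather than the plain `MemoryError` used here to keep the sketch self-contained.

```python
def predict_with_halving(run_inference, seq_len, min_size=16):
    """Retry an inference call, halving the subbatch size on each OOM.

    `run_inference(subbatch_size)` stands in for the real forward
    pass; it should raise MemoryError (in real code: the CUDA OOM
    error) when a shard of that size does not fit in GPU memory.
    """
    subbatch = seq_len  # old default: the full sequence, no sharding
    while subbatch >= min_size:
        try:
            return subbatch, run_inference(subbatch)
        except MemoryError:
            subbatch //= 2  # halve and retry, as suggested above
    raise MemoryError("even the smallest subbatch did not fit")
```

For example, with a 1560-residue chain and a GPU that can only handle shards of up to 512 residues, the loop tries 1560, then 780, and succeeds at 390.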
Cutting it in half worked. However, still only one GPU is in use; the other one is totally idle. It would be nice if all GPUs could be used in parallel; the previous version seemed able to do that? Thanks!
We did not write the program to take advantage of two graphics cards, so we are not sure whether this is some special Nvidia technology and what happened before. We do not have A40s in our machines, so we may not be able to reproduce the phenomenon just yet, but we'll look into it.
Full multi-GPU support, on the other hand, is on our roadmap, but may take a while.
I see. At the moment one A40 and one CPU core are fully loaded, while the other A40 and 127 cores are idle. I'm using a small brd4.fasta input; as a rule of thumb, how long will this run take: minutes, hours, days, or weeks? In the last few days I ran the same input and it never finished (on all 44 CPU cores).
On closer look, the one CPU core uses about 6 GB of memory while the GPU only uses 1% of its 30 GB of memory (both CPU and GPU are 100% loaded), so I worry this could be a lengthy run again (e.g. days or weeks).
We haven't yet had the time to really test the runtime with different subbatches, but will update the README soon.
However, the GPU memory usage should not be that low. Could you please clarify the state of your GPU? What does 100% loaded mean, and where does the 1% come from?
I use 'nvitop' for the A40s, a top-like Python tool that reports memory usage and such; it reports a 1.1% memory-usage ratio. I will upload a screenshot below.
Isn't the second card completely idle? By the look of it, we are only using the first card, which looks just fine. The code should only be running on one card.
Yes, it's the 1.1% at the bottom that concerns me, but I agree it's inconsistent with what's reported for A40 #0 above.
I see. The model should not take much CPU memory, so I suppose that is expected. It aligns well with our observations on our machine; that percentage is normal.
But if you feel this is too slow, you could increase the subbatch_size a bit to utilize more of the GPU memory.
I tried other monitoring tools and they report near-100% memory usage on GPU 0, so it seems nvitop is at fault here. I will report how long the run takes when it completes.
It finished in 72 minutes, so it's a big improvement compared to the last release. Thanks!
Are there suggested subbatch sizes depending on the amount of memory available? Is this something that can be automatically configured?
For now we just run the model, and if it hits the limit we halve the subbatch size. We are trying to add that functionality automatically, but with PyTorch's GRAM reservation behavior it might be a bit complicated. We are trying to get some measurements ourselves, but it might take a couple of days.
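Related to the reservation behavior mentioned above: the out-of-memory log earlier in this thread already hints at one knob. When reserved memory is much larger than allocated memory, PyTorch's caching allocator may be fragmenting, and the error message itself suggests setting max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF. A hedged example (the value 128 and the subbatch size 512 are just starting points, not tuned recommendations):

```shell
# Reduce caching-allocator fragmentation before retrying a smaller
# subbatch; max_split_size_mb caps the block size the allocator will
# split. 128 is an illustrative starting value, not a tuned one.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
omegafold --subbatch_size 512 brd4.fasta output
```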
One question here: how to specify the GPU ( by default, it always use GPU 0) for OmegaFold?
I do this by setting --device cuda:1.
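Two ways to pin the GPU, assuming OmegaFold accepts a standard PyTorch device string as the comment above indicates; the second relies on CUDA's own CUDA_VISIBLE_DEVICES mechanism rather than anything OmegaFold-specific:

```shell
# Option 1: ask for the second card directly via the device flag
omegafold --device cuda:1 brd4.fasta output

# Option 2: expose only physical GPU 1 to the process; inside the
# process that card then appears as cuda:0
CUDA_VISIBLE_DEVICES=1 omegafold brd4.fasta output
```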
Hi OmegaFold Team,
Congrats on your great work! I am hoping to use OmegaFold to predict the structure of a relatively large protein (992 amino acids).
I have tried setting the subbatch_size to [2, 4, 8, 16, None], and memory is insufficient regardless. I have also tried different subbatch_sizes on smaller proteins that do fit in my GPU, and the GRAM usage doesn't change no matter what subbatch_size I use.
I am using NVIDIA A100 40GB.
Could you take a look at why subbatch_size doesn't work?
Thanks, Suyue