guolinke opened this issue 2 years ago
No, this is not expected. In our previous A100 experiments we observed single-example times of 6.5-7s for the 256 crop. I'll get back to you once this has been verified on a recent build.
Are you using DeepSpeed at all? What ZeRO stage are you using if so?
Also, how long did you run each model before recording times?
Have you made any other changes to the model config besides disabling the cache clearing option?
Here is the code for my test: https://github.com/guolinke/openfold/tree/guoke/test. The changeset is https://github.com/guolinke/openfold/commit/d36876319d745de2c6e921eb835fb750335778e3 and https://github.com/guolinke/openfold/commit/6fb440c1d6485bcd3b57837593f5a0600dda882d. For DeepSpeed, I still use it, but changed it to stage 0 with cpu_offload=false.
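For reference, a minimal sketch of a stage-0, no-offload DeepSpeed config of the kind described above, written as a Python dict; the exact contents of the `deepspeed_config.json` in the linked branch are assumptions based on DeepSpeed's documented config keys:

```python
# Hypothetical deepspeed_config.json contents, shown as a Python dict
# (DeepSpeed also accepts a dict directly via deepspeed.initialize(config=...)).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "fp16": {"enabled": True},   # matches --precision 16 below
    "zero_optimization": {
        "stage": 0,              # ZeRO partitioning disabled
        "cpu_offload": False,    # no optimizer offloading to CPU
    },
}
```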
To run the code, first gunzip `test_data.pickle.gz`, then run the training command: `python train_openfold.py . . . . 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --precision 16 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path deepspeed_config.json --gpus 1`
For the timing, I waited several iterations until it stabilized.
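As a side note on timing methodology, CUDA kernels launch asynchronously, so stable per-iteration numbers need explicit synchronization around the timed region. A minimal sketch (illustrative, not code from the linked branch):

```python
import time
import torch

def time_iteration(step_fn):
    # CUDA ops are asynchronous; synchronize before reading the clock
    torch.cuda.synchronize()
    start = time.perf_counter()
    step_fn()  # one full training iteration
    torch.cuda.synchronize()
    return time.perf_counter() - start
```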
A few comments/questions:

1. You don't need the `--replace_sampler_ddp` flag if you're just using 1 GPU.
2. Try running with the `--benchmark` flag enabled (a sketch of what this flag does follows this comment).

In any case, it's also possible that OpenFold's performance might have been affected by a recent change. Again, I'll be repeating our runtime tests on A100s ASAP (that might take a few days, though).
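For context, Lightning's `--benchmark` flag toggles cuDNN autotuning; the plain-PyTorch equivalent is:

```python
import torch

# Trainer(benchmark=True) in PyTorch Lightning boils down to this flag, which
# lets cuDNN benchmark and cache the fastest algorithm per layer shape
# (a win when input shapes are static, as with fixed-size crops).
torch.backends.cudnn.benchmark = True
```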
Thank you.
Continuing the discussion on previous points:
(N.B. - I added 7. and 8. straight to my previous reply, possibly after you already responded. Sorry about that!)
Here's a datapoint in the meantime. Using the right-out-of-the-box settings from the same commit (c4d9f57), with the real dataloader, the slow cache clearing, DeepSpeed stage 2, CPU offloading, and the slow TorchScripting (so basically the worst-case scenario), I ran
`python3 train_openfold.py data/ alignments/ /data/ga122/alphafold/pdb_mmcif/mmcif_files/ train_op_16 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --gpus 1 --replace_sampler_ddp=True --seed 44 --default_root_dir train_op_16 --deepspeed_config deepspeed_config.json --precision 16`
on 1 consumer-grade 2080 Ti. After 13 iterations, I got:
It makes me think that something might be wrong with your torch/CUDA installation. I'm not sure.
I use the Docker image `mmdog/pytorch:pytorch1.10.0-cuda11.3` to run it.
I think I found the problem: with my dummy data, OpenFold fixes the recycling number at 4 (3 no_grad + 1 grad), while Uni-Fold randomly samples from [0, 3] + 1. So I ran Uni-Fold with a fixed 3+1 recycling number again.
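A sketch of the difference being described here (function names are illustrative only):

```python
import random

def openfold_dummy_recycling():
    # dummy-data path: always 3 no_grad recycling passes + 1 pass with grad
    return 3 + 1

def unifold_recycling(rng=random):
    # Uni-Fold: sample the no_grad passes uniformly from [0, 3], then add
    # the final pass with grad
    return rng.randint(0, 3) + 1
```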
The updated result:
| | FP32 | FP16 |
|---|---|---|
| OpenFold | 22.5 s | 16 s |
| Uni-Fold | 18.44 s | 12 s |
The result is much closer now.
BTW, I also updated the comment above (https://github.com/aqlaboratory/openfold/issues/34#issuecomment-997321701). I will try disabling EMA and the other suggestions later.
- `--benchmark` is almost the same speed
- `contiguous_gradients` is almost the same speed

Hm. I'll try to think of more discrepancies. I think there still have to be more; even if the 6.5-7s A100 time doesn't pan out, we shouldn't be getting essentially the same times on the A100 and 2080 Ti, especially considering the optimizations you've made.
With uniform random recycling [1, 4], the fp16 speed is about 11.9 s for OpenFold. I am trying to create real data for testing, but the download and preprocessing are very slow. It would be great if you could share a small toy dataset for the test, like the one you used in the screenshot above.
Yeah no problem. How best can I get it to you?
Thank you. Whichever way is convenient for you, e.g. Google Drive or Dropbox. My email is guolin.ke@outlook.com
Gently pinging @gahdritz about the data sharing.
Sent.
Our A100 results were obtained using the following:
- CUDA Driver 465.19.01
- CUDA 11.3 Update 1 (11.3.1.005)
- cuBLAS 11.5.1.109 (part of CUDA 11.3 U1)
- CUDNN 8.2.1.32
- NCCL 2.9.9
- PyTorch 1.9.0a0+c3d40fd
and with cache clearing disabled (but using the real dataloader).
Thank you very much! I received the data. It seems the data doesn't include the template part (`template_mmcif_dir` and `mmcif_cache.json`); are those not needed?
The mmcif cache isn't required, but the template mmCIFs are. I'll send those over now.
Sent.
Have you tried running it with bfloat16? Doesn't seem to be working.
File "openfold/openfold/utils/loss.py", line 46, in sigmoid_cross_entropy log_p = torch.nn.functional.logsigmoid(logits) RuntimeError: "log_sigmoid_forward_cuda" not implemented for 'BFloat16'
I'm also a bit surprised that the model parameter size is not adjusted. It should be half the size, same as with fp16, right?
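One possible workaround for the missing BFloat16 kernel, sketched here under the assumption that the loss can simply be computed in float32 (this is not necessarily how OpenFold's `sigmoid_cross_entropy` was actually fixed):

```python
import torch

def sigmoid_cross_entropy(logits, labels):
    # log_sigmoid has no BFloat16 CUDA kernel on this torch build, so upcast
    # to float32 for the loss math and cast the result back afterwards
    logits_32 = logits.float()
    log_p = torch.nn.functional.logsigmoid(logits_32)
    log_not_p = torch.nn.functional.logsigmoid(-logits_32)
    loss = -labels * log_p - (1.0 - labels) * log_not_p
    return loss.to(logits.dtype)
```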
Yes, we have tested bfloat16, and it's a lot better than fp16, but you'll need PyTorch 1.10 for that. The test I referenced previously used fp16.
Strange, I'm already on torch 1.10.1+cu113. Better in terms of what?
You won't NaN anymore.
Have you updated your DeepSpeed config for bf16 training?
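On the "no more NaNs" point: bfloat16 keeps float32's 8-bit exponent, so activations that overflow fp16's ~65504 ceiling remain finite, just at coarser precision. A quick illustration:

```python
import torch

x = torch.tensor(1e5)
print(x.to(torch.float16))   # inf: exceeds fp16's max finite value (~65504)
print(x.to(torch.bfloat16))  # 99840.: coarse, but finite
```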
I'm not using DeepSpeed in this experiment; I just switched on `precision="bf16"` in PyTorch Lightning.
Hm. Could you test it with DeepSpeed one time? That's what our test used. I'd repeat the test without DeepSpeed myself, but the A100s we've been using are borrowed and not currently accessible.
It works, but OOMs, which I believe it doesn't with FP16. Re-running the latter now. That's why I was wondering about the parameter size.
That's kind of weird. How much memory do you have on your A100s?
40GB. Single batch. I now cap validation targets at 700 AA, which did the trick.
Just 700? That's very odd. Is grad being enabled for validation runs or something?
I didn't really test it thoroughly; it can probably be a bit larger (I tested it on a v100s with 32GB). There were some 1k+ AA targets in the set beforehand. Not sure about grad being enabled. I was wondering the same, but manually switching to no_grad didn't do anything, and it's much faster compared to training on crops.
Actually, on second thought, it's not very weird that really long validation proteins should fail: chunking isn't enabled by default during validation, so you'll get much worse memory performance than during inference.
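If memory can be traded for speed, chunking can also be switched on manually. Assuming the config layout in OpenFold's `openfold/config.py` (the attribute path and the chosen value here are assumptions), something like:

```python
from openfold.config import model_config

# Enable chunked computation of the large attention/triangle ops, trading
# speed for peak memory; smaller chunk_size => lower memory, slower pass.
config = model_config("model_2")
config.globals.chunk_size = 4
```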
No OOM with FP16...
Did you actually mean v100s, or was that a typo? v100s don't have bfloat16 support.
I also tried BF16 with DeepSpeed; the speed/memory cost is almost the same as FP32, so I don't think it is enabled.
@gahdritz with the real data, I found that `num_iters` is always 1, so the speed is much faster. Is that expected?
> Did you actually mean v100s, or was that a typo? v100s don't have bfloat16 support.
No typo, I have both v100s and A100. Ran the bfloat16 experiment on the A100.
@lhatsk would you mind moving this bfloat16 stuff into a new issue?
@guolinke No, that's not the case (let it run a little longer first). However, it is slightly bugged in that each DataLoader worker is currently initialized with the same seed, so you'll see values of the number of recycling iterations occur in groups of however many DataLoader workers you're using (so for me, with num_workers=8, you'd expect to see 8 1's approximately in a row, then 8 3's approximately in a row, and so on). For long enough tests, this shouldn't substantially affect the runtime (and for my previous test, I was using the seed 44, so for most of the first 13 iterations the number of recycling iterations was stuck at 2, so if anything, that was an overestimate).
E.g., with the seed 102238 and num_workers=4, I get
The numbers printed after each iteration are multiple copies of the number of recycling iterations for that iteration.
The bug is difficult to fix, and it's Christmas, so I'll get back to you with this on Monday. Happy holidays!
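For reference, the generic PyTorch recipe for de-correlating per-worker RNG streams looks like the sketch below, with a toy stand-in dataset (the actual fix in OpenFold, described in the next comment, involved extra data processing instead):

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, Dataset

class ToyDataset(Dataset):
    # stand-in for a dataset that samples a recycling count per example
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return np.random.randint(0, 4) + 1  # no_grad passes in [0, 3], +1 with grad

def worker_init_fn(worker_id):
    # torch.initial_seed() differs per worker (base_seed + worker_id), so
    # seeding numpy from it keeps the workers' sampling streams distinct
    np.random.seed(torch.initial_seed() % 2**32)

if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), num_workers=4, worker_init_fn=worker_init_fn)
    print([int(n) for n in loader])
```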
BTW @guolinke the recycling number bug is now fixed. The fix requires a little bit of extra data processing, and so it comes with a performance penalty of about half a second. I'm trying to think of ways to improve it.
Thank you @gahdritz , I will test it. Happy holidays!
Hi @gahdritz, thank you for the fix. I confirm that the `num_iters` problem is fixed. However, I've hit another problem: training randomly gets stuck (GPU usage 0%, frozen) with the data you provided. I also tried the latest main branch, without any changes, and hit the same problem.
Error log when pressing CTRL+C:
^CException ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f0fea0b6430>
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
self._shutdown_workers()
File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1301, in _shutdown_workers
w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 149, in join
res = self._popen.wait(timeout)
File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
if not wait([self.sentinel], timeout):
File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 931, in wait
ready = selector.select(timeout)
File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
fd_event_list = self._selector.poll(timeout)
This has never happened to me. Something to raise with torch devs maybe? Maybe try adding more DataLoader workers?
I think this problem was introduced by recent changes; I didn't encounter it before merging the main branch, either.
Could you pinpoint the commit?
Sorry, I was wrong; it is not introduced by recent commits. I used `git reset --hard` to track down the commit and found the problem exists before the date I cloned OpenFold. I also tried different pytorch-lightning versions, different `num_workers`, and with/without DeepSpeed; none of them worked. And when using dummy data, everything is fine.
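One generic way to find out where a frozen run is blocked, using only the standard library (a debugging aid, not the resolution from this thread):

```python
import faulthandler
import signal

# Dump the Python stack of every thread when the process receives SIGUSR1
# (Unix only). While the run is frozen, send: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1)
```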
Very strange. The only thing I can think to say is that I've tested OpenFold with both Python 3.7 and Python 3.9, never Python 3.8.
@gahdritz could you please share how you are currently using Python 3.9? I see only Python 3.7 listed as the supported Python version. I created this issue asking this specific question: https://github.com/aqlaboratory/openfold/issues/265
Device: 1 A100 with 40GB memory. CUDA: 11.3.

Compared with https://github.com/dptech-corp/Uni-Fold, using the `model_2` setting and the same data (only one sample, using `DummyDataLoader` in OpenFold). Following this issue, https://github.com/aqlaboratory/openfold/issues/19, I disabled `clear_cache_between_blocks` and DeepSpeed CPU offload. The commit I used is https://github.com/aqlaboratory/openfold/commit/c4d9f57f9005f3e9e0325eff97b8232e328b4813

Is that expected? Any tricks I can use to get a further speed-up?
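For reference, a sketch of how the cache-clearing option mentioned above might be disabled through the config, assuming the layout of `openfold/config.py` around that commit (the exact attribute path is an assumption):

```python
from openfold.config import model_config

config = model_config("model_2")
# Skip the torch.cuda.empty_cache() calls between Evoformer blocks; see
# https://github.com/aqlaboratory/openfold/issues/19 for the trade-off.
config.model.evoformer_stack.clear_cache_between_blocks = False
```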