aqlaboratory / openfold

Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
Apache License 2.0

training speed is about 2x slower than JAX trainable version (Uni-Fold) #34

Open guolinke opened 2 years ago

guolinke commented 2 years ago

Device: 1 A100 with 40 GB memory; CUDA 11.3. Compared with https://github.com/dptech-corp/Uni-Fold, using the model_2 setting and the same data (only one sample, via DummyDataLoader in OpenFold).

Following https://github.com/aqlaboratory/openfold/issues/19, I disabled clear_cache_between_blocks and DeepSpeed CPU offloading. The commit I used is https://github.com/aqlaboratory/openfold/commit/c4d9f57f9005f3e9e0325eff97b8232e328b4813

Speed per example:

|          | FP32    | FP16  |
|----------|---------|-------|
| OpenFold | 24.5 s  | 17 s  |
| Uni-Fold | 13.25 s | 8.9 s |

Is that expected? Are there any tricks I can use to get a further speed-up?
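For context, a minimal sketch of how per-example step times like the ones in the table can be measured (a plain PyTorch loop; `model`, `batch`, `optimizer`, and `loss_fn` are placeholders, not OpenFold's actual interfaces):

```python
import time
import torch

def time_training_step(model, batch, optimizer, loss_fn, n_warmup=5, n_iters=20):
    times = []
    for i in range(n_warmup + n_iters):
        torch.cuda.synchronize()
        start = time.perf_counter()

        optimizer.zero_grad()
        loss = loss_fn(model(batch), batch)  # placeholder forward pass + loss
        loss.backward()
        optimizer.step()

        torch.cuda.synchronize()  # wait for queued CUDA kernels before stopping the clock
        if i >= n_warmup:
            times.append(time.perf_counter() - start)
    return sum(times) / len(times)  # average seconds per training step
```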

gahdritz commented 2 years ago

No, this is not expected. In our previous A100 experiments we observed single-example times of 6.5-7s for the 256 crop. I'll get back to you once this has been verified on a recent build.

Are you using DeepSpeed at all? What ZeRO stage are you using if so?

Also, how long are you running each model before you recorded times?

Have you made any other changes to the model config besides disabling the cache clearing option?

guolinke commented 2 years ago

Here is the code for my test: https://github.com/guolinke/openfold/tree/guoke/test. The changeset is https://github.com/guolinke/openfold/commit/d36876319d745de2c6e921eb835fb750335778e3 and https://github.com/guolinke/openfold/commit/6fb440c1d6485bcd3b57837593f5a0600dda882d. For DeepSpeed, I still use it, but with stage 0 and cpu_offload=false.

To run the code, first gunzip test_data.pickle.gz, then run the training command: python train_openfold.py . . . . 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --precision 16 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path deepspeed_config.json --gpus 1

For the timing, I wait several iterations until it stabilizes.
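For reference, a sketch of the kind of deepspeed_config.json described above, written out as a Python dict (ZeRO stage 0, no CPU offload, fp16). The exact keys depend on the DeepSpeed version, so treat this as an illustration rather than the actual config used here:

```python
import json

deepspeed_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 0,
        "cpu_offload": False,  # older-style key for ZeRO CPU offloading
    },
}

with open("deepspeed_config.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)
```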

[screenshot of the per-iteration timing output]
gahdritz commented 2 years ago

A few comments/questions:

  1. I don't believe that the sample batch used in the unit tests is well-suited for tests like this. I believe it uses a much smaller crop size, and it may be out of date in certain respects. If possible, I'd generate a real batch using the actual dataloader, optionally pickling it for subsequent tests (see the sketch after this list).
  2. Removing the copy.deepcopy() from the DummyDataLoader will likely have unintended consequences, since grad is enabled for several of the tensors inserted therein. I'd put it back and try the same test again.
  3. Since this test is being run on just one GPU, I'd try just getting rid of DeepSpeed altogether.
  4. Is Uni-Fold definitely doing all of the same stuff as OpenFold (which hews pretty close to the original supplement/source code)? Does it maintain, for example, an EMA of the model parameters? Is it doing the same number of recycling iterations? Is it checkpointing in all of the same places? And so on.
  5. Try disabling the line in the training script that TorchScripts components of the model (the call to _script_preset seems to degrade performance on occasion).
  6. I'm pretty sure this has no effect on performance, but there's no need for the --replace-sampler-ddp flag if you're just using 1 GPU.
  7. Maybe try running training with the --benchmark flag enabled.
  8. 13 iterations might be too few for the runtime to stabilize. It usually takes a lot longer for me, but, granted, most of my testing is done on 2080 Ti's.
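A minimal sketch of point 1, generating and pickling one real batch for reuse in later benchmark runs (`train_dataloader` is a placeholder for however the real OpenFold DataLoader is constructed; this is not an existing OpenFold helper):

```python
import pickle

def dump_one_batch(train_dataloader, path="real_batch.pickle"):
    batch = next(iter(train_dataloader))
    # Detach and move to CPU so no autograd graph or device state is pickled;
    # assumes every value in the batch dict is a tensor.
    batch = {k: v.detach().cpu() for k, v in batch.items()}
    with open(path, "wb") as f:
        pickle.dump(batch, f)

def load_batch(path="real_batch.pickle"):
    with open(path, "rb") as f:
        return pickle.load(f)
```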

In any case, it's also possible that OpenFold's performance might have been affected by a recent change. Again, I'll be repeating our runtime tests on A100s ASAP (that might take a few days, though).

guolinke commented 2 years ago

Thank you.

  1. I wanted to exclude the effect of the data loader, so I created dummy data for the benchmark. BTW, this data isn't from tests/test_data; I created it myself, and its crop size is 256.
  2. Just tried; the speed is the same.
  3. Removing DeepSpeed is slower (22 s for fp16). I think DeepSpeed's fused_adam may explain the difference (see the sketch after this list).
  4. I am not sure about that. I think the recycling number may be the cause; I will make it the same and fixed. For the other factors, do you think they would cause such a large speed difference? Update: I just checked https://github.com/dptech-corp/Uni-Fold/blob/8cc7bcedc23efd53a2f0b11de6657c61ac9204f5/unifold/model/modules.py#L481 (search for hk.remat in this file), and it seems activation checkpointing is placed in the same locations.
  5. Removing it is indeed faster, about 1 s faster for fp16.
  6. Just tried; almost the same speed.
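A sketch of the optimizer swap behind point 3 (assumes DeepSpeed is installed; hyperparameters are placeholders): DeepSpeed's FusedAdam fuses the parameter update into fewer CUDA kernels than torch.optim.Adam, which could account for part of the gap when DeepSpeed is removed.

```python
import torch
from deepspeed.ops.adam import FusedAdam  # assumes DeepSpeed is installed

def build_optimizer(model: torch.nn.Module, use_fused: bool = True):
    # Same Adam hyperparameters either way; only the kernel implementation differs.
    if use_fused:
        return FusedAdam(model.parameters(), lr=1e-3, eps=1e-8)
    return torch.optim.Adam(model.parameters(), lr=1e-3, eps=1e-8)
```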
gahdritz commented 2 years ago
  1. Try disabling contiguous_gradients in DeepSpeed.

Continuing the discussion on previous points:

  1. Since the crop size of the sample batch isn't right, I still think it's important to use a fresh one. If you want to discount the runtime of the dataloader, you can pickle it and then rerun the tests using the original DummyDataLoader.
  3. I'm surprised to see such a big DeepSpeed performance difference---I've never seen a difference of more than a second or two. Quite odd.
  4. Certainly, the placement and number of activation checkpoints, the number of recycling iterations, and so on can affect the training iteration runtime.

(N.B. - I added 7. and 8. straight to my previous reply, possibly after you already responded. Sorry about that!)

gahdritz commented 2 years ago

Here's a datapoint in the meantime. Using the right-out-of-the-box setting from the same commit (c4d9f57), with the real dataloader, the slow cache clearing, DeepSpeed stage 2, CPU offloading, and the slow TorchScripting (so basically the worst-case scenario), I ran

python3 train_openfold.py data/ alignments/ /data/ga122/alphafold/pdb_mmcif/mmcif_files/ train_op_16 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --gpus 1 --replace_sampler_ddp=True --seed 44 --default_root_dir train_op_16 --deepspeed_config deepspeed_config.json --precision 16

on 1 consumer-grade 2080 Ti. After 13 iterations, I got:

[screenshot of the per-iteration times after 13 iterations]

It makes me think something might be wrong with your torch/CUDA installation. I'm not sure.

guolinke commented 2 years ago

I use the Docker image mmdog/pytorch:pytorch1.10.0-cuda11.3 to run. I think I found the problem: with my dummy data, OpenFold fixes the recycling number at 4 (3 no_grad + 1 grad), while Uni-Fold samples it randomly from [0, 3] + 1. So I ran Uni-Fold again with the recycling number fixed at 3 + 1.

The updated results:

|          | FP32    | FP16 |
|----------|---------|------|
| OpenFold | 22.5 s  | 16 s |
| Uni-Fold | 18.44 s | 12 s |

The result is much closer now.

BTW, I also updated the comment above (https://github.com/aqlaboratory/openfold/issues/34#issuecomment-997321701). I will try disabling EMA and the other suggestions later.
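To spell out the difference (an illustration only, not OpenFold's or Uni-Fold's actual code): a fixed schedule always runs the maximum number of no-grad recycling passes, while uniform sampling runs roughly half as many on average, which alone accounts for a large runtime gap.

```python
import random

MAX_NO_GRAD_RECYCLES = 3  # plus one final recycling pass with grad in both cases

def num_no_grad_recycles(fixed: bool) -> int:
    if fixed:
        return MAX_NO_GRAD_RECYCLES                  # always 3 + 1, as with the dummy data
    return random.randint(0, MAX_NO_GRAD_RECYCLES)   # uniform over [0, 3], then + 1
```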

guolinke commented 2 years ago
  1. Disabling EMA is slightly faster, by about 0.1 s.
  2. --benchmark gives almost the same speed.
  3. Disabling contiguous_gradients gives almost the same speed.
gahdritz commented 2 years ago

Hm. I'll try to think of more discrepancies. I think there still have to be more; even if the 6.5-7s A100 time doesn't pan out, we shouldn't be getting essentially the same times on the A100 and 2080 Ti, especially considering the optimizations you've made.

guolinke commented 2 years ago

With uniform random recycling over [1, 4], the fp16 speed is about 11.9 s for OpenFold. I am trying to create real data for testing, but the download and preprocessing are very slow. It would be great if you could share a small toy dataset with me, like the one you used in the screenshot above.

gahdritz commented 2 years ago

Yeah no problem. How best can I get it to you?

guolinke commented 2 years ago

Thank you, whichever way is convenient for you, e.g. Google Drive or Dropbox. My email is guolin.ke@outlook.com.

guolinke commented 2 years ago

Gently pinging @gahdritz about sharing the data.

gahdritz commented 2 years ago

Sent.

Our A100 results were obtained using the following:

- CUDA Driver 465.19.01
- CUDA 11.3 Update 1 (11.3.1.005)
- cuBLAS 11.5.1.109 (part of CUDA 11.3 U1)
- cuDNN 8.2.1.32
- NCCL 2.9.9
- PyTorch 1.9.0a0+c3d40fd

and with cache clearing disabled (but using the real dataloader).

guolinke commented 2 years ago

Thank you very much! I received the data. It seems it doesn't include the template part (template_mmcif_dir and mmcif_cache.json); are those not needed?

gahdritz commented 2 years ago

The mmcif cache isn't required, but the template mmCIFs are. I'll send those over now.

lhatsk commented 2 years ago

Have you tried running it with bfloat16? Doesn't seem to be working.

File "openfold/openfold/utils/loss.py", line 46, in sigmoid_cross_entropy
    log_p = torch.nn.functional.logsigmoid(logits)
RuntimeError: "log_sigmoid_forward_cuda" not implemented for 'BFloat16'

I'm also a bit surprised that the model parameter size is not adjusted. Shouldn't it be half the size, same as with fp16?
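For reference, a sketch of one possible workaround for the logsigmoid error above (not necessarily what OpenFold does, and at the cost of a couple of extra casts): upcast the logits to float32 for the numerically sensitive op and cast the loss back afterwards. The loss body here is the standard sigmoid cross-entropy formulation, not copied from OpenFold.

```python
import torch

def sigmoid_cross_entropy(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # Compute in float32 because log_sigmoid has no BFloat16 CUDA kernel in older torch.
    logits_32 = logits.to(torch.float32)
    log_p = torch.nn.functional.logsigmoid(logits_32)
    log_not_p = torch.nn.functional.logsigmoid(-logits_32)
    loss = -labels * log_p - (1.0 - labels) * log_not_p
    return loss.to(logits.dtype)
```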

gahdritz commented 2 years ago

Yes, we have tested bfloat16, and it's a lot better than fp16, but you'll need PyTorch 1.10 for that. The test I referenced previously used fp16.

lhatsk commented 2 years ago

Strange, I'm already on torch 1.10.1+cu113. Better in terms of what?

gahdritz commented 2 years ago

You won't get NaNs anymore.

Have you updated your DeepSpeed config for bf16 training?
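For reference, a sketch of what enabling bf16 in a DeepSpeed config can look like, written as a Python dict. The exact key has varied across DeepSpeed versions (newer releases use "bf16", some older ones "bfloat16"), so treat this as an illustration rather than OpenFold's actual deepspeed_config.json:

```python
# Hypothetical DeepSpeed config fragment for bf16 training; check your
# DeepSpeed version's docs for the exact key names.
deepspeed_bf16_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}
```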

lhatsk commented 2 years ago

I'm not using DeepSpeed in this experiment; I just switched on precision="bf16" in PyTorch Lightning.
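For reference, a minimal sketch of that setup (assumes a PyTorch Lightning version with bf16 support, roughly 1.5 or newer, and an Ampere-class GPU; `model` and `train_loader` are placeholders):

```python
import pytorch_lightning as pl

# Plain Lightning bf16 mixed precision on a single GPU, no DeepSpeed plugin.
trainer = pl.Trainer(gpus=1, precision="bf16")
# trainer.fit(model, train_loader)
```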

gahdritz commented 2 years ago

Hm. Could you test it with DeepSpeed one time? That's what our test used. I'd repeat the test without DeepSpeed myself, but the A100s we've been using are borrowed and not currently accessible.

lhatsk commented 2 years ago

It works, but it OOMs, which I believe it doesn't with FP16. Re-running the latter now. That's why I was wondering about the parameter size.

gahdritz commented 2 years ago

That's kind of weird. How much memory do you have on your A100s?

lhatsk commented 2 years ago

40 GB. Single batch. I now cap validation targets at 700 AA, which did the trick.

gahdritz commented 2 years ago

Just 700? That's very odd. Is grad being enabled for validation runs or something?

lhatsk commented 2 years ago

I didn't really test the limit; it can probably be a bit larger (I tested it on a v100s with 32 GB). There were some 1k+ AA targets in the set beforehand.

Not sure about grad being enabled. I was wondering the same, but manually switching to no_grad didn't change anything, and it's much faster compared to training on crops.

gahdritz commented 2 years ago

Actually on second thought it's not very weird that really long validation proteins should fail---chunking isn't enabled by default during validation, so you'll get much worse memory performance than during inference.
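To illustrate the point about chunking (a generic sketch of the technique, not OpenFold's actual implementation): evaluating a module over slices of a large dimension keeps the peak activation size proportional to the chunk size rather than the full sequence length.

```python
import torch

def chunked_apply(fn, x: torch.Tensor, chunk_size: int, dim: int = 0) -> torch.Tensor:
    # Run fn on chunk_size-sized slices along `dim` and stitch the results back together.
    outputs = [fn(chunk) for chunk in torch.split(x, chunk_size, dim=dim)]
    return torch.cat(outputs, dim=dim)

# Example: a memory-hungry op applied in chunks of 128 rows.
out = chunked_apply(lambda t: torch.softmax(t, dim=-1), torch.randn(1024, 64), chunk_size=128)
```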

lhatsk commented 2 years ago

No OOM with FP16...

gahdritz commented 2 years ago

Did you actually mean v100s, or was that a typo? v100s don't have bfloat16 support.

guolinke commented 2 years ago

I also tried BF16 with DeepSpeed; the speed/memory cost is almost the same as FP32, so I don't think it is actually enabled.

guolinke commented 2 years ago

@gahdritz With the real data, I found that num_iters is always 1, so the speed is much faster. Is that expected?

lhatsk commented 2 years ago

> Did you actually mean v100s, or was that a typo? v100s don't have bfloat16 support.

No typo, I have both v100s and A100. Ran the bfloat16 experiment on the A100.

gahdritz commented 2 years ago

@lhatsk would you mind moving this bfloat16 stuff into a new issue?

gahdritz commented 2 years ago

@guolinke No, that's not the case (let it run a little longer first). However, it is slightly bugged in that each DataLoader worker is currently initialized with the same seed, so you'll see values of the number of recycling iterations occur in groups of however many DataLoader workers you're using (so for me, with num_workers=8, you'd expect to see 8 1's approximately in a row, then 8 3's approximately in a row, and so on). For long enough tests, this shouldn't substantially affect the runtime (and for my previous test, I was using the seed 44, so for most of the first 13 iterations the number of recycling iterations was stuck at 2, so if anything, that was an overestimate).

E.g., with the seed 102238 and num_workers=4, I get

[screenshot of the training log with the recycling counts printed after each iteration]

The numbers printed after each iteration are multiple copies of the number of recycling iterations for that iteration.

The bug is difficult to fix, and it's Christmas, so I'll get back to you with this on Monday. Happy holidays!
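As an aside, one standard way to give each DataLoader worker its own random state (a generic sketch of the technique, not necessarily the fix that went into OpenFold):

```python
import random

import numpy as np
import torch

def worker_init_fn(worker_id: int):
    # Inside a worker, torch.initial_seed() is already base_seed + worker_id,
    # so deriving the other RNGs' seeds from it makes them differ per worker too.
    seed = torch.initial_seed() % 2**32
    np.random.seed(seed)
    random.seed(seed)

# loader = torch.utils.data.DataLoader(dataset, num_workers=4, worker_init_fn=worker_init_fn)
```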

gahdritz commented 2 years ago

BTW @guolinke the recycling number bug is now fixed. The fix requires a little bit of extra data processing, and so it comes with a performance penalty of about half a second. I'm trying to think of ways to improve it.

guolinke commented 2 years ago

Thank you @gahdritz , I will test it. Happy holidays!

guolinke commented 2 years ago

Hi @gahdritz, thank you for the fix; I confirm that the num_iters problem is resolved. However, I've hit another problem: training randomly gets stuck (GPU usage 0%, frozen) with the data you provided. I also tried the latest main branch, without any changes, and hit the same problem.

[screenshots of the stalled run]

Error log when pressing CTRL+C:

^CException ignored in: <function _MultiProcessingDataLoaderIter.__del__ at 0x7f0fea0b6430>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1328, in __del__
    self._shutdown_workers()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1301, in _shutdown_workers
    w.join(timeout=_utils.MP_STATUS_CHECK_INTERVAL)
  File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 149, in join
    res = self._popen.wait(timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/popen_fork.py", line 44, in wait
    if not wait([self.sentinel], timeout):
  File "/opt/conda/lib/python3.8/multiprocessing/connection.py", line 931, in wait
    ready = selector.select(timeout)
  File "/opt/conda/lib/python3.8/selectors.py", line 415, in select
    fd_event_list = self._selector.poll(timeout)
gahdritz commented 2 years ago

This has never happened to me. Something to raise with torch devs maybe? Maybe try adding more DataLoader workers?

guolinke commented 2 years ago

I think this problem was introduced by recent changes. I didn't hit it before merging the latest main branch, either.

gahdritz commented 2 years ago

Could you pinpoint the commit?

guolinke commented 2 years ago

Sorry, I was wrong; it was not introduced by recent commits. I used reset --hard to track down the commit and found that the problem exists even before the date I cloned OpenFold. I also tried different pytorch-lightning versions, different num_workers, and with/without DeepSpeed; none of them worked.

And when using dummy data, everything is fine.

gahdritz commented 2 years ago

Very strange. The only thing I can think to say is that I've tested OpenFold with both Python 3.7 and Python 3.9---never Python 3.8.

rfilgueiras commented 1 year ago

@gahdritz could you please share how you are currently using Python 3.9?

I only see Python 3.7 listed as the supported Python version.

I created this issue to ask this specific question: https://github.com/aqlaboratory/openfold/issues/265