I installed DeepSpeed for ROCm by cloning this repo and running:
bash install.sh -r
Here is the installation report:
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
runtime if needed. Op compatibility means that your system
meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
[WARNING] sparse_attn cuda is not available from torch
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] async_io: please install the libaio-devel package with yum
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch']
torch version .................... 1.10.1+rocm4.2
torch cuda version ............... None
torch hip version ................ 4.2.21155-37cb3a34
nvcc version ..................... [FAIL] cannot find CUDA_HOME via torch.utils.cpp_extension.CUDA_HOME=None
deepspeed install path ........... ['/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.8+bb4d5bf, bb4d5bf, master
deepspeed wheel compiled w. ...... torch 1.10, cuda 0.0, hip 4.2
When trying to train OpenFold, I got the following error:
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 397, in pre_dispatch
self.init_deepspeed()
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 474, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 507, in _initialize_deepspeed_train
model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 431, in _setup_model_and_optimizer
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 293, in __init__
self._configure_optimizer(optimizer, model_parameters)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1096, in _configure_optimizer
basic_optimizer = self._configure_basic_optimizer(model_parameters)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_basic_optimizer
optimizer = FusedAdam(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
fused_adam_cuda = FusedAdamBuilder().load()
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 400, in load
return self.jit_load(verbose)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 432, in jit_load
op_module = load(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
return _jit_compile(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1307, in _jit_compile
version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
hash_value = hash_source_files(hash_value, source_files)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp'
and
...
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 397, in pre_dispatch
self.init_deepspeed()
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 474, in init_deepspeed
self._initialize_deepspeed_train(model)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 507, in _initialize_deepspeed_train
model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 431, in _setup_model_and_optimizer
deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
engine = DeepSpeedEngine(args=args,
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 339, in __init__
util_ops = UtilsBuilder().load()
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 400, in load
return self.jit_load(verbose)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 432, in jit_load
op_module = load(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
return _jit_compile(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1307, in _jit_compile
version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
hash_value = hash_source_files(hash_value, source_files)
File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp'
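Both tracebacks end with the JIT loader failing to open a C++ source file under deepspeed/ops/csrc in the installed package. A quick way to confirm which sources are absent is sketched below. The helper names (missing_sources, find_package_root) are hypothetical and not part of DeepSpeed; the two relative paths are copied from the error messages above, and other ops may need further files that are not listed here.

```python
import importlib.util
from pathlib import Path

# Relative paths copied from the two FileNotFoundError messages above.
EXPECTED_SOURCES = [
    "ops/csrc/adam/fused_adam_frontend.cpp",
    "ops/csrc/utils/flatten_unflatten.cpp",
]


def missing_sources(package_root, relative_paths=EXPECTED_SOURCES):
    """Return the relative paths that are not regular files under package_root."""
    root = Path(package_root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


def find_package_root(name="deepspeed"):
    """Locate an installed top-level package directory without importing it.

    Returns None when the package cannot be found.
    """
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(list(spec.submodule_search_locations)[0])


if __name__ == "__main__":
    root = find_package_root()
    if root is None:
        print("deepspeed is not installed in this environment")
    else:
        for rel in missing_sources(root):
            print("missing:", root / rel)
```

If both paths are reported as missing, the csrc sources were not copied into site-packages during installation, which would be consistent with the errors above.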
To reproduce
Clone https://github.com/guolinke/openfold/tree/guoke/test and switch to the test branch.
To run the code, first gunzip test_data.pickle.gz, then run the training command:
python train_openfold.py . . . . 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --precision 16 --replace_sampler_ddp=True --seed 42 --deepspeed_config_path deepspeed_config.json --gpus 1