ROCm / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

fused_adam_frontend.cpp and flatten_unflatten.cpp not found. #52

Closed by 20171130 2 years ago

20171130 commented 2 years ago

I installed DeepSpeed for ROCm by cloning this repo and running:

bash install.sh -r

Here is the installation report:

DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
fused_adam ............. [NO] ....... [OKAY]
fused_lamb ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn cuda is not available from torch
sparse_attn ............ [NO] ....... [NO]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-devel package with yum
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
transformer_inference .. [NO] ....... [OKAY]
utils .................. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch']
torch version .................... 1.10.1+rocm4.2
torch cuda version ............... None
torch hip version ................ 4.2.21155-37cb3a34
nvcc version .....................  [FAIL] cannot find CUDA_HOME via torch.utils.cpp_extension.CUDA_HOME=None
deepspeed install path ........... ['/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed']
deepspeed info ................... 0.5.8+bb4d5bf, bb4d5bf, master
deepspeed wheel compiled w. ...... torch 1.10, cuda 0.0, hip 4.2

When trying to train OpenFold, I got:

  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 397, in pre_dispatch
    self.init_deepspeed()
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 474, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 507, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 431, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 293, in __init__
    self._configure_optimizer(optimizer, model_parameters)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1096, in _configure_optimizer
    basic_optimizer = self._configure_basic_optimizer(model_parameters)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1187, in _configure_basic_optimizer
    optimizer = FusedAdam(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/adam/fused_adam.py", line 72, in __init__
    fused_adam_cuda = FusedAdamBuilder().load()
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 400, in load
    return self.jit_load(verbose)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 432, in jit_load
    op_module = load(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
    return _jit_compile(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1307, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
    with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/csrc/adam/fused_adam_frontend.cpp'

and

...
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 397, in pre_dispatch
    self.init_deepspeed()
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 474, in init_deepspeed
    self._initialize_deepspeed_train(model)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 507, in _initialize_deepspeed_train
    model, deepspeed_optimizer = self._setup_model_and_optimizer(model, optimizer, scheduler)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/deepspeed.py", line 431, in _setup_model_and_optimizer
    deepspeed_engine, deepspeed_optimizer, _, _ = deepspeed.initialize(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/__init__.py", line 119, in initialize
    engine = DeepSpeedEngine(args=args,
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 339, in __init__
    util_ops = UtilsBuilder().load()
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 400, in load
    return self.jit_load(verbose)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/op_builder/builder.py", line 432, in jit_load
    op_module = load(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1124, in load
    return _jit_compile(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/cpp_extension.py", line 1307, in _jit_compile
    version = JIT_EXTENSION_VERSIONER.bump_version_if_changed(
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 45, in bump_version_if_changed
    hash_value = hash_source_files(hash_value, source_files)
  File "/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/torch/utils/_cpp_extension_versioner.py", line 15, in hash_source_files
    with open(filename) as file:
FileNotFoundError: [Errno 2] No such file or directory: '/opt/anaconda3/envs/openfold_venv/lib/python3.8/site-packages/deepspeed/ops/csrc/utils/flatten_unflatten.cpp'
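
Both tracebacks end in the same place: the JIT loader tries to hash a C++ source file under deepspeed/ops/csrc that is not present in the installed package. A minimal sketch for checking which of these sources are actually on disk is shown here. The helper names (missing_sources, find_package_dir) are hypothetical and not part of DeepSpeed; the two relative paths are copied from the error messages above. It only reports whether the files exist and does not say why they are missing.

```python
import importlib.util
from pathlib import Path

# Paths taken from the two FileNotFoundError messages, relative to the
# installed deepspeed package directory.
EXPECTED_SOURCES = [
    "ops/csrc/adam/fused_adam_frontend.cpp",
    "ops/csrc/utils/flatten_unflatten.cpp",
]


def missing_sources(package_dir, relative_paths=EXPECTED_SOURCES):
    """Return the relative paths that do not exist under package_dir."""
    root = Path(package_dir)
    return [p for p in relative_paths if not (root / p).is_file()]


def find_package_dir(name="deepspeed"):
    """Locate an installed package's directory without importing it, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None or spec.origin is None:
        return None
    return Path(spec.origin).parent


if __name__ == "__main__":
    pkg = find_package_dir()
    if pkg is None:
        print("deepspeed is not installed in this environment")
    else:
        for rel in missing_sources(pkg):
            print(f"missing: {pkg / rel}")
```

Run inside the same environment that produced the tracebacks (here, openfold_venv), any line printed with "missing:" names a source file the JIT build would fail on.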

To reproduce

Clone https://github.com/guolinke/openfold/tree/guoke/test and switch to the test branch. To run the code, first gunzip test_data.pickle.gz, then run the training command:

python train_openfold.py . . . . 2021-10-10 --template_release_dates_cache_path mmcif_cache.json --precision 16 --replace_sampler_ddp=True --seed 42 --gpus 1 --deepspeed_config_path deepspeed_config.json