ROCm / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

[BUG] Unit test failures on DeepSpeed upstream #56

Open bmedishe opened 2 years ago

bmedishe commented 2 years ago

Error log:

=========================== short test summary info ============================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_moe[4]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_moe_and_zero[4-True]
FAILED tests/unit/test_checkpointing.py::test_checkpoint_moe_and_zero[2-True]
FAILED tests/unit/test_configurable_parallel.py::TestConfigurableMP::test_gpt2_basic
====== 4 failed, 581 passed, 58 skipped, 1 warning in 3850.22s (1:04:10) =======

Steps to reproduce:

Follow the steps in this PR to install PyTorch with hipify_torch as a submodule. After building and installing PyTorch from source, clone DeepSpeed from upstream, do a JIT build, and run the unit tests:

  1. git clone https://github.com/microsoft/DeepSpeed.git
  2. Remove the line #include <THC/THCGeneral.h> from csrc/lamb/fused_lamb_cuda_kernel.cu before building
  3. ./install.sh (JIT build)
  4. DEEPSPEED_TEST_WITH_ROCM=1 pytest --forked tests/unit/test_* 2>&1 | tee deepspeed_unit_test
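
For triage, the failing test IDs can be pulled back out of the log that the last step writes with tee. The following is a minimal sketch, not part of DeepSpeed or pytest: the helper names failed_tests and group_by_file are hypothetical, and it assumes each failure appears on its own line starting with "FAILED", as pytest prints it in the short test summary.

```python
import re
from collections import defaultdict

# Matches pytest short-summary lines such as
# "FAILED tests/unit/test_x.py::test_name[param]" (any trailing reason is ignored).
FAILED_LINE = re.compile(r"^FAILED\s+(\S+)", re.MULTILINE)


def failed_tests(log_text):
    """Return the test IDs from 'FAILED ...' lines, in order of appearance."""
    return FAILED_LINE.findall(log_text)


def group_by_file(test_ids):
    """Group test IDs by the test file named before the first '::'."""
    groups = defaultdict(list)
    for test_id in test_ids:
        path, _, name = test_id.partition("::")
        groups[path].append(name)
    return dict(groups)
```

Reading the log with open("deepspeed_unit_test").read() and passing the text to failed_tests gives a list of IDs that can be handed back to pytest to rerun only the failures; group_by_file shows which test files they are concentrated in.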