ROCm / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

IFU-master-2021-09-29 #43

Closed. rraminen closed this 2 years ago

rraminen commented 3 years ago

IFU

The following conflicts have been resolved:

CONFLICT (content): Merge conflict in tests/unit/test_config.py
CONFLICT (content): Merge conflict in deepspeed/runtime/zero/stage2.py

Attached logs: test_config.log, stage2.log

@jithunnair-amd

rraminen commented 3 years ago

Local tests:

Unit tests summary:

=========================== short test summary info ============================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
====== 2 failed, 371 passed, 98 skipped, 1 warning in 3199.32s (0:53:19) =======

Bing BERT - No issues

Megatron LM v1.1.5 345 M param model - No issues

jithunnair-amd commented 3 years ago

The CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/26/artifact/DeepSpeed/unit_tests_py3.6.log showed the same failures as the local run:

FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer

A later CI unit test build http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/27/artifact/DeepSpeed/unit_tests_py3.6.log aborted with a timeout.
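
For reference, checking that two pytest logs report the same failures can be sketched as below. This is only an illustration: the helper names `failed_tests` and `compare_runs` are hypothetical and not part of DeepSpeed or the CI scripts, and the sketch assumes the logs contain pytest "short test summary" lines of the form `FAILED path::test_name`.

```python
import re

# Matches pytest short-summary entries such as
# "FAILED tests/unit/test_x.py::test_name".
FAILED_RE = re.compile(r"FAILED\s+(\S+::\S+)")


def failed_tests(log_text):
    """Return the set of test node IDs reported as FAILED in a pytest log."""
    return set(FAILED_RE.findall(log_text))


def compare_runs(local_log, ci_log):
    """Group failures by where they occurred (hypothetical helper)."""
    local, ci = failed_tests(local_log), failed_tests(ci_log)
    return {
        "both": sorted(local & ci),
        "local_only": sorted(local - ci),
        "ci_only": sorted(ci - local),
    }
```

If `local_only` and `ci_only` both come back empty, the two runs failed on the same tests, which is the situation described in this comment.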

jithunnair-amd commented 3 years ago

The GPT2 CI build is giving the wrong signal (it reports passing when the run is actually failing). Can we look into fixing it?
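
The cause of the false pass is not stated in this thread. As a rough sketch of one possible safeguard, a CI step could wrap the training command and fail when either the exit code is nonzero or the log contains a failure marker. The function name `run_and_check` and the marker strings are assumptions made for illustration, not taken from the actual CI job.

```python
import subprocess
import sys

# Hypothetical failure markers; adjust to whatever the real job log contains.
FAILURE_MARKERS = ("Traceback (most recent call last)", "RuntimeError")


def run_and_check(cmd, markers=FAILURE_MARKERS):
    """Run cmd and return (ok, output).

    ok is False if the command exits nonzero or if any marker string
    appears in its combined stdout/stderr.
    """
    proc = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
    )
    marker_hit = any(m in proc.stdout for m in markers)
    return (proc.returncode == 0 and not marker_hit), proc.stdout
```

A wrapper script would then call `sys.exit(0 if ok else 1)` so that the CI system sees the real outcome instead of the exit status of a later shell command.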

rraminen commented 2 years ago

The reasons for keeping this PR open:

  1. Evaluating the CIs
  2. Implementing the 8.3B-parameter GPT2 model of Megatron-LM v1.1.5 and updating the script in the pytorch-deepspeed-pr-build-gpt2 CI

jithunnair-amd commented 2 years ago

The PR-to-CI issues are still unresolved, but @rraminen will continue to work on them. As for the 8.3B-parameter GPT2 model, a script for running it with Megatron-LM v1.1.5 and ZeRO stage 3 has been added in https://github.com/ROCmSoftwarePlatform/DeepSpeedExamples/pull/13. We'll need to update the DeepSpeedExamples commit and then use this new script in the CI.
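
For readers unfamiliar with how ZeRO stage 3 is selected, a minimal DeepSpeed configuration might look like the sketch below. The values are illustrative assumptions and are not taken from the script in the linked DeepSpeedExamples PR; only the `zero_optimization.stage` setting reflects what this comment describes. The helper name `to_json` is hypothetical.

```python
import json

# Illustrative values only; NOT the settings used by the script in the PR.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,   # assumed batch size per GPU
    "gradient_accumulation_steps": 1,      # assumed, no accumulation
    "fp16": {"enabled": True},             # assumed mixed-precision training
    "zero_optimization": {"stage": 3},     # select ZeRO stage 3
}


def to_json(config):
    """Serialize the config dict to the JSON text DeepSpeed reads from a file."""
    return json.dumps(config, indent=2)
```

The resulting JSON would typically be saved to a file and passed to the training script through its DeepSpeed config argument.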