rraminen closed this PR 2 years ago
Local tests:
Unit tests summary:
=========================== short test summary info ============================
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
====== 2 failed, 371 passed, 98 skipped, 1 warning in 3199.32s (0:53:19) =======
Bing BERT - No issues
Megatron LM v1.1.5, 345M-parameter model - No issues
The CI unit test build (http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/26/artifact/DeepSpeed/unit_tests_py3.6.log) showed the same failures as the local run:
FAILED tests/unit/test_checkpointing.py::test_checkpoint_unfused_optimizer
FAILED tests/unit/test_checkpointing.py::test_checkpoint_fused_optimizer
A later CI unit test build (http://rocmhead.amd.com:8080/job/pytorch-deepspeed-pr-build-unit-tests/27/artifact/DeepSpeed/unit_tests_py3.6.log) was aborted due to a timeout.
The GPT2 CI build is giving the wrong signal: it reports passing when it is actually failing. Can we look into fixing it?
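Until the root cause of the false pass is known, one possible stopgap is to wrap the job's command so the build fails whenever the command exits non-zero or its output contains a known failure marker. This is only a sketch: `run_and_check` is a hypothetical helper, not part of DeepSpeed or the CI setup, and the default marker (a Python traceback header) is an assumption about what a failing GPT2 run prints.

```python
import subprocess
import sys


def run_and_check(cmd, failure_markers=("Traceback (most recent call last)",)):
    """Run cmd and return 0 only if it exits cleanly and prints no failure marker.

    Hypothetical helper for illustration; the markers are assumptions.
    """
    # Capture stdout and stderr together so markers on either stream are seen.
    # universal_newlines=True keeps this compatible with Python 3.6.
    result = subprocess.run(
        cmd,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    # Echo the captured output so it still appears in the CI log.
    sys.stdout.write(result.stdout)

    # A non-zero exit status from the command is passed through unchanged.
    if result.returncode != 0:
        return result.returncode
    # A zero exit status is still treated as a failure if a marker was printed.
    if any(marker in result.stdout for marker in failure_markers):
        return 1
    return 0
```

The CI step would then end with something like `sys.exit(run_and_check([...]))`, where the list holds the actual GPT2 launch command, so that the job status follows the returned code.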
Reasons for keeping this PR open:
The PR-to-CI issues are still unresolved, but @rraminen will continue to work on them. As for the 8.3B-parameter GPT2 model, a script for running it with Megatron 1.1.5 and ZeRO-3 has been added in https://github.com/ROCmSoftwarePlatform/DeepSpeedExamples/pull/13. We'll need to update the DeepSpeedExamples commit and then use this new script in the CI.
IFU
The below conflicts have been resolved:
CONFLICT (content): Merge conflict in tests/unit/test_config.py
CONFLICT (content): Merge conflict in deepspeed/runtime/zero/stage2.py
test_config.log stage2.log
@jithunnair-amd