rraminen closed this 2 years ago
@rraminen Can you please gather some triage info on when these tests started failing, especially around IFU points?
As of now, all the tests pass with DEEPSPEED_TEST_WITH_ROCM=1. @jithunnair-amd, please let me know if it's okay to close this PR.
Closing this PR
tests/unit/test_configurable_parallel.py::TestConfigurableMP::test_gpt2_mp_2to4 fails with the error:

Process Process-5:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/DeepSpeed/tests/unit/common.py", line 62, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/root/DeepSpeed/tests/unit/test_configurable_parallel.py", line 171, in _run_resize
    model = self.get_deepspeed_model(model, tmpdir)
  File "/root/DeepSpeed/tests/unit/test_configurable_parallel.py", line 59, in get_deepspeed_model
    model_parameters=model.parameters())
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/__init__.py", line 136, in initialize
    config_params=config_params)
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 147, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 590, in _configure_with_arguments
    self._config = DeepSpeedConfig(self.config, mpu)
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/config.py", line 664, in __init__
    object_pairs_hook=dict_raise_error_on_duplicate_keys)
  File "/opt/conda/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/opt/conda/lib/python3.6/json/__init__.py", line 367, in loads
    return cls(**kw).decode(s)
  File "/opt/conda/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
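
For triage: "Expecting value: line 1 column 1 (char 0)" is the message Python's json module raises when the input is empty. The sketch below reproduces that message under the assumption, which the log does not confirm, that the config file was empty or not yet fully written when the test process read it. The helper `load_config` and the file name `ds_config.json` are hypothetical and are not taken from the DeepSpeed test suite.

```python
import json
import os
import tempfile


def load_config(path):
    """Parse a JSON config file and return (config, error_message)."""
    with open(path) as f:
        try:
            return json.load(f), None
        except json.JSONDecodeError as e:
            return None, str(e)


# Hypothetical reproduction: an empty file stands in for a config that
# another process has created but not yet written to.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "ds_config.json")
    open(path, "w").close()  # create a zero-byte file
    config, error = load_config(path)

print(error)
```

If the failing run shows the same condition, checking the size of the config file just before `deepspeed.initialize` is called would help confirm or rule out this explanation.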
tests/unit/test_configurable_parallel.py::TestConfigurablePP::test_pp_basic fails with the error: RuntimeError: Connection reset by peer