ROCm / DeepSpeed

DeepSpeed is a deep learning optimization library that makes distributed training easy, efficient, and effective.
https://www.deepspeed.ai/
Apache License 2.0

Skip failing tests on ROCm #39

Closed rraminen closed 2 years ago

rraminen commented 2 years ago

tests/unit/test_configurable_parallel.py::TestConfigurableMP::test_gpt2_mp_2to4 fails with the error:

Process Process-5:
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "/root/DeepSpeed/tests/unit/common.py", line 62, in dist_init
    run_func(*func_args, **func_kwargs)
  File "/root/DeepSpeed/tests/unit/test_configurable_parallel.py", line 171, in _run_resize
    model = self.get_deepspeed_model(model, tmpdir)
  File "/root/DeepSpeed/tests/unit/test_configurable_parallel.py", line 59, in get_deepspeed_model
    model_parameters=model.parameters())
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/__init__.py", line 136, in initialize
    config_params=config_params)
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 147, in __init__
    self._configure_with_arguments(args, mpu)
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/engine.py", line 590, in _configure_with_arguments
    self._config = DeepSpeedConfig(self.config, mpu)
  File "/opt/conda/lib/python3.6/site-packages/deepspeed/runtime/config.py", line 664, in __init__
    object_pairs_hook=dict_raise_error_on_duplicate_keys)
  File "/opt/conda/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
  File "/opt/conda/lib/python3.6/json/__init__.py", line 367, in loads
    return cls(**kw).decode(s)
  File "/opt/conda/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/opt/conda/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

tests/unit/test_configurable_parallel.py::TestConfigurablePP::test_pp_basic fails with the error: RuntimeError: Connection reset by peer
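
For context on the first failure: "Expecting value: line 1 column 1 (char 0)" is the message Python's json module produces when the input it is asked to parse is empty. That would be consistent with the config file being empty or not yet written when it was read, though nothing in this thread confirms the root cause. A minimal sketch that reproduces the message with the standard library only (the helper names `load_config` and `describe_failure` are hypothetical and not part of DeepSpeed):

```python
import io
import json


def load_config(stream):
    """Parse a JSON config from a file-like object (illustrative helper)."""
    return json.load(stream)


def describe_failure(text):
    """Return the JSONDecodeError message for `text`, or None if it parses."""
    try:
        load_config(io.StringIO(text))
    except json.JSONDecodeError as err:
        return str(err)
    return None


# An empty stream stands in for an empty config file.
empty_error = describe_failure("")
# A well-formed config parses without error.
valid_error = describe_failure('{"train_batch_size": 8}')

print(empty_error)
print(valid_error)
```

If the same message appears intermittently, checking the size of the config file just before it is loaded is a cheap way to tell an empty file apart from a malformed one.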

jithunnair-amd commented 2 years ago

@rraminen Can you please gather some triage info on when these tests started failing, especially around IFU points?

rraminen commented 2 years ago

As of now, all the tests pass with DEEPSPEED_TEST_WITH_ROCM=1. @jithunnair-amd, please let me know if it's okay to close this PR.
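
For readers unfamiliar with the pattern, a conditional skip of the kind this PR's title describes could be sketched roughly as below. This is an assumption-laden illustration, not the PR's actual change: it assumes DEEPSPEED_TEST_WITH_ROCM=1 marks a ROCm test run, and the helper `running_on_rocm` and the test class are hypothetical. It uses the standard library's unittest so that it runs on its own.

```python
import os
import unittest


def running_on_rocm(environ=None):
    """Return True when the (assumed) ROCm test flag is set to "1"."""
    environ = os.environ if environ is None else environ
    return environ.get("DEEPSPEED_TEST_WITH_ROCM") == "1"


class ExampleConfigurableParallelTest(unittest.TestCase):
    # Hypothetical test: skipped only when the flag indicates a ROCm run.
    @unittest.skipIf(running_on_rocm(), "known failure on ROCm")
    def test_placeholder(self):
        self.assertTrue(True)


if __name__ == "__main__":
    unittest.main()
```

Keeping the environment check in one helper means a skip can be dropped in a single place once a test passes again on ROCm, as happened here.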

rraminen commented 2 years ago

Closing this PR.