Closed: jahatef closed this pull request 5 months ago.
Fixed the workflows by specifying Python versions and installing packages before running tests. pip will report "requirement already satisfied" for any package that is already installed, which should be fine. I also updated some requirements in requirements.txt: I pulled the commit hash off DeeperSpeed (which I'm not sure we want), and I updated the numpy requirement to be <2.0, which is required because numpy 2.x breaks DeepSpeed.
Tests will run, although it seems some tests currently fail because the runner has no GPU access, and some fail for reasons seemingly unrelated to the workflows. See https://github.com/EleutherAI/gpt-neox/actions/runs/9555032138/job/26337367665
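For reference, the gist of the fix can be sketched as a workflow step plus a requirements pin. This is a minimal sketch, not the PR's literal diff; the file paths, Python version, and action versions are assumptions:

    # hypothetical excerpt from a .github/workflows/*.yml test job
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"   # pin an explicit interpreter version
      - run: pip install -r requirements/requirements.txt   # install first; "requirement already satisfied" is harmless
      - run: pytest tests/

    # and the corresponding line in requirements.txt
    numpy<2.0   # per the description above, numpy 2.x breaks DeepSpeed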
Here's the relevant trace from the runner, for future reference.
____________________________ test_main_constructor _____________________________
def test_main_constructor():
input_args = ["train.py", "tests/config/test_setup.yml"]
> neox_args = NeoXArgs.consume_deepy_args(input_args)
tests/unit/test_arguments.py:21:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
megatron/neox_arguments/arguments.py:371: in consume_deepy_args
neox_args = cls.from_ymls(
megatron/neox_arguments/arguments.py:229: in from_ymls
return cls(**config)
<string>:266: in __init__
???
megatron/neox_arguments/arguments.py:134: in __post_init__
self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''
def obtain_resource_pool(
hostfile_path, include_arg, exclude_arg
) -> Dict[str, List[int]]:
"""
Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
Modified from: `deepspeed.launcher.runner.main`
"""
resource_pool = fetch_hostfile(hostfile_path)
if not resource_pool:
resource_pool = {}
device_count = torch.cuda.device_count()
if device_count == 0:
> raise RuntimeError("Unable to proceed, no GPU resources available")
E RuntimeError: Unable to proceed, no GPU resources available
megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
Warning: 17 21:32:57,005] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
__________________________ test_constructor_from_ymls __________________________
def test_constructor_from_ymls():
t1 = test_constructor_from_ymls_class()
> t1.test()
tests/unit/test_arguments.py:37:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/unit/test_arguments.py:31: in test
neox_args = NeoXArgs.from_ymls(["tests/config/test_setup.yml"])
megatron/neox_arguments/arguments.py:229: in from_ymls
return cls(**config)
<string>:266: in __init__
???
megatron/neox_arguments/arguments.py:134: in __post_init__
self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''
def obtain_resource_pool(
hostfile_path, include_arg, exclude_arg
) -> Dict[str, List[int]]:
"""
Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
Modified from: `deepspeed.launcher.runner.main`
"""
resource_pool = fetch_hostfile(hostfile_path)
if not resource_pool:
resource_pool = {}
device_count = torch.cuda.device_count()
if device_count == 0:
> raise RuntimeError("Unable to proceed, no GPU resources available")
E RuntimeError: Unable to proceed, no GPU resources available
megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
Warning: 17 21:32:57,294] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
__________________________ test_constructor_from_dict __________________________
def test_constructor_from_dict():
t1 = test_constructor_from_dict_class()
> t1.test()
tests/unit/test_arguments.py:49:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/unit/test_arguments.py:44: in test
neox_args = NeoXArgs.from_dict(BASE_CONFIG)
megatron/neox_arguments/arguments.py:236: in from_dict
return cls(**args_dict)
<string>:266: in __init__
???
megatron/neox_arguments/arguments.py:134: in __post_init__
self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''
def obtain_resource_pool(
hostfile_path, include_arg, exclude_arg
) -> Dict[str, List[int]]:
"""
Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
Modified from: `deepspeed.launcher.runner.main`
"""
resource_pool = fetch_hostfile(hostfile_path)
if not resource_pool:
resource_pool = {}
device_count = torch.cuda.device_count()
if device_count == 0:
> raise RuntimeError("Unable to proceed, no GPU resources available")
E RuntimeError: Unable to proceed, no GPU resources available
megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
Warning: 17 21:32:57,574] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
_________________________ test_gpt_neox_to_huggingface _________________________
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f278be35b70>
tmpdir = local('/tmp/pytest-of-root/pytest-1/test_gpt_neox_to_huggingface0')
tmp_path = PosixPath('/tmp/pytest-of-root/pytest-1/test_gpt_neox_to_huggingface0')
def test_gpt_neox_to_huggingface(monkeypatch, tmpdir, tmp_path):
# Generate random GPT-NEOX model, check we can convert to hf format
model_dir = str(tmpdir)
input_args = ["train.py", "tests/config/test_setup.yml"]
> deepspeed_main_args = simulate_deepy_env(monkeypatch, input_args)
tests/unit/test_format_conversion_scripts.py:11:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
tests/common.py:523: in simulate_deepy_env
neox_args = NeoXArgs.consume_deepy_args(input_args)
megatron/neox_arguments/arguments.py:371: in consume_deepy_args
neox_args = cls.from_ymls(
megatron/neox_arguments/arguments.py:229: in from_ymls
return cls(**config)
<string>:266: in __init__
???
megatron/neox_arguments/arguments.py:134: in __post_init__
self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''
def obtain_resource_pool(
hostfile_path, include_arg, exclude_arg
) -> Dict[str, List[int]]:
"""
Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
Modified from: `deepspeed.launcher.runner.main`
"""
resource_pool = fetch_hostfile(hostfile_path)
if not resource_pool:
resource_pool = {}
device_count = torch.cuda.device_count()
if device_count == 0:
> raise RuntimeError("Unable to proceed, no GPU resources available")
E RuntimeError: Unable to proceed, no GPU resources available
megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
Warning: 17 21:32:58,104] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
=============================== warnings summary ===============================
<string>:8
<string>:8: PytestDeprecationWarning: A private pytest class or function was used.
-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/neox_args/test_neoxargs_usage.py::test_neoxargs_usage
FAILED tests/unit/test_arguments.py::test_main_constructor
FAILED tests/unit/test_arguments.py::test_constructor_from_ymls
FAILED tests/unit/test_arguments.py::test_constructor_from_dict
FAILED tests/unit/test_format_conversion_scripts.py::test_gpt_neox_to_huggingface
======= 5 failed, 24 passed, 92 skipped, 80 xfailed, 1 warning in 28.89s =======
Error: Process completed with exit code 1.
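All four traces above fail at the same check: obtain_resource_pool raises as soon as torch.cuda.device_count() returns 0 on the CPU-only runner. A common way to keep such tests from erroring on GPU-less CI is a pytest skip marker; a minimal sketch under that assumption (the marker name is hypothetical, not something this PR adds):

    import pytest
    import torch

    # Hypothetical marker: skip GPU-dependent tests on runners without CUDA devices.
    requires_gpu = pytest.mark.skipif(
        torch.cuda.device_count() == 0,
        reason="no GPU resources available on this runner",
    )

    @requires_gpu
    def test_main_constructor():
        ...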
@jahatef -- Why remove the commit hash from deeperspeed, but leave it for lm_dataformat?
No good reason; it was a 4-month-old version, which I'm not sure we want to be the default for users. I can add the hash back or remove it for the other package. I don't believe it was the cause of the issues I saw; I think it was the numpy version that caused problems with DeepSpeed.
The commit hash was added back to the DeeperSpeed requirement.
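For future reference, pinning a dependency to a specific commit in requirements.txt uses pip's direct-reference syntax; a hedged example, where <commit-hash> is a placeholder rather than the actual pin:

    deepspeed @ git+https://github.com/EleutherAI/DeeperSpeed.git@<commit-hash>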
Possibly fix workflow issues. Needs to be tested in PR.