EleutherAI / gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries
https://www.eleuther.ai/
Apache License 2.0

fix python version and pytest install #1234

Closed · jahatef closed this 5 months ago

jahatef commented 5 months ago

Possibly fix workflow issues. Needs to be tested in PR.

jahatef commented 5 months ago

Fixed the workflows by specifying Python versions and installing packages before running the tests. The pip install will exit with "requirement already satisfied" if a package is already installed, which should be fine. I also updated some requirements in requirements.txt: I removed the pinned commit hash from DeeperSpeed (which I'm not sure we want), and I updated the numpy requirement to <2.0, which is required because numpy 2.x breaks DeepSpeed.

Tests will run, although some currently fail because no GPU is available on the runner, and some fail for reasons seemingly unrelated to the workflows. See https://github.com/EleutherAI/gpt-neox/actions/runs/9555032138/job/26337367665
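
For reference, the shape of this fix is roughly the following (an illustrative sketch only; the actual workflow file name, job names, and requirements paths in this repo may differ):

```yaml
# Hypothetical workflow sketch, not the repo's actual file.
name: Pull Request Tests

on: [pull_request]

jobs:
  run-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"  # pin an explicit interpreter version
      - name: Install dependencies before running tests
        # pip exits cleanly with "requirement already satisfied" for
        # anything already present on the runner
        run: pip install -r requirements/requirements.txt
      - name: Run the unit tests
        run: pytest tests/
```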

Quentin-Anthony commented 5 months ago

> Fixed the workflows by specifying Python versions and installing packages before running the tests. The pip install will exit with "requirement already satisfied" if a package is already installed, which should be fine. I also updated some requirements in requirements.txt: I removed the pinned commit hash from DeeperSpeed (which I'm not sure we want), and I updated the numpy requirement to <2.0, which is required because numpy 2.x breaks DeepSpeed.
>
> Tests will run, although some currently fail because no GPU is available on the runner, and some fail for reasons seemingly unrelated to the workflows. See https://github.com/EleutherAI/gpt-neox/actions/runs/9555032138/job/26337367665

Here's the relevant trace from the runner, for future reference.

```
____________________________ test_main_constructor _____________________________
def test_main_constructor():
        input_args = ["train.py", "tests/config/test_setup.yml"]
>       neox_args = NeoXArgs.consume_deepy_args(input_args)

tests/unit/test_arguments.py:21: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
megatron/neox_arguments/arguments.py:371: in consume_deepy_args
    neox_args = cls.from_ymls(
megatron/neox_arguments/arguments.py:229: in from_ymls
    return cls(**config)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''

    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available

megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
[2024-06-17 21:32:57,005] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
__________________________ test_constructor_from_ymls __________________________
def test_constructor_from_ymls():
        t1 = test_constructor_from_ymls_class()
>       t1.test()

tests/unit/test_arguments.py:37: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/unit/test_arguments.py:31: in test
    neox_args = NeoXArgs.from_ymls(["tests/config/test_setup.yml"])
megatron/neox_arguments/arguments.py:229: in from_ymls
    return cls(**config)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''

    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available
megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
[2024-06-17 21:32:57,294] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
__________________________ test_constructor_from_dict __________________________
def test_constructor_from_dict():
        t1 = test_constructor_from_dict_class()
>       t1.test()

tests/unit/test_arguments.py:49: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/unit/test_arguments.py:44: in test
    neox_args = NeoXArgs.from_dict(BASE_CONFIG)
megatron/neox_arguments/arguments.py:236: in from_dict
    return cls(**args_dict)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''
    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available

megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
[2024-06-17 21:32:57,574] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
_________________________ test_gpt_neox_to_huggingface _________________________
monkeypatch = <_pytest.monkeypatch.MonkeyPatch object at 0x7f278be35b70>
tmpdir = local('/tmp/pytest-of-root/pytest-1/test_gpt_neox_to_huggingface0')
tmp_path = PosixPath('/tmp/pytest-of-root/pytest-1/test_gpt_neox_to_huggingface0')

    def test_gpt_neox_to_huggingface(monkeypatch, tmpdir, tmp_path):
        # Generate random GPT-NEOX model, check we can convert to hf format
        model_dir = str(tmpdir)
        input_args = ["train.py", "tests/config/test_setup.yml"]
>       deepspeed_main_args = simulate_deepy_env(monkeypatch, input_args)

tests/unit/test_format_conversion_scripts.py:11: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/common.py:523: in simulate_deepy_env
    neox_args = NeoXArgs.consume_deepy_args(input_args)
megatron/neox_arguments/arguments.py:371: in consume_deepy_args
    neox_args = cls.from_ymls(
megatron/neox_arguments/arguments.py:229: in from_ymls
    return cls(**config)
<string>:266: in __init__
    ???
megatron/neox_arguments/arguments.py:134: in __post_init__
    self.calculate_derived()
megatron/neox_arguments/arguments.py:836: in calculate_derived
    resources = obtain_resource_pool(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hostfile_path = 'None', include_arg = 'localhost:1', exclude_arg = ''

    def obtain_resource_pool(
        hostfile_path, include_arg, exclude_arg
    ) -> Dict[str, List[int]]:
        """
        Get dict of `resource_pool[hostname] = [list of GPU ranks]` using hostfile, include and exclude args.
        Modified from: `deepspeed.launcher.runner.main`
        """
        resource_pool = fetch_hostfile(hostfile_path)
        if not resource_pool:
            resource_pool = {}
            device_count = torch.cuda.device_count()
            if device_count == 0:
>               raise RuntimeError("Unable to proceed, no GPU resources available")
E               RuntimeError: Unable to proceed, no GPU resources available

megatron/utils.py:201: RuntimeError
----------------------------- Captured stdout call -----------------------------
NeoXArgs.from_ymls() ['tests/config/test_setup.yml']
[2024-06-17 21:32:58,104] [WARNING] [runner.py:217:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
=============================== warnings summary ===============================
<string>:8
  <string>:8: PytestDeprecationWarning: A private pytest class or function was used.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
=========================== short test summary info ============================
FAILED tests/neox_args/test_neoxargs_usage.py::test_neoxargs_usage
FAILED tests/unit/test_arguments.py::test_main_constructor
FAILED tests/unit/test_arguments.py::test_constructor_from_ymls
FAILED tests/unit/test_arguments.py::test_constructor_from_dict
FAILED tests/unit/test_format_conversion_scripts.py::test_gpt_neox_to_huggingface
======= 5 failed, 24 passed, 92 skipped, 80 xfailed, 1 warning in 28.89s =======
Error: Process completed with exit code 1.
```
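
All four traces reduce to the same root cause: `obtain_resource_pool` raises when no hostfile is found and `torch.cuda.device_count()` returns 0 on the CPU-only runner. One way such tests could be kept from failing on GPU-less runners (a sketch of a common pytest pattern, not something this PR implements) is a shared skip marker:

```python
import pytest
import torch

# Hypothetical marker, not an existing helper in this repo: skips any
# test that needs CUDA when the CI runner exposes no GPUs.
requires_gpu = pytest.mark.skipif(
    torch.cuda.device_count() == 0,
    reason="no GPU resources available on this runner",
)

@requires_gpu
def test_main_constructor():
    ...  # body elided; would exercise NeoXArgs.consume_deepy_args
```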
Quentin-Anthony commented 5 months ago

@jahatef -- Why remove the commit hash from deeperspeed, but leave it for lm_dataformat?

jahatef commented 5 months ago

No good reason; it was a four-month-old version, which I'm not sure we want to be the default for users. I can add the hash back or remove it for the other package. I don't believe it was the cause of the issues I saw; I think it was the numpy version that caused problems with DeepSpeed.
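
For concreteness, the two requirements changes under discussion would look roughly like this (the commit hash is a placeholder, and the exact package spec and file layout in the repo may differ):

```
deepspeed @ git+https://github.com/EleutherAI/DeeperSpeed.git@<commit-hash>
numpy<2.0  # numpy 2.x breaks DeepSpeed
```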

jahatef commented 5 months ago

The commit hash has been added back to the DeeperSpeed requirement.