NVIDIA / NeMo-Run

A tool to configure, launch and manage your machine learning experiments.
Apache License 2.0

zlib.error: Error -3 while decompressing data: incorrect header check #97

Open RachitBansal opened 1 month ago

RachitBansal commented 1 month ago

I'm running into the following deserialization error in src/nemo_run/core/runners/fdl_runner.py:

Traceback (most recent call last):
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
    fdl_runner_app()
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 326, in __call__
    raise e
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 309, in __call__
    return get_command(self)(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 661, in main
    return _main(
  File "/usr/local/lib/python3.10/dist-packages/typer/core.py", line 193, in _main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/dist-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/typer/main.py", line 692, in wrapper
    return callback(**use_params)
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 47, in fdl_direct_run
    fdl_deser_package: fdl.Buildable = ZlibJSONSerializer().deserialize(fdl_package_cfg)
  File "/opt/NeMo-Run/src/nemo_run/core/serialization/zlib_json.py", line 42, in deserialize
    zlib.decompress(base64.urlsafe_b64decode(serialized)).decode(),
zlib.error: Error -3 while decompressing data: incorrect header check

Context: I am running a pretraining example job:

  python scripts/llm/llama3_pretraining.py --size=8b --slurm

The srun command being launched looks like this:

  srun --output /path/to/logfile.out python -m nemo_run.core.runners.fdl_runner -n llama3-8b \
    -p /nemo_run/configs/llama3-8b_packager /nemo_run/configs/llama3-8b_fn_or_script

A probable cause is that the config file passed as the last argument of this srun command, /nemo_run/configs/llama3-8b_fn_or_script, does not exist: I can't find it anywhere on my filesystem. However, I don't know what to change to fix this.
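To illustrate why a missing config file could surface as this particular zlib error, here is a minimal sketch. It assumes the runner ends up feeding the path string itself, rather than the file's contents, into the decode chain shown in the traceback. The serialize, deserialize, and describe_failure helpers are hypothetical stand-ins, not NeMo-Run's actual code, and the use of json.loads around the decompress call is an assumption.

```python
import base64
import json
import zlib


def serialize(obj) -> str:
    """Hypothetical encoder: JSON -> zlib -> URL-safe base64 text."""
    return base64.urlsafe_b64encode(zlib.compress(json.dumps(obj).encode())).decode()


def deserialize(serialized: str):
    """Mirrors the decode chain from the traceback; json.loads is assumed."""
    return json.loads(zlib.decompress(base64.urlsafe_b64decode(serialized)).decode())


def describe_failure(serialized: str) -> str:
    """Return 'ok' if decoding succeeds, else the zlib error message."""
    try:
        deserialize(serialized)
        return "ok"
    except zlib.error as exc:
        return str(exc)


# A real payload round-trips cleanly.
payload = serialize({"name": "llama3-8b"})

# A path string happens to be valid base64 text, but the decoded bytes
# do not start with a zlib header, so decompression fails.
path_as_payload = "/nemo_run/configs/llama3-8b_fn_or_script"

print(describe_failure(payload))
print(describe_failure(path_as_payload))
```

If this assumption holds, the zlib error is a symptom rather than the root cause, and the thing to check is whether the file under /nemo_run/configs/ is actually visible to the job.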

Any pointers would be useful!

CC: @hemildesai

hemildesai commented 1 month ago

Hi, for Slurm we currently only support clusters that have Pyxis (https://github.com/NVIDIA/pyxis) enabled. With Pyxis, we automatically mount the configs at /nemo_run/configs/ using srun's --container-mounts option. Does your cluster have Pyxis enabled?
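One rough way to answer that question from a login node is to look for the Pyxis container flags in srun's help output. The sketch below is a heuristic, not an official check: it assumes that a Pyxis-enabled srun lists --container-image and --container-mounts in `srun --help`, and the function names has_pyxis_flags and srun_help_text are hypothetical.

```python
import shutil
import subprocess


def has_pyxis_flags(srun_help: str) -> bool:
    """Heuristic: Pyxis is assumed to add these options to srun's help text."""
    return "--container-image" in srun_help and "--container-mounts" in srun_help


def srun_help_text() -> str:
    """Return the output of `srun --help`, or an empty string if srun is missing."""
    if shutil.which("srun") is None:
        return ""
    result = subprocess.run(
        ["srun", "--help"], capture_output=True, text=True, check=False
    )
    # Some builds print help to stderr, so combine both streams.
    return result.stdout + result.stderr


if __name__ == "__main__":
    print("Pyxis flags found:", has_pyxis_flags(srun_help_text()))
```

If the flags are absent, the --container-mounts option mentioned above would have nothing to act on, which would be consistent with /nemo_run/configs/ not existing inside the job.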