NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

Add fallbacks for c++ extension + jit_fuser #1235

Closed marcromeyn closed 1 month ago

marcromeyn commented 1 month ago

In NeMo-Run we let users configure an experiment locally and execute it remotely through a wide variety of executors. For the best UX this requires that mcore can be installed locally (even though we don't intend to train models locally). Currently, a couple of issues can make a pip install fail:

  1. The C++ extension: I propose logging a warning when its build fails instead of failing the entire installation.
  2. The jit_fuser decorator can fail on newer Python versions. I propose making it a no-op when it fails. The current error is:
```
  from megatron.core.transformer.utils import (
  File "/Users/mromeijn/base/code/.venv/lib/python3.12/site-packages/megatron/core/transformer/utils.py", line 40, in <module>
    @jit_fuser
     ^^^^^^^^^
  File "/Users/mromeijn/base/code/.venv/lib/python3.12/site-packages/lightning_fabric/wrappers.py", line 411, in _capture
    return compile_fn(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mromeijn/base/code/.venv/lib/python3.12/site-packages/torch/__init__.py", line 1868, in compile
    raise RuntimeError("Dynamo is not supported on Python 3.12+")
RuntimeError: Dynamo is not supported on Python 3.12+
```
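The no-op fallback proposed in point 2 could be sketched roughly as follows. This is an illustrative sketch only, not Megatron-LM's actual `jit_fuser` implementation: it tries `torch.compile` and silently falls back to the undecorated function when torch is missing or Dynamo rejects the Python version.

```python
def jit_fuser(func):
    """Apply torch.compile when available; otherwise return func unchanged.

    Sketch of a fallback decorator: if torch is not installed, or
    torch.compile raises (e.g. "Dynamo is not supported on Python 3.12+"),
    the decorator becomes a no-op and the plain function is used.
    """
    try:
        import torch
        return torch.compile(func)
    except (ImportError, AttributeError, RuntimeError):
        # torch missing, too old for torch.compile, or Dynamo unsupported
        # on this Python version: fall back to the eager function.
        return func


@jit_fuser
def add(a, b):
    return a + b
```

With this shape, importing a module that decorates functions with `jit_fuser` succeeds even in environments where compilation is impossible, at the cost of losing fusion there.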
ko3n1g commented 1 month ago

Resolved it offline