NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[BUG] megatron.training not found #870

Closed windprak closed 1 week ago

windprak commented 2 weeks ago

Describe the bug
Module megatron.training not found in the latest version of megatron_core (0.8.0rc0):

Megatron-LM/examples/run_simple_mcore_train_loop.py", line 18, in <module>
    from megatron.training.tokenizer.tokenizer import _NullTokenizer
ModuleNotFoundError: No module named 'megatron.training'

To Reproduce

Installed packages (pip list):

annotated-types          0.7.0
apex                     0.1
click                    8.1.7
einops                   0.8.0
filelock                 3.13.1
flash_attn               2.4.2
fsspec                   2024.2.0
Jinja2                   3.1.3
joblib                   1.4.2
MarkupSafe               2.1.5
megatron_core            0.8.0rc0
mpmath                   1.3.0
networkx                 3.2.1
ninja                    1.11.1.1
nltk                     3.8.1
numpy                    1.26.4
nvidia-cublas-cu12       12.4.2.65
nvidia-cuda-cupti-cu12   12.4.99
nvidia-cuda-nvrtc-cu12   12.4.99
nvidia-cuda-runtime-cu12 12.4.99
nvidia-cudnn-cu12        9.1.0.70
nvidia-cufft-cu12        11.2.0.44
nvidia-curand-cu12       10.3.5.119
nvidia-cusolver-cu12     11.6.0.99
nvidia-cusparse-cu12     12.3.0.142
nvidia-nccl-cu12         2.20.5
nvidia-nvjitlink-cu12    12.4.99
nvidia-nvtx-cu12         12.4.99
packaging                24.1
pip                      24.0
pydantic                 2.7.4
pydantic_core            2.18.4
pytorch-triton           3.0.0+45fff310c8
regex                    2024.5.15
sentencepiece            0.2.0
setuptools               69.5.1
sympy                    1.12
torch                    2.4.0.dev20240611+cu124
tqdm                     4.66.4
transformer_engine       1.6.0+c81733f
typing_extensions        4.8.0
wheel                    0.43.0

Steps:

git clone https://github.com/NVIDIA/Megatron-LM.git

cd Megatron-LM

pip install -e .
cd examples
NUM_GPUS=2
torchrun --nproc-per-node $NUM_GPUS run_simple_mcore_train_loop.py

Expected behavior
The example script imports without errors.

Kawamiya commented 1 week ago

Same problem here.

windprak commented 1 week ago

I fixed it by adding packages=setuptools.find_namespace_packages(include=["megatron.core", "megatron.core.*", "megatron.training"]) to setup.py.
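For reference, a minimal sketch of how find_namespace_packages with an include filter behaves. The throwaway directory layout below is illustrative only (it mimics the megatron/core and megatron/training subpackages), not the real repo's setup.py metadata:

```python
import os
import tempfile

import setuptools

# Build a throwaway tree mimicking the repo layout:
# <root>/megatron/core and <root>/megatron/training.
root = tempfile.mkdtemp()
for pkg in ("megatron/core", "megatron/training"):
    path = os.path.join(root, pkg)
    os.makedirs(path)
    open(os.path.join(path, "__init__.py"), "w").close()

# Only packages matching the include patterns are collected; if
# "megatron.training" is missing from the list, it never gets installed,
# which would produce exactly the ModuleNotFoundError reported above.
found = setuptools.find_namespace_packages(
    where=root,
    include=["megatron.core", "megatron.core.*",
             "megatron.training", "megatron.training.*"],
)
print(sorted(found))
```

Note that the bare "megatron" directory itself is filtered out because it matches none of the include patterns; only the explicitly listed subpackages survive.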

But guess what, the "simple" script still crashes:

rank1: File "/home/atuin/b216dc/b216dc10/software/private/conda/envs/megatron/lib/python3.10/site-packages/torch/distributed/checkpoint/default_planner.py", line 389, in create_default_global_save_plan
rank1:   assert item.index.fqn not in md

schheda1 commented 1 week ago

I think if you run it from the repo root as

PYTHONPATH=$PYTHONPATH:./megatron torchrun --nproc-per-node 2 examples/run_simple_mcore_train_loop.py

it should work (as mentioned in QuickStart.md).