databricks / megablocks

amp_C undefined symbol after installing Megablocks #157

Open RachitBansal opened 1 month ago

RachitBansal commented 1 month ago

I am trying to set up and use MegaBlocks to train MoE models, but I get the following error:

Traceback (most recent call last):
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/pretrain_gpt.py", line 8, in <module>
    from megatron import get_args
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/__init__.py", line 13, in <module>
    from .initialize  import initialize_megatron
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/initialize.py", line 19, in <module>
    from megatron.checkpointing import load_args_from_checkpoint
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/checkpointing.py", line 15, in <module>
    from .utils import (unwrap_model,
  File "/n/holyscratch01/dam_lab/brachit/moes/megablocks/third_party/Megatron-LM/megatron/utils.py", line 11, in <module>
    import amp_C
ImportError: /usr/local/lib/python3.10/dist-packages/amp_C.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

I am working on NGC's nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container.

When I run GPT-2 training (using exp/gpt2/gpt2_gpt2_46m_1gpu.sh) before running pip install megablocks, it works fine, while the MoE script (exp/moe/moe_125m_8gpu_interactive.sh) fails with the error "Megablocks not available".

However, after I run pip install megablocks or pip install . in the container, both the GPT-2 script and the MoE script fail with the amp_C undefined symbol error shown above.
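
A note on reading the error: the undefined symbol in the traceback demangles to c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<...> const&). The std::__cxx11 part suggests amp_C was compiled against a PyTorch build that uses the libstdc++ CXX11 ABI. If the pip install replaced the container's PyTorch with a build using the older ABI, that symbol would no longer be exported, which would match the behaviour described. This is an assumption, not something confirmed in the thread. A minimal sketch for checking it (the helper names are hypothetical):

```python
# Mangled symbol copied from the ImportError in the traceback above.
MISSING_SYMBOL = (
    "_ZN3c106detail14torchCheckFailEPKcS2_jRKNSt7__cxx1112"
    "basic_stringIcSt11char_traitsIcESaIcEEE"
)


def symbol_uses_cxx11_abi(mangled: str) -> bool:
    """Return True if the mangled name references std::__cxx11 types."""
    return "__cxx11" in mangled


def installed_torch_info():
    """Return version and ABI details of the installed torch, or None if it cannot be imported."""
    try:
        import torch
    except ImportError:
        return None
    return {
        "version": torch.__version__,
        "cuda": torch.version.cuda,
        # True if this torch build was compiled with _GLIBCXX_USE_CXX11_ABI=1.
        "cxx11_abi": torch.compiled_with_cxx11_abi(),
    }


if __name__ == "__main__":
    print("amp_C expects CXX11 ABI:", symbol_uses_cxx11_abi(MISSING_SYMBOL))
    print("installed torch:", installed_torch_info())
```

If the first line prints True and the installed torch reports cxx11_abi as False, the two were probably built against incompatible ABIs. Running the script in a fresh container and again after the pip install shows whether torch itself changed.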

mvpatel2000 commented 1 month ago

I've seen this a few times when something gets built for the wrong version of PyTorch and the install goes wrong. I would print the full install logs and check whether anything is being reinstalled.
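
Besides reading the install logs, one way to spot a reinstall is to compare the installed package versions before and after the pip install. The sketch below uses only the standard library. The function names are hypothetical and the versions in the usage note are placeholders, not values taken from the container.

```python
from importlib import metadata


def snapshot_versions():
    """Map each installed distribution name (lowercased) to its version."""
    return {
        dist.metadata["Name"].lower(): dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]  # skip distributions with broken metadata
    }


def changed_packages(before, after):
    """Return {name: (old_version, new_version)} for added, removed, or changed packages."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes
```

To use it, call snapshot_versions() in a fresh container and save the result (for example with json.dump), run the pip install, take a second snapshot, and pass both to changed_packages. With placeholder inputs {"torch": "A", "apex": "X"} and {"torch": "B", "apex": "X", "megablocks": "M"}, the result lists torch as ("A", "B") and megablocks as (None, "M"). An entry for torch would mean the install replaced the PyTorch that amp_C was built against.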

RachitBansal commented 1 month ago

I am using the nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container, which already has PyTorch installed. Do you suggest installing a specific alternate version?

mvpatel2000 commented 1 month ago

We use and recommend these images: https://github.com/mosaicml/composer/tree/main/docker