Open · RachitBansal opened this issue 1 month ago
I've seen this a few times when you build against the wrong version of PyTorch and the install goes sideways. I would print the whole install log and check whether anything is being reinstalled.
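For example, a minimal way to capture and inspect the full log (just a sketch; the package spec is whatever you actually installed):

```bash
# Reinstall with verbose output and keep the full log around.
pip install --no-cache-dir -v megablocks 2>&1 | tee megablocks-install.log

# If pip uninstalled or replaced the container's existing torch/apex build,
# it will show up here; that mismatch can cause undefined-symbol errors.
grep -iE "uninstalling|found existing installation|torch" megablocks-install.log
```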
I am using the nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container, which already ships with PyTorch installed. Do you suggest installing a specific alternate version?
We use and recommend these images: https://github.com/mosaicml/composer/tree/main/docker
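To illustrate, pulling and entering one of those images could look like the following; the image name and tag here are assumptions on my part, so check the linked page for the actual supported tags:

```bash
# Illustrative only: image/tag assumed from Docker Hub's mosaicml/pytorch repo;
# see https://github.com/mosaicml/composer/tree/main/docker for the real list.
docker pull mosaicml/pytorch:latest
docker run --gpus all -it --rm mosaicml/pytorch:latest bash
```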
I am trying to set up and use megablocks to train MoE models, but I see the following error:
I am working in NGC's nvcr.io/nvidia/pytorch:23.09-py3 PyTorch container.
When I try running GPT-2 training (using `exp/gpt2/gpt2_gpt2_46m_1gpu.sh`) before doing a `pip install megablocks`, it works totally fine, while the MoE script (`exp/moe/moe_125m_8gpu_interactive.sh`) gives the error `Megablocks not available`.

However, after I do a `pip install megablocks` or `pip install .` in the container, even the GPT-2 script (and the MoE one) starts giving the above error regarding `amp_C` and the undefined symbol.
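For reference, here is a quick sanity check one could run inside the container to see whether the pip install replaced the stock torch/apex build (a sketch based on standard pip/torch/apex tooling; my guess at the cause):

```bash
# See which torch-related packages pip now manages inside the container.
pip list 2>/dev/null | grep -iE "torch|megablocks|apex"

# Confirm which torch build is actually imported, and whether apex's
# compiled amp_C extension still loads against it.
python -c "import torch; print(torch.__version__, torch.version.cuda)"
python -c "import amp_C" && echo "amp_C imports fine" || echo "amp_C is broken"
```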