NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0
11.5k stars 2.41k forks

NeMo Megatron dataset helper makefile compiles output to write protected container folder (Singularity) #5820

Closed Lauler closed 1 year ago

Lauler commented 1 year ago

Describe the bug

The Makefile for Megatron's C++ dataset helpers in NeMo writes its compiled output into the installed package directory. In Singularity containers built from your NVIDIA NGC NeMo containers, that directory is read-only, so the linker (/usr/bin/ld) cannot create the output file and training crashes on HPC clusters.

0: [NeMo I 2023-01-18 08:00:59 gpt_dataset:488]  > elasped time to build and save doc-idx mapping (seconds): 40.719332
0: make: Entering directory '/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/megatron'
0: g++ -O3 -Wall -shared -std=c++11 -fPIC -fdiagnostics-color -I/opt/conda/include/python3.8 -I/opt/conda/lib/python3.8/site-packages/pybind11/include helpers.cpp -o helpers.cpython-38-x86_64-linux-gnu.so
0: /usr/bin/ld: cannot open output file helpers.cpython-38-x86_64-linux-gnu.so: Read-only file system
0: collect2: error: ld returned 1 exit status
0: make: *** [Makefile:23: helpers.cpython-38-x86_64-linux-gnu.so] Error 1
0: make: Leaving directory '/opt/conda/lib/python3.8/site-packages/nemo/collections/nlp/data/language_modeling/megatron'
0: [NeMo E 2023-01-18 08:01:30 dataset_utils:83] Making C++ dataset helpers module failed, exiting.
0: [NeMo W 2023-01-18 08:01:31 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py:431: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
0:       rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")

Steps/Code to reproduce bug

Build a Singularity container for use on HPC. This is our definition file, nemo.def, which we build locally (outside the HPC environment) with sudo singularity build nemo2209.sif nemo.def:

From: nvcr.io/nvidia/nemo:22.09

%environment
    export LC_ALL=C

We transfer the image nemo2209.sif to the HPC cluster and follow the NeMo GPT training docs.

See this other issue for sbatch config and launch script (changing --nodes=2 to --nodes=1).

Expected behavior

Most users will probably use NeMo Megatron on HPC, where they don't have sudo rights and need to use Singularity instead of Docker. It would be nice if you tested that your documentation examples can be launched with Singularity containers on systems where you do not have root/sudo. A container that is already built should be able to launch training without errors, and without compiling extra components that have to be written to write-protected folders.

Regular NVIDIA PyTorch containers work out of the box when converted to Singularity containers and used with Megatron-LM.

Environment overview

HPC cluster, Slurm.

Environment details

NGC NeMo containers 22.08 and 22.09.

Additional context

A100 GPUs.

Lauler commented 1 year ago

We solve this on our end by adding a %post section to the definition file. During the build, it changes directory to the location where the NeMo Python package is installed inside the container and runs make to compile the C++ dataset helper.
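
As a rough way to find that directory and check whether the helper is already compiled, a small Python sketch along these lines could be run inside the container. The function names package_dir and helpers_status are made up for illustration. The megatron sub-path is taken from the make output in the log above, and the glob pattern helpers*.so is an assumption based on the output file name shown there.

```python
import importlib.util
import os
from pathlib import Path
from typing import Optional

# Sub-path taken from the make output in the log above.
HELPERS_SUBDIR = Path("collections/nlp/data/language_modeling/megatron")


def package_dir(name: str) -> Optional[Path]:
    """Return the install directory of a top-level package without importing it."""
    spec = importlib.util.find_spec(name)
    if spec is None or not spec.submodule_search_locations:
        return None
    return Path(next(iter(spec.submodule_search_locations)))


def helpers_status(pkg_root: Path) -> dict:
    """Report whether a compiled helper exists and whether make could write one."""
    helpers_dir = pkg_root / HELPERS_SUBDIR
    return {
        "dir": helpers_dir,
        # Assumed naming: helpers.cpython-<ver>-<arch>.so, as in the log.
        "compiled": helpers_dir.is_dir() and any(helpers_dir.glob("helpers*.so")),
        # False on a read-only file system or a missing directory.
        "writable": os.access(helpers_dir, os.W_OK),
    }


if __name__ == "__main__":
    root = package_dir("nemo")
    if root is None:
        print("nemo is not installed in this environment")
    else:
        print(helpers_status(root))
```

If "compiled" is False and "writable" is also False, training would be expected to hit the linker error shown in the log, and the helper needs to be built while the image is still writable (for example in %post).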

Ideally, however, this file should be compiled when you guys build the Docker container and install NeMo in it, @ericharper. That would make it easier to use NeMo containers on HPC.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 7 days since being marked as stale.

savitamittal1 commented 10 months ago

@Lauler, I am running into the same issue. Did you build helpers.cpp in the container? Can you share the container with the %post and make commands?

Lauler commented 10 months ago

Identify where Python is installed in the container and where the nemo package is located in site-packages, then run make in that folder. Here's my example for the nemo:23.02 container:

BootStrap: docker
From: nvcr.io/nvidia/nemo:23.02

%environment
    export LC_ALL=C

%post
    cd /usr/local/lib/python3.8/dist-packages/nemo/collections/nlp/data/language_modeling/megatron
    make
    pip install accelerate

AnirudhVIyer commented 6 months ago

@Lauler, how long did it take for your image to build? I am building it remotely, and it takes a lot of time. Would it be possible for you to share your .sif file?