NVIDIA / NeMo-Framework-Launcher

Provides end-to-end model development pipelines for LLMs and Multimodal models that can be launched on-prem or cloud-native.
Apache License 2.0

Docker Build Fails #184

Open TaekyungHeo opened 10 months ago

TaekyungHeo commented 10 months ago

Issue Description

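When attempting to build a Docker image from the latest branch of NeMo-Megatron-Launcher, the build fails at the NeMo installation step: pip cannot build megatron-core==0.4.0 from its PyPI source distribution because megatron/core/requirements.txt is missing.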

Steps to Reproduce

Run the Docker build command:

$ docker build .
...
52.26   WARNING: Missing build requirements in pyproject.toml for megatron-core==0.4.0 from https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from nemo-toolkit==1.21.0rc0).
52.26   WARNING: The project does not specify a build backend, and pip cannot fall back to setuptools without 'wheel'.
52.26   Getting requirements to build wheel: started
52.70   Getting requirements to build wheel: finished with status 'error'
52.70   ERROR: Command errored out with exit status 1:
52.70    command: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmphls_nknx
52.70        cwd: /tmp/pip-install-nn9_ikhg/megatron-core_0c928c7c63d747598ef18d54f6ec6286
52.70   Complete output (18 lines):
52.70   Traceback (most recent call last):
52.70     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 349, in <module>
52.70       main()
52.70     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 331, in main
52.70       json_out['return_val'] = hook(**hook_input['kwargs'])
52.70     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 117, in get_requires_for_build_wheel
52.70       return hook(config_settings)
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
52.70       return self._get_build_requires(config_settings, requirements=['wheel'])
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 320, in _get_build_requires
52.70       self.run_setup()
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 483, in run_setup
52.70       super(_BuildMetaLegacyBackend,
52.70     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 335, in run_setup
52.70       exec(code, locals())
52.70     File "<string>", line 52, in <module>
52.70     File "<string>", line 45, in req_file
52.70   FileNotFoundError: [Errno 2] No such file or directory: 'megatron/core/requirements.txt'
52.70   ----------------------------------------
52.70 WARNING: Discarding https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from https://pypi.org/simple/megatron-core/). Command errored out with exit status 1: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmphls_nknx Check the logs for full command output.
52.70 ERROR: Could not find a version that satisfies the requirement megatron-core==0.4.0; extra == "nlp" (from nemo-toolkit[nlp]) (from versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0)
52.70 ERROR: No matching distribution found for megatron-core==0.4.0; extra == "nlp"
52.93 WARNING: You are using pip version 21.2.4; however, version 23.3.2 is available.
52.93 You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.
------
Dockerfile:110
--------------------
 109 |     ARG NEMO_COMMIT
 110 | >>> RUN git clone https://github.com/NVIDIA/NeMo.git && \
 111 | >>>     cd NeMo && \
 112 | >>>     if [ ! -z $NEMO_COMMIT ]; then \
 113 | >>>         git fetch origin $NEMO_COMMIT && \
 114 | >>>         git checkout FETCH_HEAD; \
 115 | >>>     fi && \
 116 | >>>     pip uninstall -y nemo_toolkit sacrebleu && \
 117 | >>>     pip install -e ".[nlp]" && \
 118 | >>>     cd nemo/collections/nlp/data/language_modeling/megatron && \
 119 | >>>     make
 120 |     
--------------------
ERROR: failed to solve: process "/bin/sh -c git clone https://github.com/NVIDIA/NeMo.git &&     cd NeMo &&     if [ ! -z $NEMO_COMMIT ]; then         git fetch origin $NEMO_COMMIT &&         git checkout FETCH_HEAD;     fi &&     pip uninstall -y nemo_toolkit sacrebleu &&     pip install -e \".[nlp]\" &&     cd nemo/collections/nlp/data/language_modeling/megatron &&     make" did not complete successfully: exit code: 1
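The traceback points at a file missing from the source distribution itself, which can be checked outside Docker. The following is a minimal sketch, not a confirmed diagnosis: `sdist_contains` is a hypothetical helper, and the local file name `megatron_core-0.4.0.tar.gz` is an assumption based on the URL in the log.

```shell
#!/bin/sh
# sdist_contains TARBALL RELATIVE_PATH
# Hypothetical helper: succeeds if any member path inside the gzipped
# tarball contains "/RELATIVE_PATH" (literal match, no regex).
sdist_contains() {
    tar -tzf "$1" | grep -qF -- "/$2"
}

# Usage (needs network access), with the sdist URL taken from the log:
#   curl -sSL -o megatron_core-0.4.0.tar.gz \
#     https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz
#   sdist_contains megatron_core-0.4.0.tar.gz megatron/core/requirements.txt \
#     || echo "requirements.txt is missing from the sdist"
```

If the helper reports the file as missing, the failure would come from the published package rather than from the Dockerfile.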

Additional Context

TaekyungHeo commented 10 months ago

Related issue: https://github.com/NVIDIA/Megatron-LM/issues/650

JanuszL commented 10 months ago

If:

# Install Megatron-core
ARG MEGATRONCORE_COMMIT
RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
    cd Megatron-LM && \
    if [ ! -z $MEGATRONCORE_COMMIT ]; then \
        git fetch origin $MEGATRONCORE_COMMIT && \
        git checkout FETCH_HEAD; \
    fi && \
    pip install -e .

goes before:


ARG NEMO_COMMIT
RUN git clone https://github.com/NVIDIA/NeMo.git && \
    cd NeMo && \
    if [ ! -z $NEMO_COMMIT ]; then \
        git fetch origin $NEMO_COMMIT && \
        git checkout FETCH_HEAD; \
    fi && \
    pip uninstall -y nemo_toolkit sacrebleu && \
    pip install -e ".[nlp]" && \
    cd nemo/collections/nlp/data/language_modeling/megatron && \
    make

then it should build from source correctly. Currently NeMo is installed first and Megatron-core, which is its dependency, is installed afterwards; the order should be reversed.

TaekyungHeo commented 10 months ago

I appreciate your suggestion, @JanuszL! However, it does not seem to work. I will wait for a bugfix from the developers.

My diff for Dockerfile:

diff --git Dockerfile Dockerfile
index b250d0d..f823caf 100644
--- Dockerfile
+++ Dockerfile
@@ -105,6 +105,17 @@ RUN pip uninstall -y apex && \
     fi && \
     pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distribut

+# Install Megatron-core
+ARG MEGATRONCORE_COMMIT
+RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
+    cd Megatron-LM && \
+    if [ ! -z $MEGATRONCORE_COMMIT ]; then \
+        git fetch origin $MEGATRONCORE_COMMIT && \
+        git checkout FETCH_HEAD; \
+    fi && \
+    pip install -e .
+
+
 # Install NeMo
 ARG NEMO_COMMIT
 RUN git clone https://github.com/NVIDIA/NeMo.git && \
@@ -131,17 +142,6 @@ RUN git clone https://github.com/NVIDIA/TransformerEngine.git && \
     fi && \
     git submodule init && git submodule update && \
     NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
-
-# Install Megatron-core
-ARG MEGATRONCORE_COMMIT
-RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
-    cd Megatron-LM && \
-    if [ ! -z $MEGATRONCORE_COMMIT ]; then \
-        git fetch origin $MEGATRONCORE_COMMIT && \
-        git checkout FETCH_HEAD; \
-    fi && \
-    pip install -e .
-
 # Install launch scripts
 COPY . NeMo-Megatron-Launcher
 RUN cd NeMo-Megatron-Launcher && \

Docker build result:

$ docker build .
104.0   WARNING: Missing build requirements in pyproject.toml for megatron-core==0.4.0 from https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from nemo-toolkit==1.21.0rc0).
104.0   WARNING: The project does not specify a build backend, and pip cannot fall back to setuptools without 'wheel'.
104.0   Getting requirements to build wheel: started
104.4   Getting requirements to build wheel: finished with status 'error'
104.4   ERROR: Command errored out with exit status 1:
104.4    command: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpo5jo_szg
104.4        cwd: /tmp/pip-install-t3ld3nxc/megatron-core_34e185c9e75b43368c5fb95d140aec35
104.4   Complete output (18 lines):
104.4   Traceback (most recent call last):
104.4     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 349, in <module>
104.4       main()
104.4     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 331, in main
104.4       json_out['return_val'] = hook(**hook_input['kwargs'])
104.4     File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 117, in get_requires_for_build_wheel
104.4       return hook(config_settings)
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
104.4       return self._get_build_requires(config_settings, requirements=['wheel'])
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 320, in _get_build_requires
104.4       self.run_setup()
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 483, in run_setup
104.4       super(_BuildMetaLegacyBackend,
104.4     File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 335, in run_setup
104.4       exec(code, locals())
104.4     File "<string>", line 52, in <module>
104.4     File "<string>", line 45, in req_file
104.4   FileNotFoundError: [Errno 2] No such file or directory: 'megatron/core/requirements.txt'
104.4   ----------------------------------------
104.4 WARNING: Discarding https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from https://pypi.org/simple/megatron-core/). Command errored out with exit status 1: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpo5jo_szg Check the logs for full command output.
104.4 ERROR: Could not find a version that satisfies the requirement megatron-core==0.4.0; extra == "nlp" (from nemo-toolkit[nlp]) (from versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0)
104.4 ERROR: No matching distribution found for megatron-core==0.4.0; extra == "nlp"
105.4 WARNING: You are using pip version 21.2.4; however, version 23.3.2 is available.
105.4 You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.
------
ERROR: failed to solve: failed to solve with frontend dockerfile.v0: failed to build LLB: executor failed running [/bin/sh -c git clone https://github.com/NVIDIA/NeMo.git &&     cd NeMo &&     if [ ! -z $NEMO_COMMIT ]; then         git fetch origin $NEMO_COMMIT &&         git checkout FETCH_HEAD;     fi &&     pip uninstall -y nemo_toolkit sacrebleu &&     pip install -e ".[nlp]" &&     cd nemo/collections/nlp/data/language_modeling/megatron &&     make]: runc did not terminate sucessfully

JanuszL commented 10 months ago

It seems that the top of tree (ToT) of Megatron-LM builds version 0.4.0rc0, while NeMo expects 0.4.0. On top of your Dockerfile change, you can add the --build-arg "MEGATRONCORE_COMMIT=core_v0.4.0" parameter to the docker build command.
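Spelled out, the suggested invocation might look like the sketch below. The ref `core_v0.4.0` and the build argument name come from this thread; `build_cmd` and `versions_match` are hypothetical names used only for illustration, and the final check assumes it is run inside the built image.

```shell
#!/bin/sh
# Ref suggested in this thread; override by exporting MEGATRONCORE_COMMIT.
MEGATRONCORE_COMMIT="${MEGATRONCORE_COMMIT:-core_v0.4.0}"

# Print the command first so the ref can be checked before a long build.
build_cmd="docker build --build-arg MEGATRONCORE_COMMIT=${MEGATRONCORE_COMMIT} ."
echo "$build_cmd"

# Run it once the printed command looks right:
# $build_cmd

# versions_match INSTALLED EXPECTED (hypothetical helper): succeeds only on
# an exact string match, so 0.4.0rc0 is rejected where 0.4.0 is expected.
versions_match() {
    [ "$1" = "$2" ]
}

# Inside the image, read the installed megatron-core version from pip.
# The result is empty if pip or the package is not available.
installed="$(pip show megatron-core 2>/dev/null | sed -n 's/^Version: //p')"
if versions_match "$installed" "0.4.0"; then
    echo "megatron-core matches the version NeMo expects"
else
    echo "megatron-core mismatch: found '${installed:-none}', expected 0.4.0"
fi
```

The exact-match comparison is deliberate: a release candidate such as 0.4.0rc0 does not satisfy a `==0.4.0` requirement, which is the mismatch described above.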

TaekyungHeo commented 10 months ago

Thanks, @JanuszL. I tried your suggestion both before and after modifying the Dockerfile. Without the modifications, it still prints out the same error. However, when I change the Dockerfile, the pip installation stage takes an unusually long time.

JanuszL commented 10 months ago

@TaekyungHeo thank you for checking. I think I may lack the necessary understanding of the build logic used here. Let us wait for the project maintainers to share their thoughts.

vishakha-lall commented 9 months ago

This is not a proper solution, but I faced the same issue while installing nemo_toolkit[all] and worked around it by reverting the package to the previous version, nemo-toolkit==1.21.0, released in 2023, instead of the current one, which was released in January 2024.
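As a rough sketch of that workaround, the pin could be written as follows. Combining the `[all]` extra with the `==1.21.0` pin in one requirement is an assumption, and `NEMO_VERSION` and `requirement` are hypothetical variable names.

```shell
#!/bin/sh
# Version mentioned in this thread; override by exporting NEMO_VERSION.
NEMO_VERSION="${NEMO_VERSION:-1.21.0}"

# Keep the requirement quoted: unquoted square brackets can be
# interpreted as a glob pattern by some shells.
requirement="nemo_toolkit[all]==${NEMO_VERSION}"
echo "pip install '${requirement}'"

# Run the install once the printed requirement looks right:
# pip install "$requirement"
```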