Open TaekyungHeo opened 10 months ago
Related issue: https://github.com/NVIDIA/Megatron-LM/issues/650
If:
```dockerfile
# Install Megatron-core
ARG MEGATRONCORE_COMMIT
RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
    cd Megatron-LM && \
    if [ ! -z $MEGATRONCORE_COMMIT ]; then \
        git fetch origin $MEGATRONCORE_COMMIT && \
        git checkout FETCH_HEAD; \
    fi && \
    pip install -e .
```
goes before:
```dockerfile
ARG NEMO_COMMIT
RUN git clone https://github.com/NVIDIA/NeMo.git && \
    cd NeMo && \
    if [ ! -z $NEMO_COMMIT ]; then \
        git fetch origin $NEMO_COMMIT && \
        git checkout FETCH_HEAD; \
    fi && \
    pip uninstall -y nemo_toolkit sacrebleu && \
    pip install -e ".[nlp]" && \
    cd nemo/collections/nlp/data/language_modeling/megatron && \
    make
```
Then it should build from source correctly. Currently NeMo is installed first and Megatron-core, which is its dependency, is installed afterwards (it should be the reverse).
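One way to check the install order from inside the image is to ask pip which Megatron-core version it already sees before NeMo is installed. The snippet below is only a sketch: the helper name `installed_version` is hypothetical, and it assumes `pip` and `sed` are on the PATH.

```shell
#!/bin/sh
# Hypothetical helper: print the version pip reports for a package,
# or nothing if the package is not installed (or pip is unavailable).
installed_version() {
    pip show "$1" 2>/dev/null | sed -n 's/^Version: //p'
}

# Report what pip currently sees for megatron-core.
version="$(installed_version megatron-core || true)"
echo "megatron-core: ${version:-not installed}"
```

If this is run as an extra `RUN` step right before the NeMo install, the reported version can be compared with the one NeMo pins.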
I appreciate your suggestion, @JanuszL! However, it does not seem to work. I will wait for a bugfix from the developers.
My diff for Dockerfile:
diff --git Dockerfile Dockerfile
index b250d0d..f823caf 100644
--- Dockerfile
+++ Dockerfile
@@ -105,6 +105,17 @@ RUN pip uninstall -y apex && \
fi && \
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_layer_norm" --global-option="--distribut
+# Install Megatron-core
+ARG MEGATRONCORE_COMMIT
+RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
+ cd Megatron-LM && \
+ if [ ! -z $MEGATRONCORE_COMMIT ]; then \
+ git fetch origin $MEGATRONCORE_COMMIT && \
+ git checkout FETCH_HEAD; \
+ fi && \
+ pip install -e .
+
+
# Install NeMo
ARG NEMO_COMMIT
RUN git clone https://github.com/NVIDIA/NeMo.git && \
@@ -131,17 +142,6 @@ RUN git clone https://github.com/NVIDIA/TransformerEngine.git && \
fi && \
git submodule init && git submodule update && \
NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
-
-# Install Megatron-core
-ARG MEGATRONCORE_COMMIT
-RUN git clone https://github.com/NVIDIA/Megatron-LM.git && \
- cd Megatron-LM && \
- if [ ! -z $MEGATRONCORE_COMMIT ]; then \
- git fetch origin $MEGATRONCORE_COMMIT && \
- git checkout FETCH_HEAD; \
- fi && \
- pip install -e .
-
# Install launch scripts
COPY . NeMo-Megatron-Launcher
RUN cd NeMo-Megatron-Launcher && \
Docker build result:
$ docker build .
104.0 WARNING: Missing build requirements in pyproject.toml for megatron-core==0.4.0 from https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from nemo-toolkit==1.21.0rc0).
104.0 WARNING: The project does not specify a build backend, and pip cannot fall back to setuptools without 'wheel'.
104.0 Getting requirements to build wheel: started
104.4 Getting requirements to build wheel: finished with status 'error'
104.4 ERROR: Command errored out with exit status 1:
104.4 command: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpo5jo_szg
104.4 cwd: /tmp/pip-install-t3ld3nxc/megatron-core_34e185c9e75b43368c5fb95d140aec35
104.4 Complete output (18 lines):
104.4 Traceback (most recent call last):
104.4 File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 349, in <module>
104.4 main()
104.4 File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 331, in main
104.4 json_out['return_val'] = hook(**hook_input['kwargs'])
104.4 File "/usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py", line 117, in get_requires_for_build_wheel
104.4 return hook(config_settings)
104.4 File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 338, in get_requires_for_build_wheel
104.4 return self._get_build_requires(config_settings, requirements=['wheel'])
104.4 File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 320, in _get_build_requires
104.4 self.run_setup()
104.4 File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 483, in run_setup
104.4 super(_BuildMetaLegacyBackend,
104.4 File "/usr/local/lib/python3.8/dist-packages/setuptools/build_meta.py", line 335, in run_setup
104.4 exec(code, locals())
104.4 File "<string>", line 52, in <module>
104.4 File "<string>", line 45, in req_file
104.4 FileNotFoundError: [Errno 2] No such file or directory: 'megatron/core/requirements.txt'
104.4 ----------------------------------------
104.4 WARNING: Discarding https://files.pythonhosted.org/packages/fd/b9/e85da25f4de43dad70d6fd1c21b88db085f471d5348c51cce05dc9e4b0ef/megatron_core-0.4.0.tar.gz#sha256=bb2cd1f4c5746b31a8b4abd676820ddceec272f002873801a519dbbf1352d8ef (from https://pypi.org/simple/megatron-core/). Command errored out with exit status 1: /usr/bin/python /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/in_process/_in_process.py get_requires_for_build_wheel /tmp/tmpo5jo_szg Check the logs for full command output.
104.4 ERROR: Could not find a version that satisfies the requirement megatron-core==0.4.0; extra == "nlp" (from nemo-toolkit[nlp]) (from versions: 0.1.0, 0.2.0, 0.3.0, 0.4.0)
104.4 ERROR: No matching distribution found for megatron-core==0.4.0; extra == "nlp"
105.4 WARNING: You are using pip version 21.2.4; however, version 23.3.2 is available.
105.4 You should consider upgrading via the '/usr/bin/python -m pip install --upgrade pip' command.
------
ERROR: failed to solve: failed to solve with frontend dockerfile.v0: failed to build LLB: executor failed running [/bin/sh -c git clone https://github.com/NVIDIA/NeMo.git && cd NeMo && if [ ! -z $NEMO_COMMIT ]; then git fetch origin $NEMO_COMMIT && git checkout FETCH_HEAD; fi && pip uninstall -y nemo_toolkit sacrebleu && pip install -e ".[nlp]" && cd nemo/collections/nlp/data/language_modeling/megatron && make]: runc did not terminate sucessfully
It seems that the ToT (top of tree) of Megatron-LM builds 0.4.0rc0, while NeMo expects 0.4.0.
On top of your Dockerfile change, you can add the `--build-arg "MEGATRONCORE_COMMIT=core_v0.4.0"` parameter to the docker build command.
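As a minimal sketch, the full invocation could look like the following. `MEGATRONCORE_REF`, `DRY_RUN`, and `build_image` are hypothetical names introduced only for this example; `DRY_RUN` defaults to 1 so the command is printed rather than executed.

```shell
#!/bin/sh
# Ref to check out in Megatron-LM; the tag name comes from the suggestion above.
MEGATRONCORE_REF="${MEGATRONCORE_REF:-core_v0.4.0}"
# Hypothetical switch for this sketch: 1 prints the command, anything else runs it.
DRY_RUN="${DRY_RUN:-1}"

build_image() {
    if [ "$DRY_RUN" = "1" ]; then
        echo docker build --build-arg "MEGATRONCORE_COMMIT=$MEGATRONCORE_REF" .
    else
        docker build --build-arg "MEGATRONCORE_COMMIT=$MEGATRONCORE_REF" .
    fi
}

build_image
```

Run it from the directory containing the Dockerfile with `DRY_RUN=0` to start the actual build.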
Thanks, @JanuszL. I tried your suggestion both before and after modifying the Dockerfile. Without the modifications, it still prints out the same error. However, when I change the Dockerfile, the pip installation stage takes an unusually long time.
@TaekyungHeo thank you for checking. I think I may lack the necessary understanding of the build logic used here. Let us wait for the project maintainers to share their thoughts.
This is not a proper fix, but I was facing the same issue while installing nemo_toolkit[all], and I worked around it by reverting the package to the previous version, nemo-toolkit==1.21.0 (released in 2023), instead of the current one, which was released in January 2024.
Issue Description
When attempting to build a Docker image using the latest branch of the NeMo-Megatron-Launcher, the build fails.
Steps to Reproduce
Run the Docker build command:
Additional Context
[...] `megatron_core==0.4.0` package, which is installed as part of the Docker build process.
[...] `megatron_core` team. Peter Dykas replied that we need to use python3.10.
[...] (`--build-arg NEMO_COMMIT=c7948b26a00c91a7332d9eb04f4d66725e9d62e3`) installs a previous megatron package (0.3.0) but leads to failure in the data preparation stage, possibly due to other issues resolved in the latest NeMo version.