Docker build failing. Also, is there a .nemo reward model file available?

rundiffusion commented 6 months ago

Unfortunately, this issue spans two repos, so I'll try to give enough context to show what I need fixed in this one.

I'm following this research: https://developer.nvidia.com/blog/enhance-text-to-image-fine-tuning-with-draft-now-part-of-nvidia-nemo/
and the tutorial here: https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/draftp.html#model-aligner-draftp

I'm trying to build from the Dockerfile and getting an error deep into the build process:

308.8 [31/33] Building CUDA object common/CMakeFiles/transformer_engine.dir/fused_attn/utils.cu.o
308.8 [32/33] Building CUDA object common/CMakeFiles/transformer_engine.dir/layer_norm/ln_fwd_cuda_kernel.cu.o
308.8 ninja: build stopped: subcommand failed.
308.8 Traceback (most recent call last):
308.8   File "/workspace/TransformerEngine/setup.py", line 356, in _build_cmake
308.8     subprocess.run(command, cwd=build_dir, check=True)
308.8   File "/usr/lib/python3.10/subprocess.py", line 526, in run
308.8     raise CalledProcessError(retcode, process.args,
308.8 subprocess.CalledProcessError: Command '['/usr/local/lib/python3.10/dist-packages/cmake/data/bin/cmake', '--build', '/tmp/tmpgp_h_7iw']' returned non-zero exit status 1.
308.8
308.8 During handling of the above exception, another exception occurred:
308.8
308.8 Traceback (most recent call last):
308.8   File "", line 2, in
308.8   File "", line 34, in
308.8   File "/workspace/TransformerEngine/setup.py", line 629, in
308.8     main()
308.8   File "/workspace/TransformerEngine/setup.py", line 614, in main
308.8     setuptools.setup(
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/__init__.py", line 103, in setup
308.8     return distutils.core.setup(**attrs)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 185, in setup
308.8     return run_commands(dist)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/core.py", line 201, in run_commands
308.8     dist.run_commands()
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 969, in run_commands
308.8     self.run_command(cmd)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
308.8     super().run_command(command)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
308.8     cmd_obj.run()
308.8   File "/usr/local/lib/python3.10/dist-packages/wheel/bdist_wheel.py", line 368, in run
308.8     self.run_command("build")
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
308.8     self.distribution.run_command(command)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
308.8     super().run_command(command)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
308.8     cmd_obj.run()
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/command/build.py", line 131, in run
308.8     self.run_command(cmd_name)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/cmd.py", line 318, in run_command
308.8     self.distribution.run_command(command)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/dist.py", line 989, in run_command
308.8     super().run_command(command)
308.8   File "/usr/local/lib/python3.10/dist-packages/setuptools/_distutils/dist.py", line 988, in run_command
308.8     cmd_obj.run()
308.8   File "/workspace/TransformerEngine/setup.py", line 386, in run
308.8     ext._build_cmake(
308.8   File "/workspace/TransformerEngine/setup.py", line 358, in _build_cmake
308.8     raise RuntimeError(f"Error when running CMake: {e}")
308.8 RuntimeError: Error when running CMake: Command '['/usr/local/lib/python3.10/dist-packages/cmake/data/bin/cmake', '--build', '/tmp/tmpgp_h_7iw']' returned non-zero exit status 1.
308.8 [end of output]
308.8
308.8 note: This error originates from a subprocess, and is likely not a problem with pip.
308.8 ERROR: Failed building wheel for transformer-engine
312.2 Building wheel for flash-attn (setup.py): started
316.8 Building wheel for flash-attn (setup.py): finished with status 'done'
316.9 Created wheel for flash-attn: filename=flash_attn-2.4.2-cp310-cp310-linux_x86_64.whl size=113822687 sha256=075eeda487ce2a48319e336306a70e53df42c92d9cb91e12b03a63027b54b145
316.9 Stored in directory: /tmp/pip-ephem-wheel-cache-7b_j_yrl/wheels/9d/cf/7f/d14555553b5b30698dae0a4159fdd058157e7021cec565ecaa
316.9 Successfully built flash-attn
316.9 Failed to build transformer-engine
316.9 ERROR: Could not build wheels for transformer-engine, which is required to install pyproject.toml-based projects
317.3
317.3 [notice] A new release of pip is available: 23.3.2 -> 24.0
317.3 [notice] To update, run: python -m pip install --upgrade pip

Dockerfile:80

  79 | # Transformer Engine 1.2.0
  80 | >>> RUN git clone https://github.com/NVIDIA/TransformerEngine.git && \
  81 | >>>     cd TransformerEngine && \
  82 | >>>     git fetch origin da30634a6c9ccdbb6c587b6c93b1860e4b038204 && \
  83 | >>>     git checkout FETCH_HEAD && \
  84 | >>>     git submodule init && git submodule update && \
  85 | >>>     NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
  86 |

ERROR: failed to solve: process "/bin/sh -c git clone https://github.com/NVIDIA/TransformerEngine.git && cd TransformerEngine && git fetch origin da30634a6c9ccdbb6c587b6c93b1860e4b038204 && git checkout FETCH_H
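The output above is only the tail of the build: "ninja: build stopped: subcommand failed" reports that some earlier compile step failed, and the message for that step usually sits further up in the log. A minimal sketch for locating it, assuming the full build output has been saved to a text file; the helper name `find_failures` and the exact prefix pattern are illustrative assumptions, not part of any NVIDIA tooling:

```python
import re

# Assumption: each log line may start with a BuildKit step tag ("#12 ")
# and/or an elapsed-seconds stamp ("308.8 "), as in the output above.
_PREFIX = re.compile(r"^(?:#\d+\s+)?\d+\.\d+\s+")


def find_failures(log_text):
    """Return (line_number, text) pairs for lines that look like failure reports."""
    hits = []
    for number, raw in enumerate(log_text.splitlines(), start=1):
        line = _PREFIX.sub("", raw).strip()
        # ninja marks a failed step with "FAILED:", pip uses "ERROR:",
        # and compilers typically print a lowercase "error:".
        if line.startswith(("FAILED:", "ERROR:")) or "error:" in line:
            hits.append((number, line))
    return hits
```

One way to capture the full output is `docker build --progress=plain . 2>&1 | tee build.log`, then pass the contents of `build.log` to `find_failures`. The first hit is usually the one worth reading.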

I'm really just trying to run this command and get all the models available to run the training. https://docs.nvidia.com/nemo-framework/user-guide/latest/modelalignment/draftp.html#model-aligner-draftp

I need:
GPFS="/path/to/nemo-aligner-repo" (this repo)
TRAIN_DATA_PATH="/path/to/train_dataset.tar" (I have this)
UNET_CKPT="/path/to/unet_weights.ckpt" (This is easy to get)
VAE_CKPT="/path/to/vae_weights.bin" (This is easy too)
RM_CKPT="/path/to/reward_model.nemo" (Where is this?)

I think the reward model can be found here: https://huggingface.co/yuvalkirstain/PickScore_v1 but I'm having a hard time converting it to .nemo.

So if this is already somewhere that I can use, that would be great.
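Before launching training, the inputs listed above can be checked in one pass. This is a minimal sketch, assuming the five variables are exported in the environment under the names the tutorial uses; the helper `missing_inputs` is hypothetical and only tests that each path exists, not that the checkpoint inside is valid:

```python
import os

# Variable names as listed in the tutorial command.
REQUIRED_VARS = ["GPFS", "TRAIN_DATA_PATH", "UNET_CKPT", "VAE_CKPT", "RM_CKPT"]


def missing_inputs(env, exists=os.path.exists):
    """Map each unusable variable name to a short description of the problem."""
    problems = {}
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value:
            problems[name] = "not set"
        elif not exists(value):
            problems[name] = "path does not exist: " + value
    return problems
```

Calling `missing_inputs(os.environ)` returns an empty dict when every path is present; otherwise it names what is still missing, for example `RM_CKPT` until the reward model has been converted.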

JRD971000 commented 6 months ago

@rundiffusion Regarding converting the PickScore reward model to .nemo: you can do that with the conversion script mentioned in the tutorial, as follows:

python /PATH_TO_NEMO_GITHUB_REPO/examples/multimodal/vision_language_foundation/clip/convert_external_clip_to_nemo.py --hparams_file /PATH_TO_NEMO_GITHUB_REPO/examples/multimodal/vision_language_foundation/clip/conf/megatron_clip_VIT-H-14.yaml --arch "yuvalkirstain/PickScore_v1" --nemo_file_path=/PATH_TO_SAVED_CKPT/pickscore.nemo

I will provide more updates regarding the container soon.
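For repeated attempts, the same conversion command can be assembled from two paths rather than edited by hand. A minimal sketch, assuming a local NeMo checkout; the function names `build_convert_command` and `convert` are hypothetical, and only the script path and the three flags shown in the command above are used:

```python
import os
import subprocess
import sys


def build_convert_command(nemo_repo, nemo_file_path, arch="yuvalkirstain/PickScore_v1"):
    """Build the argument list for NeMo's external-CLIP conversion script."""
    clip_dir = os.path.join(
        nemo_repo, "examples", "multimodal", "vision_language_foundation", "clip"
    )
    return [
        sys.executable,
        os.path.join(clip_dir, "convert_external_clip_to_nemo.py"),
        "--hparams_file",
        os.path.join(clip_dir, "conf", "megatron_clip_VIT-H-14.yaml"),
        "--arch",
        arch,
        "--nemo_file_path=" + nemo_file_path,
    ]


def convert(nemo_repo, nemo_file_path):
    """Run the conversion, failing early if the script is not where expected."""
    command = build_convert_command(nemo_repo, nemo_file_path)
    if not os.path.isfile(command[1]):
        raise FileNotFoundError("conversion script not found: " + command[1])
    # check=True raises CalledProcessError if the script exits non-zero.
    subprocess.run(command, check=True)
```

Here `convert("/PATH_TO_NEMO_GITHUB_REPO", "/PATH_TO_SAVED_CKPT/pickscore.nemo")` would run the same command as above, with the placeholders replaced by real paths.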

JRD971000 commented 6 months ago

@rundiffusion Are you building the container from the NeMo-Aligner main?

rundiffusion commented 6 months ago

@JRD971000 Yes, I've tried that script. I had to change the mappings to get past the conversion step, and then it failed somewhere else. It was one of those days where nothing seemed to work; I was 10 hours into it and had to step away, so I don't know exactly what I tried or what the issues were.

I can try and go back to it and get specifics. I was having trouble with Docker, NeMo-Aligner, NeMo base repo, the conversion script, and just had to step away and try and see if this was all worth it. Let's talk on LinkedIn really quick and see if we can get a plan before we troubleshoot my env. Of course it's something I'm doing on my end that is causing issues.

JRD971000 commented 6 months ago

@rundiffusion I found the issue: since DRAFT+ is not officially in the NeMo container, the Dockerfile needs some changes. I have added the necessary changes and pushed them to my branch. Please try this branch for building the container and running the conversion script:

python /PATH_TO_NEMO_GITHUB_REPO/examples/multimodal/vision_language_foundation/clip/convert_external_clip_to_nemo.py --hparams_file /PATH_TO_NEMO_GITHUB_REPO/examples/multimodal/vision_language_foundation/clip/conf/megatron_clip_VIT-H-14.yaml --arch "yuvalkirstain/PickScore_v1" --nemo_file_path=/PATH_TO_SAVED_CKPT/pickscore.nemo

Please let me know if everything works as expected, and we would appreciate your feedback on DRAFT+!

P.S. We mentioned in the blog post that DRAFT+ will be in nemo container soon, but that was easy to miss 😀. The finalized version of the container will be out in a couple of weeks!

NVIDIA / NeMo-Aligner

Docker build failing. Also, is there a .nemo reward model file available? #167
