OpenLLMAI / OpenRLHF

An Easy-to-use, Scalable and High-performance RLHF Framework (70B+ PPO Full Tuning & Iterative DPO & LoRA & Mixtral)
https://openrlhf.readthedocs.io/
Apache License 2.0

[For your information] Ways to build environment and run openrlhf codes on a slurm cluster #251

Closed glorgao closed 2 months ago

glorgao commented 3 months ago

I am a Slurm cluster user.

I found it straightforward to build a conda environment and run the code on a local machine using the provided build_openrlhf.sh script. Well done!

However, the build_openrlhf.sh script does not work on a Slurm cluster; you may hit specific errors when running the SFT code. Below are the errors I encountered and my solutions for them:

ImportError: /home/user/anaconda3/envs/rlhf/lib/python3.10/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN2at4_ops5zeros4callEN3c108ArrayRefINS2_6SymIntEEENS2_8optionalINS2_10ScalarTypeEEENS6_INS2_6LayoutEEENS6_INS2_6DeviceEEENS6_IbEE

This error occurs both on my local machine and on my slurm cluster.

Solution 1: Uninstall flash-attn and reinstall it using the commands below:

pip3 uninstall flash-attn
FLASH_ATTENTION_FORCE_BUILD=TRUE pip3 install flash-attn==2.5.0

Solution 2: Manually reinstall flash-attn with the matching cxx11abiFALSE wheel. You can find it on the releases page: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.4.2. This solution comes from https://github.com/Dao-AILab/flash-attention/issues/451.

I opted for solution 1, but I recommend that others choose solution 2 so that the setup exactly matches that of the OpenRLHF project.
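
Before choosing a wheel, it can help to check which CUDA version and cxx11 ABI your PyTorch build uses; this is a generic diagnostic (not part of build_openrlhf.sh), and the example output below is only illustrative:

python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.compiled_with_cxx11_abi())"
# example output: 2.1.2+cu118 11.8 False  -> pick a wheel tagged cu118, torch2.1, cxx11abiFALSE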

ninja: build stopped: subcommand failed.
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/openrlhf/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 2096, in _run_ninja_build
    subprocess.run(
  File "/home/user/anaconda3/envs/openrlhf/lib/python3.10/subprocess.py", line 526, in run
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
...
RuntimeError: Error building extension 'fused_adam'
ImportError: /home/user/.cache/torch_extensions/py310_cu118/fused_adam/fused_adam.so: cannot open shared object file: No such file or directory

This error occurs exclusively on my Slurm cluster. At first glance it seems to stem from the ninja package, but after several attempts it became clear that the pip3-based installation was the actual culprit. I therefore advise Slurm users not to install the requirements with the provided build_openrlhf.sh script, which relies on pip3 and, in my experience, is unreliable in this setup. A more robust alternative is to install the required packages with conda. Below is my yaml file, which has allowed me to run the code successfully on my Slurm cluster; a usage example follows the file.

name: rlhf
channels:
  - huggingface
  - pytorch
  - nvidia/label/cuda-11.8.0
  - defaults
  - conda-forge
dependencies:
  - python = 3.10
  - pip

  - bitsandbytes
  - sentencepiece

  - pytorch::pytorch >= 2.0
  - pytorch::pytorch-mutex =*=*cuda*
  - datasets
  - tokenizers >= 0.13.3
  - einops
  - isort
  - jsonlines
  - loralib
  - optimum
  - wandb
  - packaging
  - peft
  - torchmetrics
  - tqdm
  - transformers==4.38.2
  - wheel
  - nvidia/label/cuda-11.8.0::cuda-toolkit = 11.8
  - pip:
      - accelerate
      - deepspeed==0.13.2
      - flash-attn==2.4.2
      - ray[default]
      - transformers_stream_generator
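
For reference, assuming the yaml above is saved as environment.yml (any filename works), the environment can be created and activated in the usual way:

conda env create -f environment.yml
conda activate rlhf
# quick sanity check that the CUDA-enabled PyTorch build was picked up
python3 -c "import torch; print(torch.__version__, torch.cuda.is_available())"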

I hope the above information saves you some time. Cheers and best wishes in your endeavors!

mickel-liu commented 3 months ago

I suspect this is due to incompatible CUDA compiled versions and cxx11 abi compatibility.

If you are using CUDA 11.8 (torch 2.1, python 3.10), you are better off manually installing the prebuilt wheel from the releases page:

pip3 install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

If you get the error ImportError: libcudart.so.12: cannot open shared object file: No such file or directory, you likely installed with a plain pip3 install flash-attn, which I believe is compiled against CUDA 12.1 by default.

If you are still getting undefined-symbol exceptions but you are sure that flash-attn was compiled against the correct CUDA version (11.8 in my case), try both the cxx11abiFALSE and cxx11abiTRUE wheels. I got the exception with the cxx11abiTRUE version and it worked with the cxx11abiFALSE one.
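
To confirm which CUDA runtime the installed flash-attn extension actually links against (rather than guessing), one option is to inspect the shared object with ldd; the site-packages path below is illustrative and depends on your environment:

# locate the extension without importing it (the import itself may be what fails)
python3 -c "import importlib.util; print(importlib.util.find_spec('flash_attn_2_cuda').origin)"
# then check the linked CUDA runtime (replace the path with the one printed above)
ldd /path/to/site-packages/flash_attn_2_cuda.cpython-310-x86_64-linux-gnu.so | grep libcudart
# a line like "libcudart.so.12 => not found" matches the ImportError above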

catqaq commented 2 months ago

> (quoting @mickel-liu's comment above)

Yep, manually installing flash-attn usually solves the problem.