I suspect this is due to incompatible CUDA compile versions and cxx11 ABI compatibility.

If you are using CUDA 11.8 (torch 2.1, python 3.10), you are better off manually installing the matching prebuilt wheel from the releases page:

```bash
pip3 install https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

If you get the error `ImportError: libcudart.so.12: cannot open shared object file: No such file or directory`, it is likely that you installed with `pip3 install flash-attn`, which I believe is compiled against CUDA 12.1 by default.

If you are still getting exceptions about undefined symbols but you are sure that `flash-attn` was compiled for the correct CUDA version (11.8 in my case), try either the `cxx11abiFALSE` or the `cxx11abiTRUE` wheel. In my case, I got the exception when I installed the `cxx11abiTRUE` version but got it working with the `cxx11abiFALSE` version.
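One way to tell which wheel matches your environment is to check what your installed torch was built with; `torch.version.cuda` and `torch.compiled_with_cxx11_abi()` are standard torch APIs for this:

```bash
# Print the CUDA version torch was built against and whether it uses the cxx11 ABI,
# then pick the flash-attn wheel (cuXXX / cxx11abiTRUE|FALSE) that matches both.
python -c "import torch; print(torch.version.cuda, torch.compiled_with_cxx11_abi())"
```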
Yep, manually installing `flash-attn` usually solves the problem.
I am a Slurm cluster user.

I found it straightforward to build a conda environment and run the code on a local machine using the provided `build_openrlhf.sh` script. Well done! However, the `build_openrlhf.sh` script doesn't work if you are a Slurm cluster user: you may encounter specific errors when running the SFT code. Here are my solutions for these errors.

The first error occurs both on my local machine and on my Slurm cluster.

Solution 1: Uninstall flash-attn and reinstall it using the commands below.
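The original commands did not survive in this thread; a plausible reconstruction, assuming the intent is to rebuild `flash-attn` against the torch and CUDA toolkit already present in the environment, is:

```bash
# Hypothetical reconstruction -- the author's exact commands were not captured.
pip3 uninstall -y flash-attn
# --no-build-isolation compiles against the torch/CUDA already installed in the
# environment instead of a fresh (and possibly mismatched) build environment.
pip3 install flash-attn --no-build-isolation
```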
Solution 2: Manually reinstall flash-attn with the corresponding `cxx11abiFALSE` version. You can find it on this page: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.4.2. This solution comes from https://github.com/Dao-AILab/flash-attention/issues/451.

I opted for solution 1, but I recommend that others choose solution 2 to follow the setup of the OpenRLHF project exactly.
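The exact wheel name for solution 2 depends on your Python, torch, and CUDA combination. Following the naming pattern of the v2.5.6 wheel quoted earlier in the thread, a v2.4.2 install would look something like this (the URL below is inferred from that pattern, so check the release page for the file that actually matches your setup):

```bash
# Inferred wheel name, following the flash-attn release naming convention.
pip3 install https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```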
The second error occurs exclusively on my Slurm cluster. At first glance, it seems to stem from the `ninja` package. However, after several attempts, it became clear that `pip3` was the actual culprit. I strongly advise Slurm users not to install the requirements using the provided `build_openrlhf.sh` script: it relies on `pip3`, which in most cases is unstable in this setting and not recommended. A more reliable alternative is to install the required packages using `conda`. Included here is my yaml file, which has allowed me to successfully run the code on my Slurm cluster.

Hope the above information saves you some time. Cheers and best wishes in your endeavors!
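The yaml file itself was not preserved in this thread. For reference, a conda environment file for this kind of setup usually looks something like the sketch below; every name and version pin here is an illustrative assumption, not the author's actual configuration:

```yaml
# Hypothetical sketch only -- the original environment file was not captured.
name: openrlhf
channels:
  - pytorch
  - nvidia
  - conda-forge
dependencies:
  - python=3.10
  - pytorch=2.1        # match the torchX.Y part of the flash-attn wheel tag
  - pytorch-cuda=11.8  # match the cuXXX part of the flash-attn wheel tag
  - ninja
  - pip
  - pip:
      # flash-attn itself still comes from pip; point it at the matching wheel
      - https://github.com/Dao-AILab/flash-attention/releases/download/v2.5.6/flash_attn-2.5.6+cu118torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```

A file like this would be applied with `conda env create -f environment.yaml`.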