Open netw0rkf10w opened 8 months ago
Because they do not work in a develop branch, we end up getting an unfinished version when pulling the master. The officially supported pytorch<->rocm apex versions are:

| APEX Version | APEX branch | Torch Version |
|---|---|---|
| 1.3.0 | master | 2.3 |
| 1.2.0 | release/1.2.0 | 2.2 |
| 1.1.0 | release/1.1.0 | 2.1 |
| 1.0.0 | release/1.0.0 | 2.0 and older |

So be sure to do some kind of `git clone -b release/1.2.0 ..` if you are working with torch 2.2, which relies on ROCm 5.7.
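
For illustration, the branch table above can be turned into a small lookup. This is only a sketch: `apex_branch_for_torch` is a hypothetical helper (not part of apex), it encodes the table as it stands in this thread, and it refuses torch versions the table does not list.

```shell
# Hypothetical helper: map a PyTorch version to the apex branch from the table above.
# Versions that the table does not list produce an error instead of a guess.
apex_branch_for_torch() {
    # Drop any local suffix such as "+rocm5.7".
    _ver="${1%%+*}"
    # Require at least "major.minor".
    case "$_ver" in
        *.*) ;;
        *) echo "expected a version like 2.2.1, got '$1'" >&2; return 1 ;;
    esac
    _major="${_ver%%.*}"
    _rest="${_ver#*.}"
    _minor="${_rest%%.*}"
    case "$_major.$_minor" in
        2.3) echo "master" ;;
        2.2) echo "release/1.2.0" ;;
        2.1) echo "release/1.1.0" ;;
        2.0|1.*|0.*) echo "release/1.0.0" ;;
        *) echo "no apex branch listed for torch $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the matching clone command for torch 2.2.1.
branch=$(apex_branch_for_torch "2.2.1") && echo "git clone -b $branch https://github.com/ROCm/apex.git"
```

Since `master` moves over time, the `2.3` entry is only valid as long as the table itself is.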
Also, a fix landed in d89d4bd - Correct the CUBLAS_COMPUTE type. You can find the hipify supported conversions (cuda->hip) here: https://rocm.docs.amd.com/projects/HIPIFY/en/latest/tables/CUBLAS_API_supported_by_HIP.html
@etiennemlb Thanks a lot! As you can see, I already had `git checkout release/1.2.0` in my commands, so I was using the right branch.
I'm going to try again with https://github.com/ROCm/apex/commit/d89d4bd2bf15abda83113f2d0c846ff01fd90567.
`git checkout d89d4bd` works for me! Thank you so much @etiennemlb @netw0rkf10w!
@formiel Good news, but I still do not understand why the recipe given here does not work for you: https://dci.dci-gitlab.cines.fr/webextranet/software_stack/libraries/index.html#apex
Of course, you need to match the ROCm version to the apex version and to the torch version, since all of these are interlinked.
@etiennemlb The recipe in the link you sent works for me with PyTorch 2.4.1 and ROCm 6.0.0, but not with PyTorch 2.2.2 and ROCm 5.7. It's strange, though, because I didn’t encounter this error before, even when installing in another virtual environment with the same PyTorch 2.2.2 and ROCm 5.7 setup. I'm not sure what has changed, but I’m glad the installation is successful now with the commit you shared. Thanks again for your help!
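
When comparing environments like the two above, it can help to record exactly which torch/ROCm pair each one has. The sketch below assumes the version string carries a `+rocmX.Y` local suffix, as PyTorch ROCm wheels usually do; `split_torch_version` is a hypothetical helper, and a build without that suffix is reported as `rocm=unknown`.

```shell
# Hypothetical helper: split a PyTorch version string such as "2.2.2+rocm5.7"
# into the torch version and the ROCm version it was built against.
split_torch_version() {
    _full="$1"
    _torch="${_full%%+*}"          # text before the first "+"
    _suffix=""
    case "$_full" in
        *+*) _suffix="${_full#*+}" ;;
    esac
    _rocm="unknown"
    case "$_suffix" in
        rocm*) _rocm="${_suffix#rocm}" ;;
    esac
    printf 'torch=%s rocm=%s\n' "$_torch" "$_rocm"
}

# In a real environment the string would typically come from:
#   python -c "import torch; print(torch.__version__)"
split_torch_version "2.2.2+rocm5.7"
```

Running this in both virtual environments and comparing the output is a quick way to confirm that they really share the same setup before comparing apex build logs.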
Describe the Bug
When installing on an MI250x server with ROCm 5.7 and PyTorch 2.2.1, I obtained the following errors:
Minimal Steps/Code to Reproduce the Bug
```bash
git clone https://github.com/ROCm/apex.git
cd apex
git checkout release/1.2.0
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export GPU_ARCHS=gfx90a
export MAX_JOBS=8
pip install -v --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./ --user
```

**Expected Behavior**

Environment