ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Problems building apex with ROCm-5.4, 5.5, and 5.6 #115

Open adammoody opened 1 year ago

adammoody commented 1 year ago

**Describe the Bug**

The latest master branch fails to build with several ROCm versions, including 5.4, 5.5, and 5.6.

Rolling back to the commit made on June 20 (`git checkout 10c7482`) allows ROCm-5.4 to build. The build still fails for 5.5 and 5.6, but with a different error.

**Minimal Steps/Code to Reproduce the Bug**

For ROCm-5.4.3, I use the following to build:

```
virtualenv --system-site-packages env
source env/bin/activate
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm5.4.2
git clone --recursive https://github.com/ROCmSoftwarePlatform/apex.git
cd apex
export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.4.3
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```

The build fails when compiling ``csrc/mlp_hip.hip`` with errors like the following:

```
csrc/mlp_hip.hip:65:53: error: unknown type name 'hipblasOperation_t'; did you mean 'hipsparseOperation_t'?
static rocblas_operation hipOperationToRocOperation(hipblasOperation_t op)
                                                    ^~~~~~~~~~~~~~~~~~
                                                    hipsparseOperation_t
/opt/rocm-5.4.3/include/hipsparse/hipsparse.h:317:3: note: 'hipsparseOperation_t' declared here
} hipsparseOperation_t;
  ^
csrc/mlp_hip.hip:69:10: error: use of undeclared identifier 'HIPBLAS_OP_N'
    case HIPBLAS_OP_N:
         ^
csrc/mlp_hip.hip:71:10: error: use of undeclared identifier 'HIPBLAS_OP_T'
    case HIPBLAS_OP_T:
         ^
csrc/mlp_hip.hip:73:10: error: use of undeclared identifier 'HIPBLAS_OP_C'
    case HIPBLAS_OP_C:
         ^
csrc/mlp_hip.hip:79:8: error: unknown type name 'hipblasStatus_t'; did you mean 'hipsparseStatus_t'?
static hipblasStatus_t rocBLASStatusToHIPStatus(rocblas_status error)
       ^~~~~~~~~~~~~~~
       hipsparseStatus_t
/opt/rocm-5.4.3/include/hipsparse/hipsparse.h:188:3: note: 'hipsparseStatus_t' declared here
} hipsparseStatus_t;
  ^
```

Rolling back to the commit from June 20 allows the build to complete:

```
cd apex
git checkout 10c7482
git submodule init
git submodule update
export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.4.3
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```

Building apex from ``master`` with ROCm-5.5 and ROCm-5.6 fails with errors that are similar to each other but distinct from the ROCm-5.4 errors. Here are the steps I used to build with ROCm-5.6:

```
virtualenv --system-site-packages env
source env/bin/activate
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.6
git clone --recursive https://github.com/ROCmSoftwarePlatform/apex.git
cd apex
export DISTUTILS_DEBUG=1
export __HIP_PLATFORM_HCC__
export __HIP_PLATFORM_AMD__
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.6.0
export CC=gcc
export CXX=g++
pip3 install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./
```

That fails with the following error:

```
csrc/mlp_hip.hip:91:10: error: use of undeclared identifier 'rocblas_status_excluded_from_build'
    case rocblas_status_excluded_from_build:
         ^
csrc/mlp_hip.hip:104:10: error: use of undeclared identifier 'rocblas_status_arch_mismatch'; did you mean 'rocblas_status_size_query_mismatch'?
    case rocblas_status_arch_mismatch:
         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~
         rocblas_status_size_query_mismatch
/opt/rocm-5.6.0/include/rocblas/internal/rocblas-types.h:212:5: note: 'rocblas_status_size_query_mismatch' declared here
    rocblas_status_size_query_mismatch = 8, /**< Unmatched start/stop size query */
    ^
csrc/mlp_hip.hip:104:10: error: duplicate case value 'rocblas_status_size_query_mismatch'
    case rocblas_status_arch_mismatch:
         ^
csrc/mlp_hip.hip:96:10: note: previous case defined here
    case rocblas_status_size_query_mismatch:
         ^
```

In this case, rolling back to the June 20 commit fails with a different error:

```
csrc/mlp_hip.hip:89:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
      rocblas_datatype_f64_r,
      ^
csrc/mlp_hip.hip:92:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
      rocblas_datatype_f64_r,
      ^
csrc/mlp_hip.hip:96:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
      rocblas_datatype_f64_r,
      ^
csrc/mlp_hip.hip:99:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
      rocblas_datatype_f64_r,
      ^
csrc/mlp_hip.hip:101:7: error: use of undeclared identifier 'rocblas_datatype_f64_r'
      rocblas_datatype_f64_r,
      ^
csrc/mlp_hip.hip:102:7: error: use of undeclared identifier 'rocblas_gemm_algo_standard'
      rocblas_gemm_algo_standard,
      ^
```

Building with the June 20 commit, I see that the ``csrc/mlp_hip.hip`` file contains the following for ROCm-5.5 and ROCm-5.6 (which fails):

```
/* Includes, cuda */
#include
#include
```

but it has the following for ROCm-5.4 (which builds):

```
/* Includes, cuda */
#include
#include
```

**Expected Behavior**

**Environment**
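As a point of reference for the include difference noted above, one quick way to see which hipBLAS/rocBLAS header layout a given ROCm install provides is to list its include tree directly. This is only a sketch using the install prefixes from the reproduction steps; adjust the paths to match the actual installs:

```
# Flat layout (older style), if present:
ls /opt/rocm-5.4.3/include/hipblas.h /opt/rocm-5.4.3/include/rocblas.h
# Per-library subdirectories (newer style), if present:
ls /opt/rocm-5.4.3/include/hipblas/ /opt/rocm-5.4.3/include/rocblas/
ls /opt/rocm-5.6.0/include/hipblas/ /opt/rocm-5.6.0/include/rocblas/
```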

loadams commented 1 year ago

I'm seeing this as well: a number of errors like those above while building the cuda_ext.

```
/apex/csrc/mlp_hip.hip:65:53: error: unknown type name 'hipblasOperation_t'; did you mean 'hipsparseOperation_t'?
  static rocblas_operation hipOperationToRocOperation(hipblasOperation_t op)
```
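For what it's worth, the "unknown type name" errors suggest the hipBLAS header is not being pulled into ``mlp_hip.hip`` at all, rather than the type being absent from ROCm. A quick, hedged way to confirm that the installed headers do declare the type (paths assume the default /opt/rocm-5.6.0 install prefix used above):

```
# List any installed headers that declare hipblasOperation_t
grep -rl "hipblasOperation_t" /opt/rocm-5.6.0/include/ 2>/dev/null
```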
loadams commented 1 year ago

FYI @jithunnair-amd

hliuca commented 12 months ago

Hi @adammoody and @loadams, if you are using PyTorch 2.0 or earlier, please use the master branch of apex. If you are using PyTorch 2.1+, please use the torch_2.1_higher branch.
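A minimal sketch of that branch selection (branch names as given above, clone URL from the reproduction steps):

```
python3 -c "import torch; print(torch.__version__)"   # check which PyTorch is installed
git clone --recursive https://github.com/ROCmSoftwarePlatform/apex.git
cd apex
# For PyTorch 2.1+ use the torch_2.1_higher branch; for 2.0 or earlier, stay on master.
git checkout torch_2.1_higher
git submodule update --init --recursive
```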

There are some changes related to CUDA to HIP conversion in PyTorch.

The two commands `export __HIP_PLATFORM_HCC__` and `export __HIP_PLATFORM_AMD__` are not needed.
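With those two exports dropped, the environment setup from the reproduction steps above reduces to something like this (values as used earlier in this thread):

```
export DISTUTILS_DEBUG=1
export HCC_AMDGPU_TARGET=gfx90a
export PYTORCH_ROCM_ARCH=gfx90a
export ROCM_HOME=/opt/rocm-5.6.0   # or whichever ROCm install is being targeted
export CC=gcc
export CXX=g++
```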

I am not an apex developer.