ROCm / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

Fatal error: 'cuda_runtime_api.h' file not found #102

Open lvcc2018 opened 1 year ago

lvcc2018 commented 1 year ago

The following error occurs when I install Apex from source on my ROCm server (CentOS 7.6).

python setup.py install --cpp_ext --cuda_ext
In file included from /public/home/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_adam_kernel.hip:12:
In file included from /public/home/apex-master/csrc/multi_tensor_apply.cuh:3:
/public/home/.conda/envs/my_env/lib/python3.8/site-packages/torch/include/ATen/cuda/CUDAContext.h:5:10: fatal error: 'cuda_runtime_api.h' file not found
#include <cuda_runtime_api.h>
         ^~~~~~~~~~~~~~~~~~~~
1 error generated when compiling for gfx803.
error: command '/public/software/compiler/rocm/rocm-4.0.1/bin/hipcc' failed with exit code 1

It seems that it is building the 'distributed_adam_cuda' extension.

My environment:

Currently Loaded Modulefiles:
  1) compiler/devtoolset/7.3.1   2) compiler/rocm/4.0.1         3) mpi/hpcx/2.4.1/gcc-7.3.1
python                    3.8.13 
torch                     1.10.1+rocm4.0.1
lvcc2018 commented 1 year ago

Is nvcc (or any other dependency) necessary?

hubertlu-tw commented 1 year ago

Hi @lvcc2018, it seems that some files were not "hipified" properly. I recommend one of the following options:

  1. (Preferred) Use our published docker image: docker pull rocm/pytorch:latest-centos7 which has the latest prebuilt stable PyTorch and Apex. If you would like to make some changes to Apex, feel free to reinstall Apex from source.
  2. Uninstall ROCm 4.0.1, reinstall a newer version of ROCm and its dependencies, and then build Apex from source.
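
For reference, option 1 amounts to something like the following (the device flags are the ones typically used for ROCm containers; the clone URL and install command inside the container are assumptions based on the build command in this issue):

```shell
# Pull the prebuilt CentOS 7 image that already ships stable PyTorch and Apex
docker pull rocm/pytorch:latest-centos7

# Start a container with access to the GPU devices (usual ROCm device nodes)
docker run -it --device=/dev/kfd --device=/dev/dri \
    --group-add video rocm/pytorch:latest-centos7

# Only if you need to modify Apex, reinstall it from source inside the container:
#   git clone https://github.com/ROCmSoftwarePlatform/apex
#   cd apex && python setup.py install --cpp_ext --cuda_ext
```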
lvcc2018 commented 1 year ago

> Hi @lvcc2018, it seems that some files were not "hipified" properly. I recommend one of the following options:
>
>   1. (Preferred) Use our published docker image: docker pull rocm/pytorch:latest-centos7 which has the latest prebuilt stable PyTorch and Apex. If you would like to make some changes to Apex, feel free to reinstall Apex from source.
>   2. Uninstall ROCm 4.0.1, reinstall a newer version of ROCm and its dependencies, and then build Apex from source.

Thanks for your time. I agree that it is not hipifying properly. Unfortunately, I'm not allowed to use Docker or a newer version of ROCm. I noticed that all the .cu files are skipped; is that normal?

/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/multihead_attn/dropout_hip.cuh -> None ignored
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/multihead_attn/layer_norm_hip.cuh -> None ignored
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/multihead_attn/strided_batched_gemm_hip.cuh -> None ignored
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/nccl_p2p/nccl_p2p_cuda.cuh -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/nccl_p2p/nccl_p2p_cuda.cuh ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/nccl_p2p/nccl_p2p.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/nccl_p2p/nccl_p2p.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/nccl_p2p/nccl_p2p_cuda.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/nccl_p2p/nccl_p2p_hip.hip skipped
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_adam_cuda.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_adam_cuda.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_adam_cuda_kernel.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_adam_hip_kernel.hip skipped
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_lamb_cuda.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_lamb_cuda.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_lamb_cuda_kernel.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/fused_lamb_hip_kernel.hip skipped
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_adam.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_adam.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_adam_kernel.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_adam_kernel.hip skipped
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_lamb.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_lamb.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_lamb_kernel.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/optimizers/multi_tensor_distopt_lamb_kernel.hip skipped
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_joint_kernel.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_joint_kernel.hip skipped
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_joint.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_joint.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_loss.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_loss.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_loss_kernel.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/transducer/transducer_loss_kernel.hip skipped
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/xentropy/interface.cpp -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/xentropy/interface.cpp ok
/public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/xentropy/xentropy_kernel.cu -> /public/home/ach2ha8oau/megatron-deepspeed/apex-master/apex/contrib/csrc/xentropy/xentropy_kernel.hip skipped
Successfully preprocessed all matching files.
Total number of unsupported CUDA function calls: 0
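
The fatal error above comes from a hipified source still pulling in a CUDA header. One way to narrow that down is to scan the generated files for leftover CUDA includes that hipify should have rewritten to HIP equivalents. A minimal sketch (the scanned path and file-extension list are illustrative, not part of Apex's tooling):

```python
import re
from pathlib import Path

# CUDA headers that hipify normally rewrites to HIP equivalents
CUDA_INCLUDE = re.compile(
    r'#include\s*[<"](cuda_runtime_api\.h|cuda_runtime\.h|cuda\.h)[>"]'
)

def find_unhipified(root):
    """Return (file, line_no, line) tuples for leftover CUDA includes
    in sources under `root`."""
    hits = []
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix not in {".hip", ".cuh", ".cpp", ".h"}:
            continue
        for no, line in enumerate(path.read_text(errors="ignore").splitlines(), 1):
            if CUDA_INCLUDE.search(line):
                hits.append((str(path), no, line.strip()))
    return hits

if __name__ == "__main__":
    # Point this at your own apex checkout; the path here is illustrative
    for f, no, line in find_unhipified("apex/contrib/csrc"):
        print(f"{f}:{no}: {line}")
```

Any file this reports still includes a CUDA runtime header after hipification and would fail to compile with hipcc exactly as shown above.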
jeffdaily commented 1 year ago

Which version of pytorch do you have installed?

lvcc2018 commented 1 year ago

> Which version of pytorch do you have installed?

It's 1.10.1+rocm4.0.1.