ROCm / rocPRIM

ROCm Parallel Primitives
https://rocm.docs.amd.com/projects/rocPRIM/
MIT License

error: invalid operands to binary expression #311

Closed by throwm8 2 years ago

throwm8 commented 2 years ago

Describe the bug: I'm trying to build PyTorch with ROCm support, but I'm getting an error related to rocPRIM, specifically in the file device_scan.hpp.

In file included from /media/nvme/scratch/yay/python-pytorch-rocm/src/pytorch-1.10.2-rocm/aten/src/ATen/native/hip/IndexKernel.hip:13:
In file included from /media/nvme/scratch/yay/python-pytorch-rocm/src/pytorch-1.10.2-rocm/aten/src/ATen/hip/cub.cuh:26:
In file included from /opt/rocm/include/hipcub/hipcub.hpp:36:
In file included from /opt/rocm/include/hipcub/backend/rocprim/hipcub.hpp:77:
In file included from /opt/rocm/include/hipcub/backend/rocprim/device/device_run_length_encode.hpp:35:
In file included from /opt/rocm/include/rocprim/device/device_run_length_encode.hpp:37:
In file included from /opt/rocm/include/rocprim/device/device_select.hpp:33:
/opt/rocm/include/rocprim/device/device_scan.hpp:531:27: error: invalid operands to binary expression ('at::cuda::cub::impl::chained_iterator<long, unsigned char *>' and 'size_t' (aka 'unsigned long'))
                    input + offset, output + offset, current_size, initial_value,
                    ~~~~~ ^ ~~~~~~
/opt/rocm/hip/include/hip/amd_detail/amd_hip_runtime.h:270:87: note: expanded from macro 'hipLaunchKernelGGL'
#define hipLaunchKernelGGL(kernelName, ...)  hipLaunchKernelGGLInternal((kernelName), __VA_ARGS__)
                                                                                      ^~~~~~~~~~~
/opt/rocm/hip/include/hip/amd_detail/amd_hip_runtime.h:267:78: note: expanded from macro 'hipLaunchKernelGGLInternal'
        kernelName<<<(numBlocks), (numThreads), (memPerBlock), (streamId)>>>(__VA_ARGS__);         \
                                                                             ^~~~~~~~~~~

To Reproduce: Building PyTorch with ROCm support with the variable PYTORCH_ROCM_ARCH=gfx1030 set should trigger this issue.

Expected behavior: PyTorch should build without any errors.

Environment: environment.txt is attached.

Thanks.

Maetveis commented 2 years ago

Arch Linux is not an officially supported platform; please open an issue at https://github.com/rocm-arch/rocm-arch/issues.

But I think https://github.com/rocm-arch/rocm-arch/blob/master/rocm-core/PKGBUILD is the culprit, because it is still at 4.5.2.

throwm8 commented 2 years ago

I manually modified the rocm-core PKGBUILD and changed the version to 5.0.0 before trying to compile PyTorch. It does seem to detect ROCm successfully, so there might be another cause.

***** ROCm version from /opt/rocm/.info/version-dev ****

ROCM_VERSION_DEV: 5.0.0}
ROCM_VERSION_DEV_MAJOR: 5
ROCM_VERSION_DEV_MINOR: 0
ROCM_VERSION_DEV_PATCH: 0}
ROCM_VERSION_DEV_INT:   50000
HIP_VERSION_MAJOR: 5
HIP_VERSION_MINOR: 0
TORCH_HIP_VERSION: 500

***** Library versions from dpkg *****

***** Library versions from cmake find_package *****

-- hip::amdhip64 is SHARED_LIBRARY
hip VERSION: 5.0.22066
hsa-runtime64 VERSION: 1.5.0
amd_comgr VERSION: 2.4.0
rocrand VERSION: 2.10.9
hiprand VERSION: 2.10.9
-- hip::amdhip64 is SHARED_LIBRARY
rocblas VERSION: 2.42.0
-- hip::amdhip64 is SHARED_LIBRARY
miopen VERSION: 2.14.0
-- hip::amdhip64 is SHARED_LIBRARY
hipfft VERSION: 1.0.5
-- hip::amdhip64 is SHARED_LIBRARY
hipsparse VERSION: 1.11.2
-- hip::amdhip64 is SHARED_LIBRARY
rccl VERSION: 2.10.3
-- hip::amdhip64 is SHARED_LIBRARY
rocprim VERSION: 2.10.9
-- hip::amdhip64 is SHARED_LIBRARY
hipcub VERSION: 2.10.12
-- hip::amdhip64 is SHARED_LIBRARY
rocthrust VERSION: 2.10.9
ROCm version >= 4.1; enabling asserts
HIP library name: amdhip64
ROCm is enabled.
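The integer versions in this log look like they are packed from the dotted components: 5.0.0 becomes 50000 and HIP 5.0 becomes 500. Below is a minimal sketch of that arithmetic, assuming major*10000 + minor*100 + patch for ROCM_VERSION_DEV_INT and major*100 + minor for TORCH_HIP_VERSION. The helper names are hypothetical and are not taken from PyTorch's build scripts.

```cpp
#include <cassert>

// Hypothetical helper: packs a dotted ROCm version into one integer,
// assuming the scheme major*10000 + minor*100 + patch seen in the log.
constexpr int rocm_version_int(int major, int minor, int patch) {
    return major * 10000 + minor * 100 + patch;
}

// Hypothetical helper: packs a HIP major.minor pair, assuming major*100 + minor.
constexpr int torch_hip_version(int major, int minor) {
    return major * 100 + minor;
}

// The values reported in the log above.
static_assert(rocm_version_int(5, 0, 0) == 50000, "5.0.0 packs to 50000");
static_assert(torch_hip_version(5, 0) == 500, "HIP 5.0 packs to 500");
```

Under the same assumption, a 4.5.2 install would pack to 40502, which is how a stale version file could end up selecting a different code path than a 5.0.0 install.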

I know Arch is not officially supported but is it normal for GCC to complain about a library file like this?

Maetveis commented 2 years ago

https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/cuda/cub.cuh#L195-L203

The error that the compiler gives is coming from a path that is supposed to be disabled on ROCM above 5.0. I can't say if the version detection or something else is going astray, but you can try debugging why the preprocessor is not happy.
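One way to check what the preprocessor sees is to compile a tiny probe with the same definitions as the failing file. The sketch below assumes the guard compares a ROCM_VERSION macro against 50000; the exact condition in cub.cuh is not reproduced here, and the names `legacy_path_enabled` and `kLegacyPathInThisBuild` are hypothetical.

```cpp
#include <cassert>

// Assumption: the build passes ROCM_VERSION as a compile definition.
// Fall back to 0 so this probe also builds when the macro is missing,
// which is itself a case worth detecting.
#ifndef ROCM_VERSION
#define ROCM_VERSION 0
#endif

// Hypothetical model of the guard: the old code path stays enabled
// only when the detected ROCm version is below the assumed threshold.
constexpr bool legacy_path_enabled(int rocm_version) {
    return rocm_version < 50000;
}

// What this particular translation unit would select.
constexpr bool kLegacyPathInThisBuild = legacy_path_enabled(ROCM_VERSION);
```

Printing or asserting on `kLegacyPathInThisBuild` in a file compiled with the project's flags shows whether the version macro actually reaches the compiler.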

> I know Arch is not officially supported but is it normal for GCC to complain about a library file like this?

I don't know exactly what you mean here, but to compile HIP source code you must use hipcc or clang from /opt/rocm/llvm/bin.

throwm8 commented 2 years ago

> The error that the compiler gives is coming from a path that is supposed to be disabled on ROCM above 5.0.

I see, then it probably has something to do with my system being misconfigured in some way, like you suggested. This is the cmake configure output before the error.

-- ******** Summary ********
-- General:
--   CMake version         : 3.22.2
--   CMake command         : /usr/bin/cmake
--   System                : Linux
--   C++ compiler          : /usr/bin/c++
--   C++ compiler id       : GNU
--   C++ compiler version  : 11.2.1
--   Using ccache if found : ON
--   Found ccache          : CCACHE_PROGRAM-NOTFOUND
--   CXX flags             : -march=znver2 -mtune=znver2 -O2 -pipe -fno-plt -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -DEDGE_PROFILER_USE_KINETO -O2 -fPIC -Wno-narrowing -Wall -Wextra -Werror=return-type -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-unused-local-typedefs -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow
--   Build type            : Release
--   Compile definitions   : TH_BLAS_MKL;ROCM_VERSION=50000;TORCH_HIP_VERSION=500;ONNX_ML=1;ONNXIFI_ENABLE_EXT=1;ONNX_NAMESPACE=onnx_torch;IDEEP_USE_MKL;HAVE_MMAP=1;_FILE_OFFSET_BITS=64;HAVE_SHM_OPEN=1;HAVE_SHM_UNLINK=1;HAVE_MALLOC_USABLE_SIZE=1;USE_EXTERNAL_MZCRC;MINIZ_DISABLE_ZIP_READER_CRC32_CHECKS
--   CMAKE_PREFIX_PATH     : /usr/lib/python3.10/site-packages
--   CMAKE_INSTALL_PREFIX  : /media/nvme/scratch/yay/python-pytorch-rocm/src/pytorch-1.10.2-rocm/torch
--   USE_GOLD_LINKER       : OFF
-- 
--   TORCH_VERSION         : 1.10.2
--   CAFFE2_VERSION        : 1.10.2
--   BUILD_CAFFE2          : ON
--   BUILD_CAFFE2_OPS      : ON
--   BUILD_CAFFE2_MOBILE   : OFF
--   BUILD_STATIC_RUNTIME_BENCHMARK: OFF
--   BUILD_TENSOREXPR_BENCHMARK: OFF
--   BUILD_BINARY          : ON
--   BUILD_CUSTOM_PROTOBUF : OFF
--     Protobuf compiler   : /usr/bin/protoc
--     Protobuf includes   : /usr/include
--     Protobuf libraries  : /usr/lib/libprotobuf.so
--   BUILD_DOCS            : OFF
--   BUILD_PYTHON          : True
--     Python version      : 3.10.2
--     Python executable   : /usr/bin/python
--     Pythonlibs version  : 3.10.2
--     Python library      : /usr/lib/libpython3.10.so.1.0
--     Python includes     : /usr/include/python3.10
--     Python site-packages: lib/python3.10/site-packages
--   BUILD_SHARED_LIBS     : ON
--   CAFFE2_USE_MSVC_STATIC_RUNTIME     : OFF
--   BUILD_TEST            : True
--   BUILD_JNI             : OFF
--   BUILD_MOBILE_AUTOGRAD : OFF
--   BUILD_LITE_INTERPRETER: OFF
--   INTERN_BUILD_MOBILE   : 
--   USE_BLAS              : 1
--     BLAS                : mkl
--   USE_LAPACK            : 1
--     LAPACK              : mkl
--   USE_ASAN              : OFF
--   USE_CPP_CODE_COVERAGE : OFF
--   USE_CUDA              : 0
--   USE_ROCM              : ON
--   USE_EIGEN_FOR_BLAS    : 
--   USE_FBGEMM            : ON
--     USE_FAKELOWP          : OFF
--   USE_KINETO            : ON
--   USE_FFMPEG            : ON
--   USE_GFLAGS            : ON
--   USE_GLOG              : ON
--   USE_LEVELDB           : OFF
--   USE_LITE_PROTO        : OFF
--   USE_LMDB              : OFF
--   USE_METAL             : OFF
--   USE_PYTORCH_METAL     : OFF
--   USE_PYTORCH_METAL_EXPORT     : OFF
--   USE_FFTW              : OFF
--   USE_MKL               : ON
--   USE_MKLDNN            : ON
--   USE_MKLDNN_ACL        : OFF
--   USE_MKLDNN_CBLAS      : OFF
--   USE_NCCL              : ON
--     USE_SYSTEM_NCCL     : ON
--   USE_NNPACK            : ON
--   USE_NUMPY             : ON
--   USE_OBSERVERS         : ON
--   USE_OPENCL            : OFF
--   USE_OPENCV            : ON
--     OpenCV version      : 4.5.5
--   USE_OPENMP            : ON
--   USE_TBB               : OFF
--   USE_VULKAN            : OFF
--   USE_PROF              : OFF
--   USE_QNNPACK           : ON
--   USE_PYTORCH_QNNPACK   : ON
--   USE_REDIS             : OFF
--   USE_ROCKSDB           : OFF
--   USE_ZMQ               : OFF
--   USE_DISTRIBUTED       : ON
--     USE_MPI               : ON
--     USE_GLOO              : ON
--     USE_GLOO_WITH_OPENSSL : OFF
--     USE_TENSORPIPE        : ON
--   USE_DEPLOY           : OFF
--   USE_BREAKPAD         : ON
--   Public Dependencies  : Threads::Threads;caffe2::mkl;glog::glog;caffe2::mkldnn
--   Private Dependencies : pthreadpool;cpuinfo;qnnpack;pytorch_qnnpack;nnpack;XNNPACK;fbgemm;/usr/lib/libnuma.so;opencv_core;opencv_highgui;opencv_imgproc;opencv_imgcodecs;opencv_optflow;opencv_videoio;opencv_video;/usr/lib/libavcodec.so;/usr/lib/libavformat.so;/usr/lib/libavutil.so;/usr/lib/libswscale.so;/usr/lib/libswresample.so;fp16;/usr/lib/openmpi/libmpi_cxx.so;/usr/lib/openmpi/libmpi.so;gloo;tensorpipe;aten_op_header_gen;foxi_loader;rt;fmt::fmt-header-only;kineto;gcc_s;gcc;dl
--   USE_COREML_DELEGATE     : OFF
-- Configuring done
-- Generating done

USE_ROCM is set and ROCM_VERSION equals 50000, so I don't really know what is going on. Admittedly I'm not experienced at all with ROCm or PyTorch in general, but is it normal that cmake is using G++ instead of clang in the log above? Is there anything you can recommend that can help debug the issue?

> I don't know exactly what you mean here, but to compile HIP source code you must use hipcc or clang from /opt/rocm/llvm/bin.

I was trying to say that the compiler is complaining about a function in a header file provided by rocprim and not how it's being used in pytorch (that's what I understood), which is why I thought that the problem might be related to rocprim.

Maetveis commented 2 years ago

> it normal that cmake is using G++ instead of clang in the log above

I'm not familiar with PyTorch, but I think it should be hipcc or clang (from the AMD LLVM repo). You should try setting CXX in the environment to hipcc, or to /opt/rocm/bin/hipcc if hipcc is not on the PATH.

> Is there anything you can recommend that can help debug the issue?

Other than changing the compiler, no; maybe post this to PyTorch. EDIT: Are you compiling the latest release or building from master? What I said about the code path being disabled above ROCm 5 only applies to master; it is not in the v1.10.2 release. This is the pull request that added it: https://github.com/pytorch/pytorch/pull/68487.

> the compiler is complaining about a function in a header file provided by rocprim and not how it's being used in pytorch

Unfortunately this is quite common in templated C++ code. You should try looking for messages like "Note: required from ..." and "Note: instantiated by: ..." to find where the function was called from.
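To illustrate why the diagnostic lands inside a library header, here is a minimal sketch. It assumes, as the error text suggests, that the failure comes from an iterator type with no `operator+` accepting a `size_t`. The names `read_at` and `byte_iterator` are hypothetical stand-ins, not rocPRIM or PyTorch code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a library routine: it advances whatever
// iterator the caller passes in. The expression `input + offset` is only
// checked when the template is instantiated with a concrete type.
template <class InputIt>
auto read_at(InputIt input, std::size_t offset) {
    return *(input + offset);
}

// Hypothetical caller-side wrapper iterator.
struct byte_iterator {
    const unsigned char* ptr;

    unsigned char operator*() const { return *ptr; }

    // If this overload is removed, read_at(byte_iterator{...}, n) no longer
    // compiles, and the "invalid operands to binary expression" error is
    // reported at the `input + offset` line inside read_at, not at the call.
    byte_iterator operator+(std::size_t n) const {
        return byte_iterator{ptr + n};
    }
};

// Small usage check: both a raw pointer and the wrapper work with read_at.
inline void check_read_at() {
    const unsigned char data[] = {10, 20, 30};
    assert(read_at(data, 1) == 20);
    assert(read_at(byte_iterator{data}, 2) == 30);
}
```

The fix in such a case belongs to whoever owns the iterator type or the call site, which is why the instantiation notes that follow the error are usually more informative than the first line.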

throwm8 commented 2 years ago

Thanks for the hints. I tried setting CXX to hipcc, but that caused a build failure as well. From what I can tell, GCC is used for the CPU-related parts and hipcc/ROCm's clang is used for the GPU-relevant parts, so that probably wasn't the problem.

Anyway, I did some digging and the piece of code you mentioned was missing from the files I have. I was trying to get rocm-arch's PKGBUILD for PyTorch to work; it clones the repository, but with the argument #tag=1.10.2. I cloned PyTorch locally to check the commit's tag, and for some reason it didn't have one. I then built PyTorch manually with ROCm support enabled and it built without any errors. Thank you very much for your help; I wouldn't have been able to solve this issue of mine without it.

Since this problem doesn't seem to be caused by rocPRIM, I'll close the issue.