ValeevGroup / tiledarray

A massively-parallel, block-sparse tensor framework written in C++
GNU General Public License v3.0

Is Tiledarray ready for SYCL conversion? #427

Open victor-anisimov opened 9 months ago

victor-anisimov commented 9 months ago

I'm trying to compile the latest commit d0c0dedf529dbc7af2d886ec3594b032f86ae6c4 of TiledArray for the CUDA platform (V100) and get a compiler error:

[ 95%] Building CXX object src/CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o
cd /home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/src && /soft/packaging/spack-builds/linux-opensuse_leap15-x86_64/gcc-10.2.0/gcc-10.2.0-yudlyezca7twgd5o3wkkraur7wdbngdn/bin/g++ -DBOOST_ALL_NO_LIB -DBOOST_SERIALIZATION_DYN_LINK -DBTAS_ASSERT_THROWS=1 -DBTAS_DEFAULT_TARGET_MAX_INDEX_RANK=6 -DBTAS_HAS_BLAS_LAPACK=1 -DBTAS_HAS_BOOST_CONTAINER=1 -DBTAS_HAS_BOOST_ITERATOR=1 -DBTAS_HAS_BOOST_SERIALIZATION=1 -DBTAS_HAS_CBLAS=1 -DBTAS_HAS_INTEL_MKL=1 -DBTAS_HAS_LAPACKE=1 -DLAPACK_COMPLEX_CPP=1 -DLIBRETT_USES_CUDA=1 -DMADNESS_DISABLE_WORLD_GET_DEFAULT=1 -DMADNESS_LINALG_USE_LAPACKE=1 -DMADNESS_MPI_HEADER=\"/soft/restricted/CNDA/mpich/drop51.2/mpich-ofi-sockets-icc-default-gpu-drop51/include/mpi.h\" -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX -D_MPICC_H -I/home/vanisimov/tiledarray/master/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/src -I/soft/compilers/cuda/cuda-12.0.0/targets/x86_64-linux/include -I/soft/restricted/CNDA/updates/2023.05.15.001/oneapi/mkl/2023u2_20230512/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-src/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-build/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-build/src/madness/world -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-src/src/madness/world -I/soft/restricted/CNDA/mpich/drop51.2/mpich-ofi-sockets-icc-default-gpu-drop51/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/eigen-src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/btas-src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/blaspp-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/blaspp-src/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/lapackpp-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/lapackpp-src/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src/tpl -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src/tpl/umpire/camp/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-build/src/tpl/umpire/camp/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/librett-src/src -isystem /soft/compilers/cuda/cuda-12.0.0/include -I/home/vanisimov/tiledarray/install/include -std=c++17 -D__TBB_LEGACY_MODE=1 -DTBB_SUPPRESS_DEPRECATED_MESSAGES=1 -I/home/vanisimov/tiledarray/install/include -O0 -g -Wall -ftemplate-backtrace-limit=0 -std=c++17 -fPIC -I/home/vanisimov/tiledarray/install/include -std=c++17 -D__TBB_LEGACY_MODE=1 -DTBB_SUPPRESS_DEPRECATED_MESSAGES=1 -fopenmp -MD -MT src/CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o -MF CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o.d -o CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o -c /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/utility.h:33,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:30,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:30,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tiled_range1.h: In static member function 'static TiledArray::TiledRange1 TiledArray::TiledRange1::make_uniform(std::size_t, std::size_t)':
/home/vanisimov/tiledarray/master/src/TiledArray/tiled_range1.h:268:41: warning: comparison of integer expressions of different signedness: 'long int' and 'std::size_t' {aka 'long unsigned int'} [-Wsign-compare]
  268 | for (auto i = num_avg_plus_one; i < ntiles;
      |                                   ^~
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h: In instantiation of 'TiledArray::Tensor<T, A>::Tensor(const T1&, const Perm&) [with T1 = TiledArray::Tensor; Perm = TiledArray::Permutation; typename std::enable_if<(is_nested_tensor_v && is_permutation_v)>::type = 0; T = float; Allocator = Eigen::aligned_allocator]':
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:1279:16:   required from 'TiledArray::Tensor<T, A> TiledArray::Tensor<T, A>::permute(const Perm&) const [with Perm = TiledArray::Permutation; = void; T = float; Allocator = Eigen::aligned_allocator]'
/home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:1201:44:   required from 'TiledArray::SparseShape::SparseShape TiledArray::SparseShape::perm(const TiledArray::Permutation&) const [with T = float; TiledArray::SparseShape::SparseShape_ = TiledArray::SparseShape]'
/home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:30:16:   required from here
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:404:24: error: no matching function for call to 'tensor_init(, TiledArray::Permutation, TiledArray::Tensor&, const TiledArray::Tensor&)'
  404 | detail::tensor_init(value_converter, outer(perm),
      | ~~~~~^~~~~~~~~~~~~
  405 | this, other);
      | ~~~~~
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:30,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:449:13: note: candidate: 'template<class Op, class TR, class ... Ts, typename std::enable_if<(TiledArray::detail::is_tensor<T2, Ts ...>::value && TiledArray::detail::is_contiguous_tensor<T2, Ts ...>::value)>::type > void TiledArray::detail::tensor_init(Op&&, TR&, const Ts& ...)'
  449 | inline void tensor_init(Op&& op, TR& result, const Ts&... tensors) {
      |             ^~~
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:449:13: note: template argument deduction/substitution failed:
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
                 from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:404:24: note: couldn't deduce template parameter 'Op'
  404 | detail::tensor_init(value_converter, outer(perm),
      | ~~~~~^~~~~~~~~~~~~
  405 | this, other);
      | ~~~~~

evaleev commented 9 months ago

@victor-anisimov I think this is an issue due to a regression in gcc 10.2 ... do you have 10.3 available? That's the main compiler we use on our cluster; it should work.

victor-anisimov commented 9 months ago

gcc/11.1.0 worked for me. Thank you very much for the suggestion! However, I see many tests failing.

        Start  1: madness/test/world/build
  1/26 Test #1: madness/test/world/build ................... Passed 53.98 sec
        Start  2: madness/test/world/test_prof/run
  2/26 Test #2: madness/test/world/test_prof/run ...........***Failed 1.19 sec
        Start  3: madness/test/world/test_ar/run
  3/26 Test #3: madness/test/world/test_ar/run .............***Failed 0.64 sec
        Start  4: madness/test/world/test_hashdc/run
  4/26 Test #4: madness/test/world/test_hashdc/run .........***Failed 0.64 sec
        Start  5: madness/test/world/test_hello/run
  5/26 Test #5: madness/test/world/test_hello/run ..........***Failed 0.64 sec
        Start  6: madness/test/world/test_atomicint/run
  6/26 Test #6: madness/test/world/test_atomicint/run ......***Failed 0.64 sec
        Start  7: madness/test/world/test_future/run
  7/26 Test #7: madness/test/world/test_future/run .........***Failed 0.64 sec
        Start  8: madness/test/world/test_future2/run
  8/26 Test #8: madness/test/world/test_future2/run ........***Failed 0.64 sec
        Start  9: madness/test/world/test_future3/run
  9/26 Test #9: madness/test/world/test_future3/run ........***Failed 0.64 sec
        Start 10: madness/test/world/test_dc/run
 10/26 Test #10: madness/test/world/test_dc/run .............***Failed 0.64 sec
        Start 11: madness/test/world/test_hashthreaded/run
 11/26 Test #11: madness/test/world/test_hashthreaded/run ...***Failed 0.64 sec
        Start 12: madness/test/world/test_queue/run
 12/26 Test #12: madness/test/world/test_queue/run ..........***Failed 0.63 sec
        Start 13: madness/test/world/test_world/run
 13/26 Test #13: madness/test/world/test_world/run ..........***Failed 0.66 sec
        Start 14: madness/test/world/test_worldprofile/run
 14/26 Test #14: madness/test/world/test_worldprofile/run ...***Failed 0.68 sec
        Start 15: madness/test/world/test_binsorter/run
 15/26 Test #15: madness/test/world/test_binsorter/run ......***Failed 0.64 sec
        Start 16: madness/test/world/test_vector/run
 16/26 Test #16: madness/test/world/test_vector/run ......... Passed 0.35 sec
        Start 17: madness/test/world/test_worldptr/run
 17/26 Test #17: madness/test/world/test_worldptr/run .......***Failed 0.92 sec
        Start 18: madness/test/world/test_worldref/run
 18/26 Test #18: madness/test/world/test_worldref/run .......***Failed 0.87 sec
        Start 19: madness/test/world/test_stack/run
 19/26 Test #19: madness/test/world/test_stack/run .......... Passed 0.12 sec
        Start 20: madness/test/world/test_googletest/run
 20/26 Test #20: madness/test/world/test_googletest/run ..... Passed 0.66 sec
        Start 21: madness/test/world/test_tree/run
 21/26 Test #21: madness/test/world/test_tree/run ...........***Failed 0.64 sec
        Start 22: btas/unit/build
 22/26 Test #22: btas/unit/build ............................ Passed 54.24 sec
        Start 23: btas/unit/run
 23/26 Test #23: btas/unit/run .............................. Passed 215.30 sec
        Start 24: tiledarray/unit/build
 24/26 Test #24: tiledarray/unit/build ...................... Passed 1082.45 sec
        Start 25: tiledarray/unit/run-np-1
 25/26 Test #25: tiledarray/unit/run-np-1 ...................***Failed 1.43 sec
        Start 26: tiledarray/unit/run-np-2
 26/26 Test #26: tiledarray/unit/run-np-2 ...................***Failed 0.75 sec

27% tests passed, 19 tests failed out of 26

evaleev commented 9 months ago

I recommend running only the TA-specific tests: build the 'check-tiledarray' target.

It looks like the failing tests are MADNESS tests. This likely means that MPI programs on your system cannot be launched directly ... do you need srun?

victor-anisimov commented 9 months ago

Yes, you are right! My MPI environment is broken. It crashes even on hello world. I'll rerun the tests on Polaris at some later time.

I have one more question, though. I saw the change in the directory structure from cpu/cuda to host/device. Also, a number of cuda suffixes in file names have been changed to the device suffix, which is great. However, I can still see cpu_cuda_vector in src/TiledArray/device. Are you going to rename it?

For HIP/SYCL, you would also need to rename .cu files to .cpp and modify the CMake scripts accordingly:

    ./src/TiledArray/device/cpu_cuda_vector.cu
    ./src/TiledArray/device/kernel/thrust/mult_kernel.cu
    ./src/TiledArray/device/kernel/thrust/reduce_kernel.cu
    ./src/TiledArray/device/um_storage.cu

Do you plan to make that change or keep the file extensions as they are? Thanks!

evaleev commented 9 months ago

> I have one more question, though. I saw the change in the directory structure from cpu/cuda to host/device. Also, a number of cuda suffixes in file names have been changed to the device suffix, which is great. However, I can still see cpu_cuda_vector in src/TiledArray/device. Are you going to rename it?

The problem with cpu_cuda_vector is that it requires Thrust. Initially I was not sure whether Thrust would be a viable option; it seems that ROCm now provides it, but I'm not so sure what the SYCL story is on this front (it is probably just standard C++ there?).

In any case, since cpu_cuda_vector is not actually used, I do not plan to maintain it (I will revive it if needed). It should probably be marked deprecated.

> For HIP/SYCL, you would also need to rename .cu files to .cpp and modify the CMake scripts accordingly:
>
>     ./src/TiledArray/device/cpu_cuda_vector.cu
>     ./src/TiledArray/device/kernel/thrust/mult_kernel.cu
>     ./src/TiledArray/device/kernel/thrust/reduce_kernel.cu
>     ./src/TiledArray/device/um_storage.cu

See above re: cpu_cuda_vector.cu. Same story for um_storage.cu, actually: it is unused and can be deprecated.

The rest are CUDA-specific implementation files ... they are actually portable, but CMake requires a separate file for each device-specific implementation, so notice that I just #include mult_kernel.cu into mult_kernel.hip (using an implementation header would probably be better, but that is a level of pedantry I don't have time for).
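
For concreteness, here is a minimal sketch of that wrapper pattern. The file names follow the ones mentioned above, but the function name, signature, and contents are illustrative assumptions rather than the actual TiledArray sources:

```cpp
// --- mult_kernel.cu (sketch): Thrust-based implementation, built as CUDA ---
#include <cstddef>
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/transform.h>

// Hypothetical kernel: result[i] *= arg[i] for i in [0, n).
// (The real TiledArray kernels also thread a stream through the execution policy.)
template <typename T>
void mult_to_kernel_impl(T* result, const T* arg, std::size_t n) {
  thrust::transform(thrust::device,
                    thrust::device_pointer_cast(arg),
                    thrust::device_pointer_cast(arg) + n,
                    thrust::device_pointer_cast(result),
                    thrust::device_pointer_cast(result),
                    thrust::multiplies<T>{});
}

// --- mult_kernel.hip (sketch): built as HIP, reusing the CUDA source verbatim ---
// #include "mult_kernel.cu"
```

Because rocThrust exposes the same Thrust API, the same source compiles for HIP, which is what makes the #include trick work.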

For SYCL you would need to implement analogs of thrust/{mult,reduce}_kernel.{cu,hip} somewhere, though not in the thrust subdirectory if they are not implemented using Thrust.

victor-anisimov commented 9 months ago

It is up to you whether you prefer keeping a separate file for each device-specific implementation instead of keeping the analogous kernels next to each other in one file, separated by compiler preprocessor directives, as is done in LibreTT. Seeing all three platform-specific kernels in one place makes it easier to propagate a change applied to one kernel to the other two, which simplifies maintenance. The thrust directory basically holds some vector operations. Intel provides similar functionality to Thrust in the dpl library, and I have a quite similar-looking SYCL implementation for these kernels.

Compare

    thrust::multiplies mul_op;
    thrust::transform(
        thrust::cuda::par.on(stream),
        thrust::device_pointer_cast(arg),
        thrust::device_pointer_cast(arg) + n,
        thrust::device_pointer_cast(result),
        thrust::device_pointer_cast(result),
        mul_op);

to

    std::multiplies mul_op;
    std::transform(
        dpl::execution::make_device_policy(*stream),
        dpct::get_device_pointer(arg),
        dpct::get_device_pointer(arg) + n,
        dpct::get_device_pointer(result),
        dpct::get_device_pointer(result),
        mul_op);

If that is not an incentive to make the kernel API more general (more vendor-independent), then so be it.
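
To illustrate the LibreTT-style single-source approach being advocated here, a rough sketch of what a preprocessor-guarded multiply kernel could look like. The macro names, function name, and signature are made up for the example, and the SYCL path assumes the pointers are USM allocations:

```cpp
#include <cstddef>

#if defined(EXAMPLE_CUDA) || defined(EXAMPLE_HIP)
#include <thrust/device_ptr.h>
#include <thrust/execution_policy.h>
#include <thrust/functional.h>
#include <thrust/transform.h>
#elif defined(EXAMPLE_SYCL)
#include <oneapi/dpl/execution>
#include <oneapi/dpl/algorithm>
#include <sycl/sycl.hpp>
#include <functional>
#endif

// result[i] *= arg[i] for i in [0, n); Exec is a CUDA/HIP stream or a SYCL queue.
template <typename T, typename Exec>
void mult_to_kernel(T* result, const T* arg, std::size_t n, Exec& exec) {
#if defined(EXAMPLE_CUDA) || defined(EXAMPLE_HIP)
  // Thrust path, identical for CUDA and HIP (rocThrust). This sketch does not
  // wire the stream into the execution policy; the real code would.
  (void)exec;
  thrust::transform(thrust::device,
                    thrust::device_pointer_cast(arg),
                    thrust::device_pointer_cast(arg) + n,
                    thrust::device_pointer_cast(result),
                    thrust::device_pointer_cast(result),
                    thrust::multiplies<T>{});
#elif defined(EXAMPLE_SYCL)
  // oneDPL path: raw USM pointers are valid iterators for device policies.
  oneapi::dpl::transform(oneapi::dpl::execution::make_device_policy(exec),
                         arg, arg + n, result, result, std::multiplies<T>{});
#endif
}
```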

evaleev commented 5 months ago

status quo:

Because both CUDA and HIP provide the same Thrust API, there are no platform-dependent code blocks there. I agree with your arguments, and I'm fine with adding DPC++-specific extensions to these headers. I believe that the "kernel" functions will be platform independent (e.g., mult_kernel calls mult_kernel_thrust; all platform dependence is localized in mult_kernel_thrust). We will need separate {mult,reduce}_kernel.cpp files for the DPC++ code, but these can again #include a common implementation source file (currently {mult,reduce}_kernel.cu; it is probably better to move to {mult,reduce}_kernel.ipp files).
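
A minimal sketch of that layering, using the mult_kernel / mult_kernel_thrust names mentioned above but with an illustrative type and signature (DeviceQueue and the double-only interface are placeholders, not the actual TiledArray API):

```cpp
#include <cstddef>

// Hypothetical device handle; in the real code this would be a CUDA/HIP stream
// or a SYCL queue.
struct DeviceQueue;

// Defined in exactly one platform-specific translation unit per build:
// mult_kernel.cu (CUDA), mult_kernel.hip (HIP), or a future mult_kernel.cpp
// (DPC++), each of which could #include a shared mult_kernel.ipp holding the
// common implementation.
void mult_kernel_thrust(double* result, const double* arg, std::size_t n,
                        DeviceQueue& queue);

// Platform-independent entry point: callers never see Thrust or oneDPL, so all
// platform dependence stays localized behind mult_kernel_thrust.
inline void mult_kernel(double* result, const double* arg, std::size_t n,
                        DeviceQueue& queue) {
  mult_kernel_thrust(result, arg, n, queue);
}
```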