ValeevGroup / tiledarray

A massively-parallel, block-sparse tensor framework written in C++
GNU General Public License v3.0

Questions about CUDA support #475

Closed MaxwellF1 closed 3 weeks ago

MaxwellF1 commented 1 month ago

Hi, great work! I have some questions about the CUDA support. I want to use TiledArray for tensor contraction on GPU platforms. Does the current implementation perform the whole tensor contraction on the GPU? In the source code I only saw calls to cuTT transpose and some other auxiliary kernels, but I did not find any calls to cuBLAS in the implementation of the "*" operator, even though cuBLAS is explicitly listed as a library dependency.

evaleev commented 1 month ago

@MaxwellF1 calls to {cu,roc}BLAS do not occur directly; instead we use the awesome blaspp API, which provides the proper abstractions for calling BLAS on both host and device. Calls to the device-specific blaspp functions can be found in https://github.com/ValeevGroup/tiledarray/blob/master/src/TiledArray/device/btas.h (note the extra "queue", aka stream, argument). Some operations are implemented directly (search for thrust, which is used to implement reductions, etc.).

Currently, to dispatch to CUDA/ROCm/HIP-capable devices, you need to construct DistArrays that live in memory spaces accessible to them. The recommended space is Unified Memory, which is automatically paged in and out of the device by the device driver; this way you can work with arrays that do not fit into GPU memory. Example use can be found here: https://github.com/ValeevGroup/tiledarray/blob/master/examples/device/ta_dense_device.cpp
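The Unified Memory mechanism referred to above can be sketched outside of TiledArray with plain CUDA (a minimal hypothetical example, not TiledArray code): `cudaMallocManaged` returns memory addressable from both host and device, and the driver migrates pages on demand, which is what allows working sets larger than device memory.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: scale each element of x by a.
__global__ void scale(double* x, double a, size_t n) {
  size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;
}

int main() {
  const size_t n = 1 << 20;
  double* x = nullptr;
  // Unified Memory: one pointer valid on host and device; the driver
  // pages data in/out of the GPU on demand, so the allocation may even
  // exceed the physical device memory.
  cudaMallocManaged(&x, n * sizeof(double));
  for (size_t i = 0; i < n; ++i) x[i] = 1.0;   // pages resident on host
  scale<<<(n + 255) / 256, 256>>>(x, 2.0, n);  // pages migrate to device
  cudaDeviceSynchronize();                     // wait, then touch on host
  std::printf("x[0] = %f\n", x[0]);
  cudaFree(x);
  return 0;
}
```

This requires a CUDA-capable device to run; the TiledArray example linked above wraps the same idea in UM-backed tile allocators.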