MaxwellF1:
Hi, great work! I have some questions about the CUDA support. I want to use TiledArray for tensor contraction on GPU platforms. Does the current implementation perform the whole tensor contraction on the GPU? In the source code I only see calls to the cuTT transpose and some other auxiliary kernels, but I did not find any calls to cuBLAS in the implementation of the "*" operator, even though cuBLAS is explicitly specified as a library dependency.
@MaxwellF1 Calls to {cu,roc}BLAS do not occur directly; instead we use the awesome blaspp API, which provides the proper abstractions for using BLAS on both host and device. Calls to device-specific blaspp functions can be found in https://github.com/ValeevGroup/tiledarray/blob/master/src/TiledArray/device/btas.h (note the extra "queue", aka stream, argument). Some operations are implemented directly (search for thrust, which is used to implement reductions, etc.).
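
For a flavor of what those calls look like, here is a minimal standalone sketch of a device GEMM through blaspp (this is illustrative, not TiledArray's internal code); the `blas::Queue` constructor and the device memory helpers have changed signatures across blaspp versions, so check against the blaspp release you build with:

```cpp
// Minimal sketch of a device GEMM via blaspp (illustrative only).
// Signatures of Queue and the device_* helpers vary across blaspp versions.
#include <blas.hh>
#include <vector>

int main() {
  int64_t m = 512, n = 512, k = 512;
  std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);

  // A queue wraps a device + stream; device BLAS calls take it as the last argument.
  blas::Queue queue(0 /* device id; some versions also take a batch-size argument */);

  double* dA = blas::device_malloc<double>(m * k, queue);
  double* dB = blas::device_malloc<double>(k * n, queue);
  double* dC = blas::device_malloc<double>(m * n, queue);

  blas::device_setmatrix(m, k, A.data(), m, dA, m, queue);
  blas::device_setmatrix(k, n, B.data(), k, dB, k, queue);
  blas::device_setmatrix(m, n, C.data(), m, dC, m, queue);

  // Same blas::gemm name as the host overload; the trailing queue argument
  // dispatches to cuBLAS/rocBLAS under the hood.
  blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
             m, n, k, 1.0, dA, m, dB, k, 0.0, dC, m, queue);

  blas::device_getmatrix(m, n, dC, m, C.data(), m, queue);
  queue.sync();  // wait for the enqueued work to finish

  blas::device_free(dA, queue);
  blas::device_free(dB, queue);
  blas::device_free(dC, queue);
  return 0;
}
```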
Currently, to dispatch to CUDA/ROCm/HIP-capable devices you need to construct DistArrays that live in memory spaces accessible to them. The recommended space is Unified Memory, which is automatically paged in and out of the device by the device driver; this way you can work with arrays that do not fit into GPU memory. Example use can be found here: https://github.com/ValeevGroup/tiledarray/blob/master/examples/device/ta_dense_device.cpp
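
For concreteness, below is a minimal sketch of such a contraction, modeled on that example. The UM-backed tile alias (`TA::btasUMTensorVarray`) is borrowed from the example and its exact name/header may differ across TiledArray versions, so treat it as an assumption and check the linked sources:

```cpp
// Sketch of a device-resident contraction, modeled on examples/device/ta_dense_device.cpp.
// The UM-backed tile alias below is assumed from that example; verify against the sources.
#include <tiledarray.h>
#include <vector>

int main(int argc, char** argv) {
  auto& world = TA::initialize(argc, argv);
  {
    // Block a 1024x1024 matrix into 4x4 tiles of 256x256 each.
    std::vector<std::size_t> blocks;
    for (std::size_t i = 0; i <= 1024; i += 256) blocks.push_back(i);
    TA::TiledRange1 tr1(blocks.begin(), blocks.end());
    TA::TiledRange trange({tr1, tr1});

    // Tiles backed by Unified Memory, so the driver pages them in/out of the GPU
    // on demand; alias name assumed from the example.
    using UMTile = TA::Tile<TA::btasUMTensorVarray<double>>;
    TA::DistArray<UMTile> a(world, trange), b(world, trange), c(world, trange);
    a.fill(1.0);
    b.fill(1.0);

    // The usual expression layer; the tile type routes the per-tile GEMMs
    // to the device via blaspp.
    c("i,j") = a("i,k") * b("k,j");
    world.gop.fence();
  }  // arrays must be destroyed before finalize
  TA::finalize();
  return 0;
}
```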