MaxwellF1:
Hi, great work! I have some questions about the CUDA support. I want to use TiledArray for tensor contraction on GPU platforms. Does the current implementation perform the whole tensor contraction on the GPU? In the source code I only see calls to the cuTT transpose and some other auxiliary kernels, but I did not find any calls to cuBLAS in the implementation of the "*" operator, even though cuBLAS is explicitly specified as a library dependency.
@MaxwellF1 Calls to {cu,roc}BLAS do not occur directly; instead we use the awesome blaspp API, which provides the proper abstractions for using BLAS on both host and device. Calls to device-specific blaspp functions can be found in https://github.com/ValeevGroup/tiledarray/blob/master/src/TiledArray/device/btas.h (note the extra "queue", aka stream, argument). Some operations are implemented directly (search for thrust, which is used to implement reductions, etc.).
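
For a flavor of what those calls look like, here is a minimal standalone sketch of a device GEMM through blaspp (this is illustrative, not TiledArray's internal code); the `blas::Queue` constructor and the device memory helpers have changed signatures across blaspp versions, so check against the blaspp release you build with:

```cpp
// Minimal sketch of a device GEMM via blaspp (illustrative only).
// Signatures of Queue and the device_* helpers vary across blaspp versions.
#include <blas.hh>
#include <vector>

int main() {
  int64_t m = 512, n = 512, k = 512;
  std::vector<double> A(m * k, 1.0), B(k * n, 1.0), C(m * n, 0.0);

  // A queue wraps a device + stream; device BLAS calls take it as the last argument.
  blas::Queue queue(0 /* device id; some versions also take a batch-size argument */);

  double* dA = blas::device_malloc<double>(m * k, queue);
  double* dB = blas::device_malloc<double>(k * n, queue);
  double* dC = blas::device_malloc<double>(m * n, queue);

  blas::device_setmatrix(m, k, A.data(), m, dA, m, queue);
  blas::device_setmatrix(k, n, B.data(), k, dB, k, queue);
  blas::device_setmatrix(m, n, C.data(), m, dC, m, queue);

  // Same blas::gemm name as the host overload; the trailing queue argument
  // dispatches to cuBLAS/rocBLAS under the hood.
  blas::gemm(blas::Layout::ColMajor, blas::Op::NoTrans, blas::Op::NoTrans,
             m, n, k, 1.0, dA, m, dB, k, 0.0, dC, m, queue);

  blas::device_getmatrix(m, n, dC, m, C.data(), m, queue);
  queue.sync();  // wait for the enqueued work to finish

  blas::device_free(dA, queue);
  blas::device_free(dB, queue);
  blas::device_free(dC, queue);
  return 0;
}
```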
Currently, to dispatch to CUDA/ROCm/HIP-capable devices you need to construct DistArrays that live in memory spaces accessible to them. The recommended space is Unified Memory, which is automatically paged in and out of the device by the device driver; this way you can work with arrays that do not fit into GPU memory. Example use can be found here: https://github.com/ValeevGroup/tiledarray/blob/master/examples/device/ta_dense_device.cpp
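
For concreteness, below is a minimal sketch of such a contraction, modeled on that example. The UM-backed tile alias (`TA::btasUMTensorVarray`) is borrowed from the example and its exact name/header may differ across TiledArray versions, so treat it as an assumption and check the linked sources:

```cpp
// Sketch of a device-resident contraction, modeled on examples/device/ta_dense_device.cpp.
// The UM-backed tile alias below is assumed from that example; verify against the sources.
#include <tiledarray.h>
#include <vector>

int main(int argc, char** argv) {
  auto& world = TA::initialize(argc, argv);
  {
    // Block a 1024x1024 matrix into 4x4 tiles of 256x256 each.
    std::vector<std::size_t> blocks;
    for (std::size_t i = 0; i <= 1024; i += 256) blocks.push_back(i);
    TA::TiledRange1 tr1(blocks.begin(), blocks.end());
    TA::TiledRange trange({tr1, tr1});

    // Tiles backed by Unified Memory, so the driver pages them in/out of the GPU
    // on demand; alias name assumed from the example.
    using UMTile = TA::Tile<TA::btasUMTensorVarray<double>>;
    TA::DistArray<UMTile> a(world, trange), b(world, trange), c(world, trange);
    a.fill(1.0);
    b.fill(1.0);

    // The usual expression layer; the tile type routes the per-tile GEMMs
    // to the device via blaspp.
    c("i,j") = a("i,k") * b("k,j");
    world.gop.fence();
  }  // arrays must be destroyed before finalize
  TA::finalize();
  return 0;
}
```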