deepmodeling / abacus-develop

An electronic structure package based on either plane wave basis or numerical atomic orbitals.
http://abacus.ustc.edu.cn
GNU Lesser General Public License v3.0

Is it better to have a module-wise `device` tag for each multi-device supported class? #5231

Open maki49 opened 1 month ago

maki49 commented 1 month ago

Currently, `device` is a global input parameter: when `device=gpu`, all calculations are done on GPUs. However, "the machine has GPUs" does not imply "every module should calculate on GPUs". For example, @dzzz2001 found that putting the FFT module on the GPU costs a lot of memory while providing little acceleration when using the LCAO basis.

A possible solution is a global registry that records the device of each module, using the class name as a key-like template parameter. Here is a demo:

#include <iostream>
#include <string>

class FFT;
class HSolver;

// Default: every module follows the global device setting.
template <typename Tclass>
std::string modulewise_device(const std::string& global_dev, const std::string& basis)
{
    return global_dev;
}

// Specialization: under the pw basis, FFT follows the global setting;
// under lcao it always falls back to the CPU.
template <>
std::string modulewise_device<FFT>(const std::string& global_dev, const std::string& basis)
{
    return basis == "pw" ? global_dev : "cpu";
}

int main()
{
    for (auto device : { "gpu", "cpu" })
    {
        std::cout << "global device=" << device << std::endl;
        for (auto basis : { "pw", "lcao" })
        {
            std::cout << "FFT[" << basis << "]:" << modulewise_device<FFT>(device, basis) << std::endl;
        }
    }
    return 0;
}

Output:
global device=gpu
FFT[pw]:gpu
FFT[lcao]:cpu
global device=cpu
FFT[pw]:cpu
FFT[lcao]:cpu
dzzz2001 commented 1 month ago

The problem of allocating a large amount of irrelevant GPU memory under the LCAO basis when the device is set to GPU is described in issue #4442.

Cstandardlib commented 1 month ago

The kernel ops currently used by hsolver are undergoing a round of refactoring, and they face the same issue: distinguishing the devices a module can support from the device it should actually use. The interface is the crucial matter, and we should standardize it across all functionality that uses multiple devices, such as FFT and the kernel operators. I suggest a further discussion to align on a unified standard, to ensure scalability and ease portability across various devices, including supercomputers. A sketch of what such a standard could look like follows below. @Critsium-xy
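To make the discussion concrete, here is one possible shape for such a unified standard: a shared device tag plus a per-module policy that the runtime resolves against the global setting. This is a minimal sketch; the Device enum, ModulePolicy struct, and resolve_device helper are hypothetical names for illustration, not existing ABACUS APIs.

#include <iostream>
#include <string>

// Hypothetical device tag shared by all multi-device modules.
enum class Device { CPU, GPU };

// Hypothetical per-module policy: each module declares whether a GPU
// kernel exists and whether the GPU path is expected to pay off.
struct ModulePolicy {
    std::string name;
    bool gpu_supported;
    bool gpu_worthwhile;
};

// Resolve the effective device from the global setting and the policy.
Device resolve_device(Device global, const ModulePolicy& p)
{
    if (global == Device::GPU && p.gpu_supported && p.gpu_worthwhile)
        return Device::GPU;
    return Device::CPU;  // fall back to CPU otherwise
}

int main()
{
    // FFT under lcao: a GPU kernel exists, but it wastes memory (see #4442).
    const ModulePolicy fft_lcao{"FFT (lcao basis)", true, false};
    const ModulePolicy hsolver {"HSolver", true, true};
    for (const auto& p : { fft_lcao, hsolver })
    {
        std::cout << p.name << " -> "
                  << (resolve_device(Device::GPU, p) == Device::GPU ? "gpu" : "cpu")
                  << std::endl;
    }
    return 0;
}

The point of the policy struct is that the "can this module use a GPU?" and "should it?" questions are answered next to the module's code, instead of being scattered through call sites that check the global flag.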

Critsium-xy commented 1 month ago

@mohanchen suggested implementing heterogeneous computing directly in blas_connector. blas_connector.h currently both links the CBLAS kernels and encapsulates the BLAS kernels, and he wants to separate these two parts: for example, leave the declaration of BlasConnector::gemm in blas_connector.h and move its implementation into blas_connector.cpp. He also suggested that, to let BlasConnector::gemm support different platforms, we add a device flag parameter whose default value represents the CPU, and use it to decide which kernel to call (a cuBLAS kernel? a CBLAS kernel? a hipBLAS kernel?). This may have a performance cost, but probably not a huge one (I haven't tested it yet). Once this is done, the other BLAS encapsulations, such as the ops in module_hsolver or in @denghuilu's tensor, can be thrown away. But in fact I don't know whether it is really a good idea.
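A rough, untested sketch of that idea for the sgemm case. The DeviceFlag enum is a hypothetical placeholder for whatever device type ABACUS settles on, the __CUDA guard follows common GPU-build conventions, and handle creation is shown inline only for brevity (a real implementation would cache it):

// blas_connector.h -- declaration only; the device flag defaults to CPU
// so all existing call sites keep compiling unchanged.
enum class DeviceFlag { Cpu, Gpu };  // hypothetical, for illustration

namespace BlasConnector {
void gemm(char transa, char transb, int m, int n, int k,
          float alpha, const float* a, int lda,
          const float* b, int ldb, float beta,
          float* c, int ldc,
          DeviceFlag dev = DeviceFlag::Cpu);
}

// blas_connector.cpp -- the flag selects the kernel at run time.
#include <cblas.h>
#ifdef __CUDA
#include <cublas_v2.h>
#endif

void BlasConnector::gemm(char transa, char transb, int m, int n, int k,
                         float alpha, const float* a, int lda,
                         const float* b, int ldb, float beta,
                         float* c, int ldc, DeviceFlag dev)
{
    if (dev == DeviceFlag::Cpu)
    {
        // CBLAS path; column-major to match the Fortran BLAS convention.
        // Only 'N'/'T' are handled here to keep the sketch short.
        cblas_sgemm(CblasColMajor,
                    transa == 'N' ? CblasNoTrans : CblasTrans,
                    transb == 'N' ? CblasNoTrans : CblasTrans,
                    m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
    }
#ifdef __CUDA
    else
    {
        // cuBLAS path; assumes a, b, c already live in device memory.
        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSgemm(handle,
                    transa == 'N' ? CUBLAS_OP_N : CUBLAS_OP_T,
                    transb == 'N' ? CUBLAS_OP_N : CUBLAS_OP_T,
                    m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc);
        cublasDestroy(handle);
    }
#endif
}

The defaulted flag is what keeps the migration cheap: CPU callers need no changes, while modules that want a GPU path opt in per call, which also fits the module-wise device idea at the top of this issue.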