FlagOpen / FlagGems

FlagGems is an operator library for large language models implemented in Triton Language.
Apache License 2.0

failed to run reference on cpu without CUDA #129

Open TED-EE opened 1 month ago

TED-EE commented 1 month ago

Thanks for the help in issue #126. I ran into a problem when I tried to run the reference on CPU without CUDA. The steps to reproduce are as follows.

Requirements

pip install triton==2.2 (Requires Triton >= 2.2.0, <3.0.0)
pip install torch==2.1.2 (Requires PyTorch >= 2.1.2)
pip install transformers==4.42.3 (Requires Transformers >= 4.40.2)

Codebase

commit 95f5afaf0219c2085d1717e8cd85dff5cc7e3cdd (HEAD -> master)
Author: Clement Chan <iclementine@outlook.com>
Date:   Thu Jul 4 15:11:54 2024 +0800

    [codegen] generate gsl(grid-stride-loop) style pointwise kernel  (#91)

    * generate gsl(grid-stride-loop) style pointwise kernel to avoid grid_size exceeding the max grid size
    * add device guard around kernel launch
    * avoid assign to a constexpr since we are inlined into a loop
    * remove redundant code for rank-0 case

Installation:

git clone https://github.com/FlagOpen/FlagGems.git
cd FlagGems
pip install .

Run reference on cpu

cd tests
pytest test_unary_pointwise_ops.py::test_accuracy_abs[dtype0-shape0] --device cpu

Results

tests/test_unary_pointwise_ops.py F                                                                                                                                                                                                          [100%]

===================================================================================================================== FAILURES =====================================================================================================================
_________________________________________________________________________________________________________ test_accuracy_abs[dtype0-shape0] _________________________________________________________________________________________________________

shape = (1024, 1024), dtype = torch.float16

    @pytest.mark.parametrize("shape", POINTWISE_SHAPES)
    @pytest.mark.parametrize("dtype", FLOAT_DTYPES)
    def test_accuracy_abs(shape, dtype):
>       inp = torch.randn(shape, dtype=dtype, device="cuda")

tests/test_unary_pointwise_ops.py:19: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

    def _lazy_init():
        global _initialized, _queued_calls
        if is_initialized() or hasattr(_tls, "is_initializing"):
            return
        with _initialization_lock:
            # We be double-checked locking, boys!  This is OK because
            # the above test was GIL protected anyway.  The inner test
            # is for when a thread blocked on some other thread which was
            # doing the initialization; when they get the lock, they will
            # find there is nothing left to do.
            if is_initialized():
                return
            # It is important to prevent other threads from entering _lazy_init
            # immediately, while we are still guaranteed to have the GIL, because some
            # of the C calls we make below will release the GIL
            if _is_in_bad_fork():
                raise RuntimeError(
                    "Cannot re-initialize CUDA in forked subprocess. To use CUDA with "
                    "multiprocessing, you must use the 'spawn' start method"
                )
            if not hasattr(torch._C, "_cuda_getDeviceCount"):
                raise AssertionError("Torch not compiled with CUDA enabled")
            if _cudart is None:
                raise AssertionError(
                    "libcudart functions unavailable. It looks like you have a broken build?"
                )
            # This function throws if there's a driver initialization error, no GPUs
            # are found or any other error occurs
            if "CUDA_MODULE_LOADING" not in os.environ:
                os.environ["CUDA_MODULE_LOADING"] = "LAZY"
>           torch._C._cuda_init()
E           RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py:298: RuntimeError
================================================================================================================= warnings summary =================================================================================================================
../../../../../../../../../usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1233
  /usr/local/lib/python3.10/dist-packages/_pytest/config/__init__.py:1233: PytestConfigWarning: Unknown config option: pythonpath

    self._warn_or_fail_if_strict(f"Unknown config option: {key}\n")

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================================================================= short test summary info ==============================================================================================================
FAILED tests/test_unary_pointwise_ops.py::test_accuracy_abs[dtype0-shape0] - RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx
=========================================================================================================== 1 failed, 1 warning in 8.21s ===========================================================================================================

What confuses me is that running the reference on CPU should not actually need a CUDA device at all.
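For comparison, the reference computation itself needs no NVIDIA driver when the input is created on CPU directly (a minimal check, independent of the test harness; the float32-then-cast step just avoids relying on half-precision randn support on CPU):

import torch

# no NVIDIA driver is needed when the input never touches CUDA
inp = torch.randn((1024, 1024), device="cpu").to(torch.float16)
ref = torch.abs(inp)  # the torch reference for the abs test, computed on cpu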

StrongSpoon commented 1 month ago

even when running the reference on cpu, the test program creates the input tensors on CUDA first and then casts them to cpu. so the cuda driver is needed here. btw, without the cuda driver, you cannot run triton kernels either.
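in other words, the flow is roughly the following. a simplified sketch: TO_CPU and to_reference here only model what the test utilities do, the actual names and details may differ.

import torch

TO_CPU = True  # set when the tests run with --device cpu

def to_reference(inp: torch.Tensor) -> torch.Tensor:
    # the reference copy is derived from a tensor that already lives on CUDA,
    # so torch.cuda gets initialized even though the reference runs on cpu
    return inp.to("cpu") if TO_CPU else inp.clone()

inp = torch.randn((1024, 1024), dtype=torch.float16, device="cuda")  # fails without a driver
ref = torch.abs(to_reference(inp))  # reference on cpu, but too late: CUDA was already touched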

TED-EE commented 1 month ago

> even when running the reference on cpu, the test program creates the input tensors on CUDA first and then casts them to cpu. so the cuda driver is needed here. btw, without the cuda driver, you cannot run triton kernels either.

https://xie.infoq.cn/article/9ca517ab55eaf60361ed11889 -> "Once the FlagGems operator library is complete, developers and users of large models can replace ATen operators with FlagGems using just one line of code and conveniently deploy to NVIDIA GPUs or other AI chips, without having to worry about code modification or backend adaptation."

That means we can run Triton kernels on AI platforms other than CUDA GPUs. Suppose I create the input tensors and run the reference on another AI platform; the existing code:

test/python/triton/third_party/FlagGems/tests/test_unary_pointwise_ops.py:
def test_accuracy_abs(shape, dtype):
    inp = torch.randn(shape, dtype=dtype, device="cuda")

does not work properly because device="cuda" is hardcoded. The only way I can see is to change "cuda" to the device name of the other AI platform, which means modifying device="cuda" to device="another_device" in all of the FlagGems tests. That is quite tedious and scales poorly: every time users update to the latest repo, they have to replace every device="cuda" with device="another_device" all over again.

The solution I considered is to modify conftest.py: add choices=["cuda", "cpu", "another_device"] and adapt the tests to it (like TO_CPU); see the sketch below. Ideally, pytest test_unary_pointwise_ops.py::test_accuracy_abs[dtype0-shape0] --device another_device would then work. However, the problem described at the very beginning of https://github.com/FlagOpen/FlagGems/issues/129 arises, namely RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx, because device="cuda" is hardcoded.
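A minimal sketch of the conftest.py change I have in mind (the option name and the DEVICE plumbing are just illustrations, not existing FlagGems code):

# conftest.py -- illustrative sketch of the proposed --device option
import pytest

DEVICE = "cuda"  # default when --device is not given

def pytest_addoption(parser):
    parser.addoption(
        "--device",
        action="store",
        default="cuda",
        choices=["cuda", "cpu", "another_device"],
        help="device on which input tensors are created",
    )

def pytest_configure(config):
    # make the chosen device visible to the test modules
    global DEVICE
    DEVICE = config.getoption("--device")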

StrongSpoon commented 1 month ago

we developers cannot know in advance what platform users will run the tests on, or what device name should be used in tensor initialization. for the most common case, we initialize tensors on "cuda". on other ai platforms, replacing "cuda" with the specific device name is welcome. here is an example from the cambricon branch: https://github.com/FlagOpen/FlagGems/blob/cambricon/tests/test_unary_pointwise_ops.py#L20, in which "mlu" can be chosen via a pytest option. the test body then reads the device from that option instead of hardcoding it; see the sketch below.
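roughly like this (a sketch with approximated names; DEVICE is resolved from the pytest --device option):

import pytest
import torch

from .accuracy_utils import DEVICE, FLOAT_DTYPES, POINTWISE_SHAPES

@pytest.mark.parametrize("shape", POINTWISE_SHAPES)
@pytest.mark.parametrize("dtype", FLOAT_DTYPES)
def test_accuracy_abs(shape, dtype):
    # "mlu" on cambricon hardware, "cuda" elsewhere -- the rest of the test is unchanged
    inp = torch.randn(shape, dtype=dtype, device=DEVICE)
    ...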

TED-EE commented 1 month ago

will look into it, thanks a lot

TED-EE commented 1 month ago

@StrongSpoon The cambricon branch looks brilliant, but another problem remains unfixed: every time developers pull the latest repo, they need to resolve the conflict between inp = torch.randn(shape, dtype=dtype, device="cuda") and inp = torch.randn(shape, dtype=dtype, device=DEVICE), which is also quite tedious and scales poorly.

Many AI platform companies other than NVIDIA may use FlagGems. It may be a better solution to expose device=DEVICE to developers and assign "cuda" by default when the --device option is not specified, instead of hardcoding device="cuda". With this approach, non-CUDA developers could run the tests without resolving conflicts every time they pull the latest FlagGems repo; they would only need to add their AI device name to the choices in conftest.py.

StrongSpoon commented 1 month ago

thanks for your advice. we'll apply the other device options to the master branch after the migration is finished.

TED-EE commented 1 month ago

cheers

uniartisan commented 1 month ago

I'm wondering: since we still use torch.mlu.synchronize(), can we write a wrapper or an abstract adaptor? see the sketch below for what I mean.
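For instance, something like this (a hypothetical sketch; the attribute-based dispatch is my assumption, not existing FlagGems code):

import torch

def device_synchronize(device_type: str = "cuda") -> None:
    # dispatch to torch.cuda.synchronize(), torch.mlu.synchronize(), etc.
    # (torch.mlu only exists once the vendor plugin, e.g. torch_mlu, is imported)
    backend = getattr(torch, device_type, None)
    if backend is None or not hasattr(backend, "synchronize"):
        raise NotImplementedError(f"no synchronize() for device '{device_type}'")
    backend.synchronize()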

StrongSpoon commented 1 month ago

could you describe it in detail?