Crash in test_modules.py on H-100

deroholic commented 1 year ago

During the first step of training, I get a crash (trace below.) I then tried to run test_modules.py and it fails also (report below).

$ python -m bitsandbytes 

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

 and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
bin /home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/libbitsandbytes_cuda121.so
CUDA SETUP: CUDA runtime path found: /home/ubuntu/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 9.0
CUDA SETUP: Detected CUDA version 121
CUDA SETUP: Loading binary /home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/libbitsandbytes_cuda121.so...
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++ BUG REPORT INFORMATION ++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++ ANACONDA CUDA PATHS ++++++++++++++++++++
/home/ubuntu/miniconda3/envs/textgen/nsight-compute/2023.1.0/target/linux-desktop-t210-a64/libcuda-injection.so
/home/ubuntu/miniconda3/envs/textgen/nsight-compute/2023.1.0/target/linux-desktop-glibc_2_19_0-ppc64le/libcuda-injection.so
/home/ubuntu/miniconda3/envs/textgen/nsight-compute/2023.1.0/target/linux-desktop-glibc_2_11_3-x64/libcuda-injection.so
/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libc10_cuda.so
/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libtorch_cuda_linalg.so
/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so
/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/libbitsandbytes_cuda121.so
/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/t/libbitsandbytes_cuda120.so
/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/t/libbitsandbytes_cuda118.so
/home/ubuntu/miniconda3/envs/textgen/lib/libcudart.so
/home/ubuntu/miniconda3/envs/textgen/lib/stubs/libcuda.so

++++++++++++++++++ /usr/local CUDA PATHS +++++++++++++++++++
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so

+++++++++++++++ WORKING DIRECTORY CUDA PATHS +++++++++++++++

++++++++++++++++++ LD_LIBRARY CUDA PATHS +++++++++++++++++++
++++++++++ /usr/lib/x86_64-linux-gnu/ CUDA PATHS +++++++++++
/usr/lib/x86_64-linux-gnu/libcuda.so
/usr/lib/x86_64-linux-gnu/libcudart.so
/usr/lib/x86_64-linux-gnu/stubs/libcuda.so
 /usr/local/cuda-12.1/targets/x86_64-linux/lib/ CUDA PATHS +
/usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudart.so
/usr/local/cuda-12.1/targets/x86_64-linux/lib/stubs/libcuda.so

++++++++++++++++++++++++++ OTHER +++++++++++++++++++++++++++
COMPILED_WITH_CUDA = True
COMPUTE_CAPABILITIES_PER_GPU = ['9.0']
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++ DEBUG INFO END ++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Running a quick check that:
    + library is importable
    + CUDA function is callable

WARNING: Please be sure to sanitize sensible info from any such env vars!

SUCCESS!
Installation was successful!

cuBLAS API failed with status 15
error detectedA: torch.Size([304, 4096]), B: torch.Size([4096, 4096]), C: (304, 4096); (lda, ldb, ldc): (c_int(9728), c_int(131072), c_int(9728)); (m, n, k): (c_int(304), c_int(4096), c_int(4096))
Traceback (most recent call last):
  File "/home/ubuntu/train/train.py", line 318, in <module>
    fire.Fire(train)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/home/ubuntu/train/train.py", line 308, in train
    trainer.train(resume_from_checkpoint=resume_from_checkpoint)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1662, in train
    return inner_training_loop(
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 1929, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2699, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/trainer.py", line 2731, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1724, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/peft_model.py", line 686, in forward
    return self.base_model(
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 569, in forward
    layer_outputs = torch.utils.checkpoint.checkpoint(
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 249, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/utils/checkpoint.py", line 107, in forward
    outputs = run_function(*args)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 565, in custom_forward
    return module(*inputs, output_attentions, None)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/peft/tuners/lora.py", line 698, in forward
    result = super().forward(x)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/nn/modules.py", line 320, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/autograd/_functions.py", line 500, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/autograd/_functions.py", line 397, in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
  File "/home/ubuntu/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/functional.py", line 1436, in igemmlt
    raise Exception('cublasLt ran into an error!')
Exception: cublasLt ran into an error!

$ pytest -x test_modules.py
============================= test session starts ==============================
platform linux -- Python 3.10.9, pytest-7.3.1, pluggy-1.0.0
rootdir: /home/ubuntu/bitsandbytes
plugins: anyio-3.6.2
collected 7 items

test_modules.py cuBLAS API failed with status 15
error detectedF

=================================== FAILURES ===================================
__________________ test_linear8bitlt_inference[threshold_0.0] __________________

threshold = 0.0

    @pytest.mark.parametrize("threshold", values, ids=names)
    def test_linear8bitlt_inference(threshold):
        l1 = bnb.nn.Linear8bitLt(32, 64, threshold=threshold).cuda().half()
        assert l1.weight.device.type == "cuda"
        assert l1.weight.dtype == torch.float16

        l1.eval()
        for i in range(100):
            b1 = torch.randn(16, 8, 32, device="cuda").half()
>           o1 = l1(b1)

test_modules.py:327: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
../../miniconda3/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py:1501: in _call_impl
    return forward_call(*args, **kwargs)
../../miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/nn/modules.py:320: in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
../../miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/autograd/_functions.py:500: in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
../../miniconda3/envs/textgen/lib/python3.10/site-packages/torch/autograd/function.py:506: in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
../../miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/autograd/_functions.py:397: in forward
    out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

A = tensor([[ -43,  -51,   96,  ...,   90,    5,   -3],
        [  51,  -77,  -46,  ...,  -11, -106,   88],
        [  49,... ...,   39,  -33,   18],
        [ -11,   17,   78,  ...,  -17,  -14, -127]], device='cuda:0',
       dtype=torch.int8)
B = tensor([[ -16,   86,  -92,  ...,   81,  -22,  -85],
        [  79,  -51, -122,  ...,  101, -106, -109],
        [ -48,... ...,  -70,  -27,  -83],
        [   3, -120,  -65,  ...,  -54,  -33,  -52]], device='cuda:0',
       dtype=torch.int8)
SA = (torch.Size([128, 32]), 'col32'), SB = (torch.Size([64, 32]), 'col_turing')
out = tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
   ......, 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]], device='cuda:0', dtype=torch.int32)
Sout = ((128, 64), 'col32'), dtype = torch.int32

    def igemmlt(A, B, SA, SB, out=None, Sout=None, dtype=torch.int32):
        shapeA = SA[0]
        shapeB = SB[0]
        dimsA = len(shapeA)
        dimsB = len(shapeB)
        assert dimsB == 2, 'Only two dimensional matrices are supported for argument B'
        if dimsA == 2:
            m = shapeA[0]
        elif dimsA == 3:
            m = shapeA[0] * shapeA[1]

        rows = n = shapeB[0]
        assert prod(list(shapeA)) > 0, f'Input tensor dimensions need to be > 0: {shapeA}'

        # if the tensor is empty, return a transformed empty tensor with the right dimensions
        if shapeA[0] == 0 and dimsA == 2:
            return torch.empty((0, shapeB[0]), device=A.device, dtype=torch.float16)
        elif shapeA[1] == 0 and dimsA == 3:
            return torch.empty(tuple(shapeA[:2] + [shapeB[0]]), device=A.device, dtype=torch.float16)

        if dimsA == 2 and out is None:
            out, Sout = get_transform_buffer(
                (shapeA[0], shapeB[0]), dtype, A.device, "col32", "row"
            )
        elif dimsA == 3 and out is None:
            out, Sout = get_transform_buffer(
                (shapeA[0], shapeA[1], shapeB[0]), dtype, A.device, "col32", "row"
            )

        assert dimsB != 3, "len(B.shape)==3 not supported"
        assert A.device.type == "cuda"
        assert B.device.type == "cuda"
        assert A.dtype == torch.int8
        assert B.dtype == torch.int8
        assert out.dtype == dtype
        assert SA[1] == "col32"
        assert SB[1] in ["col_turing", "col_ampere"]
        assert Sout[1] == "col32"
        assert (
            shapeA[-1] == shapeB[-1]
        ), f"Matmullt only supports A @ B^T. Inner matrix dimensions do not match: A @ B = {shapeA} @ {shapeB}"
        formatB = SB[1]
        prev_device = A.device
        torch.cuda.set_device(A.device)

        ptr = CUBLAS_Context.get_instance().get_context(A.device)
        ptrA = get_ptr(A)
        ptrB = get_ptr(B)
        ptrC = get_ptr(out)

        k = shapeA[-1]
        lda = ct.c_int32(m * 32)
        if formatB == "col_turing":
            # turing: tiles with rows filled up to multiple of 8 rows by 32 columns
            # n = rows
            ldb = ct.c_int32(((rows + 7) // 8) * 8 * 32)
        else:
            # ampere: tiles with rows filled up to multiple of 32 rows by 32 columns
            # n = rows
            ldb = ct.c_int32(((rows + 31) // 32) * 32 * 32)

        ldc = ct.c_int32(m * 32)
        m = ct.c_int32(m)
        n = ct.c_int32(n)
        k = ct.c_int32(k)

        has_error = 0
        ptrRowScale = get_ptr(None)
        is_on_gpu([A, B, out])
        if formatB == 'col_turing':
            if dtype == torch.int32:
                has_error = lib.cigemmlt_turing_32(
                    ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
                )
            else:
                has_error = lib.cigemmlt_turing_8(
                    ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
                )
        elif formatB == "col_ampere":
            if dtype == torch.int32:
                has_error = lib.cigemmlt_ampere_32(
                    ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
                )
            else:
                has_error = lib.cigemmlt_ampere_8(
                    ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
                )

        if has_error == 1:
            print(f'A: {shapeA}, B: {shapeB}, C: {Sout[0]}; (lda, ldb, ldc): {(lda, ldb, ldc)}; (m, n, k): {(m, n, k)}')
>           raise Exception('cublasLt ran into an error!')
E           Exception: cublasLt ran into an error!

../../miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes-0.38.1-py3.10.egg/bitsandbytes/functional.py:1436: Exception
----------------------------- Captured stdout call -----------------------------
A: torch.Size([128, 32]), B: torch.Size([64, 32]), C: (128, 64); (lda, ldb, ldc): (c_int(4096), c_int(2048), c_int(4096)); (m, n, k): (c_int(128), c_int(64), c_int(32))
=========================== short test summary info ============================
FAILED test_modules.py::test_linear8bitlt_inference[threshold_0.0] - Exceptio...
!!!!!!!!!!!!!!!!!!!!!!!!!! stopping after 1 failures !!!!!!!!!!!!!!!!!!!!!!!!!!!
============================== 1 failed in 6.67s ===============================

Thu May 11 02:31:58 2023       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA H100 PCIe                On | 00000000:06:00.0 Off |                    0 |
| N/A   38C    P0               48W / 350W|      0MiB / 81559MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+

deroholic commented 1 year ago

I'm not really equipped to debug this issue, but a quick look and this seems a little strange:

https://github.com/TimDettmers/bitsandbytes/blob/9e7cdc9ea95e9756d9f5621a0e2c7e2538363fae/bitsandbytes/functional.py#L279

H-100 is a sm_90 device. Does this mean that sm_90 uses the col_turing format again?

Malfaro43 commented 1 year ago

@deroholic I'm also running into this issue on an H100. Have you fixed this yet, and if so what is your solution?

deroholic commented 1 year ago

No. I am awaiting official response. Somewhat critical as it impacts all of our plans to scale with the H-100.

Ar770 commented 1 year ago

Hi, I was in touch with the lambdalabs team about the same error, and this is what they've found out: I tried to run the notebook on an A10 and I got past the error. Looking closely into the code (~/.local/lib/python3.8/site-packages/bitsandbytes/functional.py) that throws the cublasLt ran into an error! prompt, it seems that it has conditions to check the architecture of the GPU (Turing or Ampere):


    has_error = 0
    ptrRowScale = get_ptr(None)
    is_on_gpu([A, B, out])
    if formatB == 'col_turing':
        if dtype == torch.int32:
            has_error = lib.cigemmlt_turing_32(
                ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
            )
        else:
            has_error = lib.cigemmlt_turing_8(
                ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
            )
    elif formatB == "col_ampere":
        if dtype == torch.int32:
            has_error = lib.cigemmlt_ampere_32(
                ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
            )
        else:
            has_error = lib.cigemmlt_ampere_8(
                ptr, m, n, k, ptrA, ptrB, ptrC, ptrRowScale, lda, ldb, ldc
            )

    if has_error == 1:
        print(f'A: {shapeA}, B: {shapeB}, C: {Sout[0]}; (lda, ldb, ldc): {(lda, ldb, ldc)}; (m, n, k): {(m, n, k)}')
        raise Exception('cublasLt ran into an error!')

An A10 is built under the Ampere architecture. An H100 is using the Hopper architecture - which is not any of the code's conditions. I think that the program you are trying to run is not yet compatible with the H100. You could reach out to the developer to confirm this.

If this is actually the problem that causes the error, do you have any plans to take care of it in the near future? Thanks

github-actions[bot] commented 9 months ago

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

bitsandbytes-foundation / bitsandbytes

Crash in test_modules.py on H-100 #383