facebookresearch / xformers

Hackable and optimized Transformers building blocks, supporting a composable construction.
https://facebookresearch.github.io/xformers/

6 tests failing in test_core_attention #544

Open Thomas-MMJ opened 2 years ago

Thomas-MMJ commented 2 years ago

🐛 Bug

In test_core_attention, the following tests fail: test_switch_blocksparse_dropout[0.0-True-cuda], test_switch_blocksparse_dropout[0.0-False-cuda], test_switch_blocksparse_dims[cuda], test_switch_blocksparse_dropout[0.3-False-cuda], test_switch_blocksparse[data_type1-cuda], and test_switch_blocksparse_dropout[0.3-True-cuda].

Here is the output:

pytest tests/test_core_attention.py
=================================================================== test session starts ====================================================================
platform linux -- Python 3.9.13, pytest-7.2.0, pluggy-1.0.0
Using --randomly-seed=1038051739
rootdir: /mnt/c/Users/tommu/xformers
plugins: randomly-3.12.0, timeout-2.1.0, forked-1.4.0, hydra-core-1.2.0, xdist-3.0.2, picked-0.4.6, anyio-3.6.2
collected 17 items

tests/test_core_attention.py .F....F..FFF....F                                                                                                       [100%]

========================================================================= FAILURES =========================================================================
______________________________________________________ test_switch_blocksparse_dropout[0.0-True-cuda] ______________________________________________________
A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0...., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-f24b6aa9b101a518b6a4a6bddded372e-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', torch.int32), (128, 128, 32, 128, True), (True, True, True, (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), True))

<string>:21: KeyError

During handling of the above exception, another exception occurred:

device = 'cuda', training = True, drop_prob = 0.0

    @pytest.mark.skipif(
        not _is_blocksparse_available, reason="Blocksparse is not available"
    )
    @pytest.mark.parametrize("device", ["cuda"])
    @pytest.mark.parametrize("training", [True, False])
    @pytest.mark.parametrize("drop_prob", [0.0, 0.3])
    def test_switch_blocksparse_dropout(device, training, drop_prob):
        b, s, d = 8, 128, 32

        a = torch.rand(b, s, d, device=device)

        m = AttentionMask.make_causal(s, s, device)
        dropout = nn.Dropout(drop_prob)
        dropout.train(training).cuda()

        with torch.cuda.amp.autocast():
>           r = scaled_dot_product_attention(a, a, a, m)

tests/test_core_attention.py:203:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xformers/components/attention/core.py:328: in scaled_dot_product_attention
    return blocksparse_attention(q, k, v, dropout, block_size)
xformers/components/attention/core.py:289: in blocksparse_attention
    att = blocksparse_attention(q, k, v)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py:1423: in _call_impl
    return forward_call(*input, **kwargs)
xformers/components/attention/blocksparse.py:179: in forward
    sparse_att_mat = self.sparse_dot_sdd(q, k)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:430: in __call__
    c = _matmul.apply(
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:363: in forward
    c = _matmul.fn[mode](a, b, trans_a, trans_b, trans_c, spdims, block, c_lut, c_width, out=out)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:106: in sdd_matmul
    _sdd_kernel[grid](
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/jit.py:106: in launcher
    return self.run(*args, grid=grid, **kwargs)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/autotuner.py:200: in run
    return self.fn.run(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0...., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   RuntimeError: Triton Error [CUDA]: invalid argument

<string>:43: RuntimeError
_____________________________________________________ test_switch_blocksparse_dropout[0.0-False-cuda] ______________________________________________________

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
            0.0000e+00,  0.0000e+00],
          [...   [ 4.2526e-16,  3.6108e-16,  4.0099e-16,  ...,  0.0000e+00,
            0.0000e+00,  0.0000e+00]]]], device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-f24b6aa9b101a518b6a4a6bddded372e-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', torch.int32), (128, 128, 32, 128, True), (True, True, True, (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), True))

<string>:21: KeyError

During handling of the above exception, another exception occurred:

device = 'cuda', training = False, drop_prob = 0.0

    @pytest.mark.skipif(
        not _is_blocksparse_available, reason="Blocksparse is not available"
    )
    @pytest.mark.parametrize("device", ["cuda"])
    @pytest.mark.parametrize("training", [True, False])
    @pytest.mark.parametrize("drop_prob", [0.0, 0.3])
    def test_switch_blocksparse_dropout(device, training, drop_prob):
        b, s, d = 8, 128, 32

        a = torch.rand(b, s, d, device=device)

        m = AttentionMask.make_causal(s, s, device)
        dropout = nn.Dropout(drop_prob)
        dropout.train(training).cuda()

        with torch.cuda.amp.autocast():
>           r = scaled_dot_product_attention(a, a, a, m)

tests/test_core_attention.py:203:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xformers/components/attention/core.py:328: in scaled_dot_product_attention
    return blocksparse_attention(q, k, v, dropout, block_size)
xformers/components/attention/core.py:289: in blocksparse_attention
    att = blocksparse_attention(q, k, v)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py:1423: in _call_impl
    return forward_call(*input, **kwargs)
xformers/components/attention/blocksparse.py:179: in forward
    sparse_att_mat = self.sparse_dot_sdd(q, k)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:430: in __call__
    c = _matmul.apply(
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:363: in forward
    c = _matmul.fn[mode](a, b, trans_a, trans_b, trans_c, spdims, block, c_lut, c_width, out=out)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:106: in sdd_matmul
    _sdd_kernel[grid](
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/jit.py:106: in launcher
    return self.run(*args, grid=grid, **kwargs)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/autotuner.py:200: in run
    return self.fn.run(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  0.0000e+00,
            0.0000e+00,  0.0000e+00],
          [...   [ 4.2526e-16,  3.6108e-16,  4.0099e-16,  ...,  0.0000e+00,
            0.0000e+00,  0.0000e+00]]]], device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   RuntimeError: Triton Error [CUDA]: invalid argument

<string>:43: RuntimeError
____________________________________________________________ test_switch_blocksparse_dims[cuda] ____________________________________________________________

A = tensor([[[[0.3637, 0.1267, 0.3611, 0.2961],
          [0.3172, 0.4863, 0.3122, 0.2344],
          [0.2237, 0.3370, 0.1...0, 0.3843],
          [0.0312, 0.2253, 0.3568, 0.3830],
          [0.2060, 0.3877, 0.4917, 0.2456]]]], device='cuda:0')
B = tensor([[[[0.7274, 0.2535, 0.7223, 0.5923],
          [0.6345, 0.9727, 0.6243, 0.4687],
          [0.4474, 0.6741, 0.2...1, 0.7685],
          [0.0624, 0.4507, 0.7136, 0.7659],
          [0.4121, 0.7753, 0.9835, 0.4911]]]], device='cuda:0')
C = tensor([[[[7.2754e-01, 2.5342e-01, 7.2217e-01,  ..., 2.2364e-01,
           8.3341e-01, 3.5529e-01],
          [5.9952...         [1.3883e-20, 2.0766e-20, 5.6149e-21,  ..., 4.3436e-20,
           5.0187e-19, 1.3764e-18]]]], device='cuda:0')
stride_za = 4096, stride_ha = 512, stride_ma = 4, stride_ak = 1, stride_zb = 4096, stride_hb = 512, stride_bk = 1, stride_nb = 4, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 4, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = False, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-f24b6aa9b101a518b6a4a6bddded372e-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', torch.int32), (128, 128, 32, 128, False), (True, True, True, (True, False), (True, False), (False, False), (False, True), (True, False), (True, False), (False, True), (False, False), (True, False), (True, False), (True, False), (False, True), (False, False), (True, False), True))

<string>:21: KeyError

During handling of the above exception, another exception occurred:

device = 'cuda'

    @pytest.mark.skipif(
        not _is_blocksparse_available, reason="Blocksparse is not available"
    )
    @pytest.mark.parametrize("device", ["cuda"])
    def test_switch_blocksparse_dims(device):
        b, s, d, nh = 8, 128, 32, 8
        hs = d // nh

        data_type = torch.float32
        a = torch.rand(b, nh, s, hs, device=device, dtype=data_type)
        # Mask with causal flag
        m = AttentionMask.make_causal(s, s, device, dtype=a.dtype)

        # Check that passing qkv with shape (B, nh, S, hs) is properly handled
        with torch.cuda.amp.autocast():
>           r = scaled_dot_product_attention(a, a, a, m)

tests/test_core_attention.py:181:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xformers/components/attention/core.py:328: in scaled_dot_product_attention
    return blocksparse_attention(q, k, v, dropout, block_size)
xformers/components/attention/core.py:289: in blocksparse_attention
    att = blocksparse_attention(q, k, v)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py:1423: in _call_impl
    return forward_call(*input, **kwargs)
xformers/components/attention/blocksparse.py:179: in forward
    sparse_att_mat = self.sparse_dot_sdd(q, k)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:430: in __call__
    c = _matmul.apply(
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:363: in forward
    c = _matmul.fn[mode](a, b, trans_a, trans_b, trans_c, spdims, block, c_lut, c_width, out=out)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:106: in sdd_matmul
    _sdd_kernel[grid](
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/jit.py:106: in launcher
    return self.run(*args, grid=grid, **kwargs)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/autotuner.py:200: in run
    return self.fn.run(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([[[[0.3637, 0.1267, 0.3611, 0.2961],
          [0.3172, 0.4863, 0.3122, 0.2344],
          [0.2237, 0.3370, 0.1...0, 0.3843],
          [0.0312, 0.2253, 0.3568, 0.3830],
          [0.2060, 0.3877, 0.4917, 0.2456]]]], device='cuda:0')
B = tensor([[[[0.7274, 0.2535, 0.7223, 0.5923],
          [0.6345, 0.9727, 0.6243, 0.4687],
          [0.4474, 0.6741, 0.2...1, 0.7685],
          [0.0624, 0.4507, 0.7136, 0.7659],
          [0.4121, 0.7753, 0.9835, 0.4911]]]], device='cuda:0')
C = tensor([[[[7.2754e-01, 2.5342e-01, 7.2217e-01,  ..., 2.2364e-01,
           8.3341e-01, 3.5529e-01],
          [5.9952...         [1.3883e-20, 2.0766e-20, 5.6149e-21,  ..., 4.3436e-20,
           5.0187e-19, 1.3764e-18]]]], device='cuda:0')
stride_za = 4096, stride_ha = 512, stride_ma = 4, stride_ak = 1, stride_zb = 4096, stride_hb = 512, stride_bk = 1, stride_nb = 4, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 4, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = False, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   RuntimeError: Triton Error [CUDA]: invalid argument

<string>:43: RuntimeError
_____________________________________________________ test_switch_blocksparse_dropout[0.3-False-cuda] ______________________________________________________

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[0.1286, 0.0448, 0.1277,  ..., 0.0628, 0.1654, 0.0160],
          [0.0491, 0.0933, 0.1176,  ..., 0.0101, 0.0..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]]],
       device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-f24b6aa9b101a518b6a4a6bddded372e-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', torch.int32), (128, 128, 32, 128, True), (True, True, True, (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), True))

<string>:21: KeyError

During handling of the above exception, another exception occurred:

device = 'cuda', training = False, drop_prob = 0.3

    @pytest.mark.skipif(
        not _is_blocksparse_available, reason="Blocksparse is not available"
    )
    @pytest.mark.parametrize("device", ["cuda"])
    @pytest.mark.parametrize("training", [True, False])
    @pytest.mark.parametrize("drop_prob", [0.0, 0.3])
    def test_switch_blocksparse_dropout(device, training, drop_prob):
        b, s, d = 8, 128, 32

        a = torch.rand(b, s, d, device=device)

        m = AttentionMask.make_causal(s, s, device)
        dropout = nn.Dropout(drop_prob)
        dropout.train(training).cuda()

        with torch.cuda.amp.autocast():
>           r = scaled_dot_product_attention(a, a, a, m)

tests/test_core_attention.py:203:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xformers/components/attention/core.py:328: in scaled_dot_product_attention
    return blocksparse_attention(q, k, v, dropout, block_size)
xformers/components/attention/core.py:289: in blocksparse_attention
    att = blocksparse_attention(q, k, v)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py:1423: in _call_impl
    return forward_call(*input, **kwargs)
xformers/components/attention/blocksparse.py:179: in forward
    sparse_att_mat = self.sparse_dot_sdd(q, k)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:430: in __call__
    c = _matmul.apply(
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:363: in forward
    c = _matmul.fn[mode](a, b, trans_a, trans_b, trans_c, spdims, block, c_lut, c_width, out=out)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:106: in sdd_matmul
    _sdd_kernel[grid](
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/jit.py:106: in launcher
    return self.run(*args, grid=grid, **kwargs)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/autotuner.py:200: in run
    return self.fn.run(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[0.1286, 0.0448, 0.1277,  ..., 0.0628, 0.1654, 0.0160],
          [0.0491, 0.0933, 0.1176,  ..., 0.0101, 0.0..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]]],
       device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   RuntimeError: Triton Error [CUDA]: invalid argument

<string>:43: RuntimeError
_________________________________________________________ test_switch_blocksparse[data_type1-cuda] _________________________________________________________

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[2.2452, 1.2301, 1.1070,  ..., 1.5308, 1.5791, 1.7538],
          [1.2694, 1.1050, 1.4207,  ..., 1.5606, 1.1..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]]],
       device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-f24b6aa9b101a518b6a4a6bddded372e-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', torch.int32), (128, 128, 32, 128, True), (True, True, True, (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), True))

<string>:21: KeyError

During handling of the above exception, another exception occurred:

device = 'cuda', data_type = torch.float32

    @pytest.mark.skipif(
        not _is_blocksparse_available, reason="Blocksparse is not available"
    )
    @pytest.mark.parametrize("device", ["cuda"])
    @pytest.mark.parametrize("data_type", [torch.float16, torch.float32])
    def test_switch_blocksparse(device, data_type):
        b, s, d = 8, 128, 32

        a = torch.rand(b, s, d, device=device, dtype=data_type)

        # Custom causal mask
        m_custom = torch.triu(
            torch.ones(s, s, device=device, dtype=a.dtype) * float("-inf"), diagonal=1
        )
        m_custom_bool = m_custom != float("-inf")
        m_sparse = SparseCS(m_custom_bool, device)
        # Mask with causal flag
        m_att_mask = AttentionMask.make_causal(s, s, device, dtype=a.dtype)

        # Check that a switch to blocksparse is only triggered by causal flag
        with torch.cuda.amp.autocast():
            r_custom = scaled_dot_product_attention(a, a, a, m_custom)
            r_sparse = scaled_dot_product_attention(a, a, a, m_sparse)
>           r_att_mask = scaled_dot_product_attention(a, a, a, m_att_mask)

tests/test_core_attention.py:155:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xformers/components/attention/core.py:328: in scaled_dot_product_attention
    return blocksparse_attention(q, k, v, dropout, block_size)
xformers/components/attention/core.py:289: in blocksparse_attention
    att = blocksparse_attention(q, k, v)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py:1423: in _call_impl
    return forward_call(*input, **kwargs)
xformers/components/attention/blocksparse.py:179: in forward
    sparse_att_mat = self.sparse_dot_sdd(q, k)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:430: in __call__
    c = _matmul.apply(
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:363: in forward
    c = _matmul.fn[mode](a, b, trans_a, trans_b, trans_c, spdims, block, c_lut, c_width, out=out)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:106: in sdd_matmul
    _sdd_kernel[grid](
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/jit.py:106: in launcher
    return self.run(*args, grid=grid, **kwargs)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/autotuner.py:200: in run
    return self.fn.run(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[2.2452, 1.2301, 1.1070,  ..., 1.5308, 1.5791, 1.7538],
          [1.2694, 1.1050, 1.4207,  ..., 1.5606, 1.1..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000]]]],
       device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   RuntimeError: Triton Error [CUDA]: invalid argument

<string>:43: RuntimeError
______________________________________________________ test_switch_blocksparse_dropout[0.3-True-cuda] ______________________________________________________

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0...., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   KeyError: ('2-.-0-.-0-d82511111ad128294e9d31a6ac684238-7929002797455b30efce6e41eddc6b57-3aa563e00c5c695dd945e23b09a86848-f24b6aa9b101a518b6a4a6bddded372e-ff946bd4b3b4a4cbdf8cedc6e1c658e0-5c5e32ff210f3b7f56c98ca29917c25e-06f0df2d61979d629033f4a22eff5198-0dd03b0bd512a184b3512b278d9dfa59-d35ab04ae841e2714a253c523530b071', (torch.float32, torch.float32, torch.float32, 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', 'i32', torch.int32), (128, 128, 32, 128, True), (True, True, True, (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), (False, True), (True, False), (True, False), (True, False), (True, False), (False, True), (True, False), (True, False), True))

<string>:21: KeyError

During handling of the above exception, another exception occurred:

device = 'cuda', training = True, drop_prob = 0.3

    @pytest.mark.skipif(
        not _is_blocksparse_available, reason="Blocksparse is not available"
    )
    @pytest.mark.parametrize("device", ["cuda"])
    @pytest.mark.parametrize("training", [True, False])
    @pytest.mark.parametrize("drop_prob", [0.0, 0.3])
    def test_switch_blocksparse_dropout(device, training, drop_prob):
        b, s, d = 8, 128, 32

        a = torch.rand(b, s, d, device=device)

        m = AttentionMask.make_causal(s, s, device)
        dropout = nn.Dropout(drop_prob)
        dropout.train(training).cuda()

        with torch.cuda.amp.autocast():
>           r = scaled_dot_product_attention(a, a, a, m)

tests/test_core_attention.py:203:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
xformers/components/attention/core.py:328: in scaled_dot_product_attention
    return blocksparse_attention(q, k, v, dropout, block_size)
xformers/components/attention/core.py:289: in blocksparse_attention
    att = blocksparse_attention(q, k, v)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/torch/nn/modules/module.py:1423: in _call_impl
    return forward_call(*input, **kwargs)
xformers/components/attention/blocksparse.py:179: in forward
    sparse_att_mat = self.sparse_dot_sdd(q, k)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:430: in __call__
    c = _matmul.apply(
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:363: in forward
    c = _matmul.fn[mode](a, b, trans_a, trans_b, trans_c, spdims, block, c_lut, c_width, out=out)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/ops/blocksparse/matmul.py:106: in sdd_matmul
    _sdd_kernel[grid](
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/jit.py:106: in launcher
    return self.run(*args, grid=grid, **kwargs)
/home/username/anaconda3/envs/diffusers/lib/python3.9/site-packages/triton/runtime/autotuner.py:200: in run
    return self.fn.run(*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

A = tensor([[[[1.2858e-01, 4.4812e-02, 1.2768e-01,  ..., 6.1930e-02,
           1.4554e-01, 2.7003e-02],
          [1.6371...         [1.7123e-01, 8.7973e-02, 9.8581e-02,  ..., 1.3706e-01,
           1.7386e-01, 8.6823e-02]]]], device='cuda:0')
B = tensor([[[[7.2737e-01, 2.5350e-01, 7.2227e-01,  ..., 3.5033e-01,
           8.2333e-01, 1.5275e-01],
          [9.2610...         [9.6864e-01, 4.9765e-01, 5.5766e-01,  ..., 7.7530e-01,
           9.8348e-01, 4.9114e-01]]]], device='cuda:0')
C = tensor([[[[0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0...., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.],
          [0., 0., 0.,  ..., 0., 0., 0.]]]], device='cuda:0')
stride_za = 4096, stride_ha = 4096, stride_ma = 32, stride_ak = 1, stride_zb = 4096, stride_hb = 4096, stride_bk = 1, stride_nb = 32, stride_zc = 16384
stride_hc = 16384, stride_mc = 128, stride_nc = 1, K = 32, grid_offset = 0, lut = tensor([[0, 0, 0]], device='cuda:0', dtype=torch.int32), TILE_M = 128
TILE_N = 128, TILE_K = 32, BLOCK = 128, EVEN_K = True, grid = [1, 1, 8], num_warps = 4, num_stages = 4, extern_libs = None, stream = 0, warmup = False

>   ???
E   RuntimeError: Triton Error [CUDA]: invalid argument

<string>:43: RuntimeError
================================================================= short test summary info ==================================================================
FAILED tests/test_core_attention.py::test_switch_blocksparse_dropout[0.0-True-cuda] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED tests/test_core_attention.py::test_switch_blocksparse_dropout[0.0-False-cuda] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED tests/test_core_attention.py::test_switch_blocksparse_dims[cuda] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED tests/test_core_attention.py::test_switch_blocksparse_dropout[0.3-False-cuda] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED tests/test_core_attention.py::test_switch_blocksparse[data_type1-cuda] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED tests/test_core_attention.py::test_switch_blocksparse_dropout[0.3-True-cuda] - RuntimeError: Triton Error [CUDA]: invalid argument
============================================================== 6 failed, 11 passed in 35.46s ===============================================================

To Reproduce

Steps to reproduce the behavior:

pytest tests/test_core_attention.py

Environment

python -m torch.utils.collect_env
Collecting environment information...
PyTorch version: 1.14.0.dev20221111
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: 10.0.0-4ubuntu1
CMake version: version 3.24.3
Libc version: glibc-2.31

Python version: 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)  [GCC 10.3.0] (64-bit runtime)
Python platform: Linux-5.15.74.2-microsoft-standard-WSL2-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.7.99
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA GeForce RTX 3060 Laptop GPU
Nvidia driver version: 522.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.5.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.5.0
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

Versions of relevant libraries:
[pip3] clip-anytorch==2.5.0
[pip3] colossalai==0.1.11rc2+torch1.14cu11.8
[pip3] mypy-extensions==0.4.3
[pip3] numpy==1.23.4
[pip3] open-clip-torch==2.7.0
[pip3] pytorch-lightning==1.8.0.post1
[pip3] torch==1.14.0.dev20221111
[pip3] torchaudio==0.14.0.dev20221111
[pip3] torchdiffeq==0.2.2
[pip3] torchmetrics==0.10.2
[pip3] torchsde==0.2.5
[pip3] torchvision==0.15.0.dev20221111
[conda] blas                      1.0                         mkl
[conda] clip-anytorch             2.5.0                    pypi_0    pypi
[conda] colossalai                0.1.11rc2+torch1.14cu11.8          pypi_0    pypi
[conda] cudatoolkit               11.7.0              hd8887f6_10    nvidia
[conda] libblas                   3.9.0            16_linux64_mkl    conda-forge
[conda] libcblas                  3.9.0            16_linux64_mkl    conda-forge
[conda] liblapack                 3.9.0            16_linux64_mkl    conda-forge
[conda] liblapacke                3.9.0            16_linux64_mkl    conda-forge
[conda] mkl                       2022.1.0           hc2b9512_224
[conda] numpy                     1.23.4           py39h3d75532_1    conda-forge
[conda] open-clip-torch           2.7.0                    pypi_0    pypi
[conda] pytorch                   1.14.0.dev20221111 py3.9_cuda11.7_cudnn8.5.0_0    pytorch-nightly
[conda] pytorch-cuda              11.7                 h67b0de4_0    pytorch-nightly
[conda] pytorch-lightning         1.8.0.post1              pypi_0    pypi
[conda] pytorch-mutex             1.0                        cuda    pytorch
[conda] torchaudio                0.14.0.dev20221111      py39_cu117    pytorch-nightly
[conda] torchdiffeq               0.2.2              pyhd8ed1ab_0    conda-forge
[conda] torchmetrics              0.10.2                   pypi_0    pypi
[conda] torchsde                  0.2.5                    pypi_0    pypi
[conda] torchvision               0.15.0.dev20221111      py39_cu117    pytorch-nightly
danthe3rd commented 2 years ago

I think I've seen this bug before... it may be due to a wrong Triton version. Can you make sure you have the version pinned here installed? https://github.com/facebookresearch/xformers/blob/main/requirements-test.txt#L30
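
If it helps, here is a quick sketch to compare the installed version against that pin; it assumes an xformers checkout in the current directory and a triton==... line in requirements-test.txt, so adjust as needed:

# Sketch: check the installed Triton distribution version against the pin
# in xformers' requirements-test.txt (path and pin format are assumptions).
import importlib.metadata

installed = importlib.metadata.version("triton")
pinned = None
with open("requirements-test.txt") as f:
    for line in f:
        line = line.strip()
        if line.startswith("triton"):
            pinned = line.split("==")[-1]
print(f"installed={installed} pinned={pinned} match={installed == pinned}")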

Thomas-MMJ commented 2 years ago

I've confirmed that the Triton version installed is the one specified there; I also reinstalled it to be sure.

pip show triton
Name: triton
Version: 2.0.0.dev20221105

conda list triton
# packages in environment at /home/username/anaconda3/envs/diffusers:
#
# Name                    Version                   Build  Channel
triton                    2.0.0.dev20221105          pypi_0    pypi

ipython
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import triton

In [2]: triton.__version__
Out[2]: '2.0.0'
danthe3rd commented 2 years ago

@blefaudeux do you have any idea what's going on here? This is on RTX 3060, might be an issue of insufficient shared memory? (we have a "CUDA: invalid argument")

blefaudeux commented 2 years ago

@ptillet, any idea? It looks like the code generated for this GPU is invalid somehow. Thanks for the heads up @danthe3rd, sorry about that.

Thomas-MMJ commented 2 years ago

If I run the Triton unit tests with the pip-installed version, I get similar failures in test_matmul. I will try building it from source; maybe it has to do with the pip version.

Edit - these failures occur only when pytest-randomly reorders the tests.

FAILED test/unit/operators/test_matmul.py::test_op[256-128-32-1-8-3-1024-1024-1024-False-True-float32] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED test/unit/operators/test_matmul.py::test_op[128-128-32-1-4-4-384-128-640-True-False-float32] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED test/unit/operators/test_matmul.py::test_op[128-128-32-1-4-4-384-128-640-False-False-float32] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED test/unit/operators/test_matmul.py::test_op[256-128-32-1-8-4-1024-1024-1024-False-False-float32] - RuntimeError: Triton Error [CUDA]: invalid argument

Edit - I get the same unit test failures with the Triton nightly build installed via pip install -U --pre triton (triton-2.0.0.dev20221120-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl).

So it looks like I should file this with Triton instead?
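
(To test the ordering hypothesis without uninstalling anything, the plugin can also be disabled for a single run, using the test path from the failures above, e.g.:

pytest -p no:randomly test/unit/operators/test_matmul.py)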

Thomas-MMJ commented 2 years ago

@blefaudeux do you have any idea what's going on here? This is on RTX 3060, might be an issue of insufficient shared memory? (we have a "CUDA: invalid argument")

Note that memory usage never goes above about 2 GB of VRAM, so that is unlikely to be the case.

danthe3rd commented 2 years ago

will try building it from source, maybe it has to do with the pip version

I don't think this is related, as these kernels are built at run-time.

@blefaudeux do you have any idea what's going on here? This is on RTX 3060, might be an issue of insufficient shared memory? (we have a "CUDA: invalid argument")

Note that memory usage never goes above about 2 GB of VRAM, so that is unlikely to be the case.

I was not referring to GPU global memory, but to GPU shared memory - a sort of very fast on-chip cache that kernels can use to store data (like the matrix operands of a GEMM).
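
For reference, a rough sketch to see what this card offers - the per-block limits below are hard-coded from the CUDA programming guide rather than queried from the driver, so treat them as approximate:

# Rough sketch: report the GPU's compute capability and the documented
# opt-in shared-memory limit per block (values are hard-coded assumptions
# from the CUDA programming guide, not queried from the driver).
import torch

props = torch.cuda.get_device_properties(0)
cc = (props.major, props.minor)
smem_kib = {(7, 0): 96, (7, 5): 64, (8, 0): 163, (8, 6): 99}.get(cc, "unknown")
print(f"{props.name}: sm_{props.major}{props.minor}, ~{smem_kib} KiB shared memory per block")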

blefaudeux commented 2 years ago

Normally Triton should account for this at run time when JIT-compiling the kernel; the JIT step can even get a little long if there is a lot of register spilling and the compiler has to find a configuration that fits. It looks (I could be wrong) like a case of Triton producing an instruction that this card doesn't support; it could also be a bad (unfortunate) combination of nvcc/CUDA versions, I'm not sure.

Thomas-MMJ commented 2 years ago

I just uninstalled pytest-randomly and the Triton matmul unit tests pass, but the test_core_attention.py tests are still failing.

blefaudeux commented 2 years ago

Alright, I got some explanations from @ptillet, and you're right @danthe3rd: the kernel needs too much shared memory. We can use a smaller block size and/or lower num_stages when launching the kernel.
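
For what it's worth, a back-of-the-envelope estimate using the launch parameters from the traceback above; it assumes the kernel stages one A tile and one B tile in shared memory per pipeline stage, which may not match the generated code exactly:

# Rough estimate of shared memory needed by the failing fp32 _sdd_kernel
# launch, using TILE_M/TILE_N/TILE_K/num_stages from the traceback
# (assumption: one A tile and one B tile staged per pipeline stage).
TILE_M, TILE_N, TILE_K = 128, 128, 32
num_stages = 4
bytes_per_elem = 4  # torch.float32

per_stage = (TILE_M * TILE_K + TILE_K * TILE_N) * bytes_per_elem  # 32 KiB
print(f"~{per_stage * num_stages / 1024:.0f} KiB")                # ~128 KiB

That would put the launch around 128 KiB, above the ~99 KiB an sm_86 card like the RTX 3060 can give a single block (but under the ~163 KiB of an A100), which would also be consistent with only the float32 variants failing. Dropping num_stages to 2 or halving the tile width should bring it back under the limit.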