Thomas-MMJ opened this issue 2 years ago
I think I've seen this bug before... it may be due to a wrong Triton version. Can you make sure you have this one installed? https://github.com/facebookresearch/xformers/blob/main/requirements-test.txt#L30
I've confirmed that the specified Triton version is the one installed. I also reinstalled it to be sure.
pip show triton
Name: triton
Version: 2.0.0.dev20221105
conda list triton
# packages in environment at /home/username/anaconda3/envs/diffusers:
#
# Name Version Build Channel
triton 2.0.0.dev20221105 pypi_0 pypi
ipython
Python 3.9.13 | packaged by conda-forge | (main, May 27 2022, 16:56:21)
Type 'copyright', 'credits' or 'license' for more information
IPython 8.6.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import triton
In [2]: triton.__version__
Out[2]: '2.0.0'
@blefaudeux do you have any idea what's going on here? This is on RTX 3060, might be an issue of insufficient shared memory? (we have a "CUDA: invalid argument")
@ptillet, any idea? It looks like the code generated for this GPU is somehow invalid. Thanks for the heads-up @danthe3rd, sorry about that.
If I run the Triton unit tests using the pip-installed version, I get similar failures in test_matmul; will try building it from source, maybe it has to do with the pip version.
Edit - these failures occur only when pytest-randomly reorders the tests.
FAILED test/unit/operators/test_matmul.py::test_op[256-128-32-1-8-3-1024-1024-1024-False-True-float32] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED test/unit/operators/test_matmul.py::test_op[128-128-32-1-4-4-384-128-640-True-False-float32] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED test/unit/operators/test_matmul.py::test_op[128-128-32-1-4-4-384-128-640-False-False-float32] - RuntimeError: Triton Error [CUDA]: invalid argument
FAILED test/unit/operators/test_matmul.py::test_op[256-128-32-1-8-4-1024-1024-1024-False-False-float32] - RuntimeError: Triton Error [CUDA]: invalid argument
Edit - I get the same unit test failures running the Triton nightly build installed via pip install -U --pre triton, version triton-2.0.0.dev20221120-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
So it looks like I should file this with Triton instead?
@blefaudeux do you have any idea what's going on here? This is on RTX 3060, might be an issue of insufficient shared memory? (we have a "CUDA: invalid argument")
Note that memory usage never goes above about 2 GB of VRAM so unlikely that is the case.
will try building it from source, maybe it has to do with the pip version
I don't think this is related, as these kernels are built at run-time.
I was not referring to GPU global memory, but to GPU shared memory - that's a sort of very fast cache that kernels can use to store data (like the matrix operands for a GEMM).
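To make the distinction concrete, here is a minimal sketch (assuming PyTorch is available) contrasting the two. The per-SM shared-memory figures are the documented maxima from the CUDA programming guide, not values queried from the driver, so treat them as illustrative.

```python
import torch

# Rough per-SM shared-memory budget by compute capability (documented maxima
# from the CUDA programming guide; illustrative, not queried from the driver).
SHARED_MEM_PER_SM_KB = {
    (7, 0): 96,   # V100
    (7, 5): 64,   # Turing
    (8, 0): 164,  # A100
    (8, 6): 100,  # GA10x cards such as the RTX 3060
}

cc = torch.cuda.get_device_capability(0)
vram_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
print(f"compute capability {cc}: ~{SHARED_MEM_PER_SM_KB.get(cc, '?')} KB "
      f"of shared memory per SM vs. {vram_gib:.1f} GiB of global VRAM")
```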
Normally Triton should account for this at run time when JIT-compiling the kernel; the JIT step can even get a little long if there's a lot of spilling and the compiler has to find a solution which fits. It looks (could be wrong) like a case of Triton producing an instruction that this card doesn't support; it could also be a bad (unfortunate) combination of nvcc/CUDA, I'm not sure.
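To put numbers on "a solution which fits", here is a back-of-the-envelope sketch (a rule of thumb, not Triton's exact allocator) of the operand tiles a pipelined GEMM keeps in shared memory. It reads the failing test ids above as BLOCK_M-BLOCK_N-BLOCK_K-SPLIT_K-num_warps-num_stages-M-N-K-..., which is an assumption about the parametrization order.

```python
def estimated_gemm_smem_bytes(block_m, block_n, block_k, num_stages, dtype_bytes=4):
    # Rule of thumb: each pipeline stage holds one BLOCK_M x BLOCK_K tile of A
    # and one BLOCK_K x BLOCK_N tile of B in shared memory.
    return num_stages * (block_m * block_k + block_k * block_n) * dtype_bytes

# 256x128x32 blocks, 3 stages, fp32 -> ~147 KB: over the ~100 KB per-SM budget
# of an sm_86 card like the RTX 3060, but within the ~164 KB of an A100, which
# would explain these tests passing on A100-class GPUs and failing here.
print(estimated_gemm_smem_bytes(256, 128, 32, num_stages=3))  # 147456

# Smaller blocks or a shallower pipeline bring it back under budget:
print(estimated_gemm_smem_bytes(128, 128, 32, num_stages=3))  # 98304
print(estimated_gemm_smem_bytes(256, 128, 32, num_stages=2))  # 98304
```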
I just uninstalled pytest-randomly and the Triton matmul unit tests pass, but the test_core_attention.py tests are still failing.
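(For reference, and assuming the plugin's standard disable mechanism applies here, pytest-randomly can also be switched off for a single run with pytest -p no:randomly instead of uninstalling it.)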
Alright, I got some explanations from @ptillet and you're right @danthe3rd: the kernel needs too much shared memory. We can use a smaller block size and/or lower num_stages when launching the kernel.
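For illustration only, a minimal sketch of where those knobs go when launching a @triton.jit kernel. A trivial elementwise kernel is used just to show the launch syntax; block size and num_stages only really affect shared-memory use in tiled, pipelined kernels such as the GEMM / blocksparse attention ones discussed above.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

n = 4096
x = torch.randn(n, device="cuda")
y = torch.randn(n, device="cuda")
out = torch.empty_like(x)

# Block size, num_warps and num_stages are all launch-time parameters; for a
# shared-memory-hungry kernel, lowering the block size or num_stages here is
# the "smaller block size and/or lower num_stages" fix mentioned above.
grid = (triton.cdiv(n, 1024),)
add_kernel[grid](x, y, out, n, BLOCK=1024, num_warps=4, num_stages=2)
assert torch.allclose(out, x + y)
```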
🐛 Bug
In test_core_attention the tests test_switch_blocksparse_dropout[0.0-True-cuda], test_switch_blocksparse_dropout[0.0-False-cuda], test_switch_blocksparse_dims[cuda], test_switch_blocksparse_dropout[0.3-False-cuda], test_switch_blocksparse[data_type1-cuda] and test_switch_blocksparse_dropout[0.3-True-cuda] all fail.
Here is the output:
Command
To Reproduce
Steps to reproduce the behavior:
pytest tests/test_core_attention.py
Environment