HazyResearch / H3

Language Modeling with the H3 State Space Model
Apache License 2.0

ERROR: CUDA RT call cudaFuncSetAttribute. Failed with invalid device function (98). #27

Closed: wang-zerui closed this issue 1 year ago

wang-zerui commented 1 year ago

I ran into this error when training the model with use_fast_fftconv.

ERROR: CUDA RT call "cudaFuncSetAttribute(kernel, cudaFuncAttributeMaxDynamicSharedMemorySize, shared_memory_size )" in line 809 of file /***********/H3/csrc/fftconv/fftconv_cuda.cu failed with invalid device function (98).

The error doesn't actually stop the training process, but the result of the conv op is wrong. I also ran the test suite with PYTHONPATH=$(pwd) pytest tests/; the fftconv tests fail and the run is eventually killed:

PYTHONPATH=$(pwd) pytest tests/
========================================================================================================================= test session starts =========================================================================================================================
platform linux -- Python 3.8.16, pytest-7.4.0, pluggy-1.2.0
rootdir: /mnt/cache/wangzerui/H3-origin/H3
plugins: anyio-3.6.2
collected 4160 items                                                                                                                                                                                                                                                  

tests/ops/test_fftconv.py FFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFFFFFFFFFssssFFFFFFFFFFFFF [  5%]
FFFFKilled

I installed the fftconv extension by running:

cd csrc/cauchy && pip install . && cd ../../ \
    && cd csrc/fftconv && pip install . && cd ../../ \
    && cd .. && rm -rf csrc

tridao commented 1 year ago

I haven't seen this before, but googling suggests it's caused by a mismatch between the CUDA version the binary was compiled against and the CUDA version supported on the device. One possible fix is to uninstall the fftconv extension, then reinstall it so that it is built with the right CUDA version.
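To check for that kind of mismatch, it can help to print the versions involved before reinstalling. The snippet below is only a rough sketch and not part of the H3 repo: the helper names `parse_nvcc_release` and `collect_cuda_versions` are made up for illustration, and it assumes the extension is built with the `nvcc` found on PATH (the build may instead use `CUDA_HOME`).

```python
import re
import shutil
import subprocess
from typing import Dict, Optional


def parse_nvcc_release(text: str) -> Optional[str]:
    """Extract 'X.Y' from the 'release X.Y' field of `nvcc --version` output."""
    match = re.search(r"release\s+(\d+\.\d+)", text)
    return match.group(1) if match else None


def collect_cuda_versions() -> Dict[str, Optional[str]]:
    """Gather versions relevant to a build/runtime mismatch; None means 'not found'."""
    info: Dict[str, Optional[str]] = {
        "nvcc": None,               # toolkit on PATH (assumed to be the one used for the build)
        "torch_cuda": None,         # CUDA version PyTorch itself was built with
        "device_capability": None,  # compute capability of GPU 0, e.g. "8.0"
    }

    nvcc = shutil.which("nvcc")
    if nvcc is not None:
        result = subprocess.run([nvcc, "--version"], capture_output=True, text=True)
        info["nvcc"] = parse_nvcc_release(result.stdout)

    try:
        import torch
    except ImportError:
        # Without PyTorch we can only report the toolkit version.
        return info

    info["torch_cuda"] = torch.version.cuda
    if torch.cuda.is_available():
        major, minor = torch.cuda.get_device_capability(0)
        info["device_capability"] = f"{major}.{minor}"
    return info
```

Printing the result of `collect_cuda_versions()` on the training node and comparing it with the "CUDA Version" shown by nvidia-smi should show whether the toolkit used for the build is newer than what the driver supports.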

wang-zerui commented 1 year ago

Solved after I switched to another cluster with a newer driver.

Current output of nvidia-smi:

Every 1.0s: nvidia-smi                                                                                    Mon Aug  7 19:15:39 2023

Mon Aug  7 19:15:39 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+

Previous output:

Every 1.0s: nvidia-smi                                                                                    Mon Aug  7 19:19:25 2023

Mon Aug  7 19:19:25 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |
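
The two headers above can also be compared programmatically. This is a minimal sketch under stated assumptions: `parse_smi_header` and `driver_too_old` are hypothetical helper names, and the toolkit version `(11, 7)` in the usage lines is an example value, since the thread does not say which CUDA toolkit the extension was built with. The comparison is only a heuristic, because CUDA's minor-version compatibility can let some newer 11.x builds run on older 11.x drivers.

```python
import re
from typing import Optional, Tuple

# Matches the "Driver Version: ...   CUDA Version: X.Y" part of the nvidia-smi header.
HEADER_RE = re.compile(r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*(\d+)\.(\d+)")


def parse_smi_header(text: str) -> Optional[Tuple[str, Tuple[int, int]]]:
    """Return (driver_version, (cuda_major, cuda_minor)), or None if no header is found."""
    match = HEADER_RE.search(text)
    if match is None:
        return None
    return match.group(1), (int(match.group(2)), int(match.group(3)))


def driver_too_old(build_cuda: Tuple[int, int], smi_text: str) -> Optional[bool]:
    """True if the build toolkit is newer than the highest CUDA version the driver reports."""
    parsed = parse_smi_header(smi_text)
    if parsed is None:
        return None
    _, max_cuda = parsed
    return build_cuda > max_cuda


old = "| NVIDIA-SMI 470.129.06   Driver Version: 470.129.06   CUDA Version: 11.4     |"
new = "| NVIDIA-SMI 525.60.13    Driver Version: 525.60.13    CUDA Version: 12.0     |"

# (11, 7) is an assumed build toolkit version, used only for illustration.
print(driver_too_old((11, 7), old))
print(driver_too_old((11, 7), new))
```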