THUDM / CodeGeeX2

CodeGeeX2: A More Powerful Multilingual Code Generation Model
https://codegeex.cn
Apache License 2.0

RuntimeError: CUDA error: device-side assert triggered #42

Open Yazooliu opened 1 year ago

Yazooliu commented 1 year ago

Hi Team,

I ran into the following error while running run_demo.py.

OS environment: CentOS
Python version: Python 3.10.10 (main, Mar 21 2023, 18:45:11) [GCC 11.2.0] on linux

Startup command: nohup python run_demo.py

Error info in nohup.out:

Traceback (most recent call last):
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/routes.py", line 442, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/blocks.py", line 1392, in process_api
    result = await self.call_function(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/blocks.py", line 1097, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/anyio/to_thread.py", line 33, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 877, in run_sync_in_worker_thread
    return await future
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/anyio/_backends/_asyncio.py", line 807, in run
    result = context.run(func, *args)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/gradio/utils.py", line 703, in wrapper
    response = f(*args, **kwargs)
  File "/home/llm/app/CodeGeeX2-6B/source/CodeGeeX2/run_demo_CodeGeeX2.py", line 117, in predict
    set_random_seed(seed)
  File "/home/llm/app/CodeGeeX2-6B/source/CodeGeeX2/run_demo_CodeGeeX2.py", line 104, in set_random_seed
    torch.manual_seed(seed)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/random.py", line 40, in manual_seed
    torch.cuda.manual_seed_all(seed)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/cuda/random.py", line 113, in manual_seed_all
    _lazy_call(cb, seed_all=True)
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/cuda/__init__.py", line 183, in _lazy_call
    callable()
  File "/home/llm/miniconda3/envs/CodeGeeX2_env/lib/python3.10/site-packages/torch/cuda/random.py", line 111, in cb
    default_generator.manual_seed(seed)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

I tried the following workaround: adding CUDA_LAUNCH_BLOCKING=1 in run_demo.py.
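Roughly, the change I have in mind looks like the sketch below (the exact placement in run_demo.py and the body of set_random_seed are my assumptions based on the traceback; CUDA_LAUNCH_BLOCKING only takes effect if it is set before torch initializes CUDA):

```python
# Sketch of the change, assuming run_demo.py seeds torch as shown in the traceback.
# CUDA_LAUNCH_BLOCKING=1 makes CUDA kernels run synchronously, so the failing
# kernel is reported at its real call site instead of at a later API call.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # must be set before CUDA is initialized

import torch


def set_random_seed(seed):
    # torch.manual_seed also seeds all visible CUDA devices (via
    # torch.cuda.manual_seed_all), which is where the device-side assert surfaced.
    torch.manual_seed(seed)
```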

I am not sure about the root cause, or whether this workaround is correct. If it is, could I open a PR?

Best regards, Yazhou

Stanislas0 commented 1 year ago

Which torch version do you use? And the CUDA version?

Yazooliu commented 1 year ago

Which torch version do you use? And the CUDA version?

Here is the detailed pip list output:

Package Version
accelerate 0.21.0
aiofiles 23.1.0
aiohttp 3.8.5
aiosignal 1.3.1
altair 5.0.1
annotated-types 0.5.0
anyio 3.7.1
asttokens 2.2.1
async-timeout 4.0.2
attrs 23.1.0
backcall 0.2.0
certifi 2023.7.22
charset-normalizer 3.2.0
click 8.1.6
cmake 3.27.0
contourpy 1.1.0
cpm-kernels 1.0.11
cycler 0.11.0
decorator 5.1.1
exceptiongroup 1.1.2
executing 1.2.0
fastapi 0.100.0
ffmpy 0.3.1
filelock 3.12.2
fonttools 4.41.1
frozenlist 1.4.0
fsspec 2023.6.0
gradio 3.39.0
gradio_client 0.3.0
h11 0.14.0
httpcore 0.17.3
httpx 0.24.1
huggingface-hub 0.16.4
idna 3.4
ipython 8.14.0
jedi 0.18.2
Jinja2 3.1.2
jsonschema 4.18.4
jsonschema-specifications 2023.7.1
kiwisolver 1.4.4
latex2mathml 3.76.0
linkify-it-py 2.0.2
lit 16.0.6
Markdown 3.4.4
markdown-it-py 2.2.0
MarkupSafe 2.1.3
matplotlib 3.7.2
matplotlib-inline 0.1.6
mdit-py-plugins 0.3.3
mdtex2html 1.2.0
mdurl 0.1.2
mpmath 1.3.0
multidict 6.0.4
networkx 3.1
numpy 1.25.1
nvidia-cublas-cu11 11.10.3.66
nvidia-cuda-cupti-cu11 11.7.101
nvidia-cuda-nvrtc-cu11 11.7.99
nvidia-cuda-runtime-cu11 11.7.99
nvidia-cudnn-cu11 8.5.0.96
nvidia-cufft-cu11 10.9.0.58
nvidia-curand-cu11 10.2.10.91
nvidia-cusolver-cu11 11.4.0.1
nvidia-cusparse-cu11 11.7.4.91
nvidia-nccl-cu11 2.14.3
nvidia-nvtx-cu11 11.7.91
orjson 3.9.2
packaging 23.1
pandas 2.0.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
Pillow 10.0.0
pip 23.1.2
prompt-toolkit 3.0.39
protobuf 4.23.4
psutil 5.9.5
ptyprocess 0.7.0
pure-eval 0.2.2
pydantic 1.10.9
pydantic_core 2.3.0
pydub 0.25.1
Pygments 2.15.1
pyparsing 3.0.9
python-dateutil 2.8.2
python-multipart 0.0.6
pytz 2023.3
PyYAML 6.0.1
referencing 0.30.0
regex 2023.6.3
requests 2.31.0
rpds-py 0.9.2
safetensors 0.3.1
semantic-version 2.10.0
sentencepiece 0.1.99
setuptools 67.8.0
six 1.16.0
sniffio 1.3.0
sse-starlette 1.6.1
stack-data 0.6.2
starlette 0.27.0
sympy 1.12
tokenizers 0.13.3
toolz 0.12.0
torch 2.0.1
tqdm 4.65.0
traitlets 5.9.0
transformers 4.30.2
triton 2.0.0
typing_extensions 4.7.1
tzdata 2023.3
uc-micro-py 1.0.2
urllib3 2.0.4
uvicorn 0.23.1
wcwidth 0.2.6
websockets 11.0.3
wheel 0.38.4
yarl 1.9.2

NVIDIA-SMI: 510.108.03
Driver Version: 510.108.03
CUDA Version: 11.6
GPU/Card: V100 32G
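For reference, a quick way to double-check which CUDA runtime this torch 2.0.1 wheel was built against and whether it actually sees the V100 (a minimal sketch, not output from my machine):

```python
import torch

print(torch.__version__)          # 2.0.1 in this environment
print(torch.version.cuda)         # CUDA runtime bundled with the wheel (11.7 or 11.8 for the cu11 builds)
print(torch.cuda.is_available())  # expected to be True on the V100 node
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. the Tesla V100 32GB entry
```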

Yazooliu commented 1 year ago

Which torch version do you use? And the CUDA version?

Could I open a PR to fix this issue? Thanks.

Best regards, Yazhou

Yazooliu commented 1 year ago

A PR has been opened to fix this issue; please review and verify.

Yazooliu commented 8 months ago

Any response on my PR?