QwenLM / Qwen

The official repository of Qwen (通义千问), the chat and pretrained large language models proposed by Alibaba Cloud.
Apache License 2.0

[BUG] CUDA Error: invalid device function /tmp/pip-req-build-5rlg4jgm/ln_fwd_kernels.cuh 236 #1198

Closed. taoqinghua closed this issue 5 months ago.

taoqinghua commented 5 months ago

Is there an existing issue / discussion for this?

Is there an existing answer for this in the FAQ?

Current Behavior

Running bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/test_docker/Qwen-7B-Chat-Int4/ -d /data/shared/test_docker/chat.json inside the container fails with CUDA Error: invalid device function /tmp/pip-req-build-5rlg4jgm/ln_fwd_kernels.cuh 236. The CUDA version is 11.7; switching to CUDA 11.8, 12.1, or 12.4 produces the same error. The file /tmp/pip-req-build-5rlg4jgm/ln_fwd_kernels.cuh cannot be found inside the container, so I don't know what the actual cause is. Can anyone offer some guidance?

Expected Behavior

What is the actual cause? I have switched between several CUDA versions and the error is always the same. Any help would be appreciated.

Steps To Reproduce

1. Start the container: docker run -itd -v /***:/data/shared/test_docker --name test_qwen --gpus all --shm-size 12G qwenllm/qwen /bin/bash
2. Enter the container: docker exec -it test_qwen04 /bin/bash
3. Run: bash finetune/finetune_qlora_single_gpu.sh -m /data/shared/test_docker/Qwen-7B-Chat-Int4/ -d /data/shared/test_docker/chat.json

Environment

- OS: Ubuntu
- Python:
- Transformers:
- PyTorch: 2.0.1
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`): 11.7
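
The Python and Transformers fields above are empty; a minimal sketch for filling them in from inside the container (assuming both packages are importable there):

    python -c "import sys, torch, transformers; print(sys.version); print(torch.__version__); print(transformers.__version__)"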

Anything else?

No response

jklj077 commented 5 months ago

If you are using the provided Docker image with the tag qwenllm/qwen(:latest), it is based on CUDA 11.7 and bundles the layer_norm module from FlashAttention v2, which is where that invalid device function (cudaOccupancyMaxActiveBlocksPerMultiprocessor, a CUDA runtime API) is called.

It is likely that your NVIDIA driver is too old to support CUDA 11.7 (and later versions). Please run nvidia-smi and provide the result.
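
A minimal sketch of how to collect that information from inside the container (nvidia-smi is the command requested above; the Python one-liner is the same check listed in the Environment section):

    # Driver version and the highest CUDA version the installed driver supports
    nvidia-smi

    # CUDA version the bundled PyTorch was built against, and whether a GPU is visible
    python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"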

taoqinghua commented 5 months ago

> If you are using the provided Docker image with the tag qwenllm/qwen(:latest), it is based on CUDA 11.7 and bundles the layer_norm module from FlashAttention v2, which is where that invalid device function (cudaOccupancyMaxActiveBlocksPerMultiprocessor, a CUDA runtime API) is called.
>
> It is likely that your NVIDIA driver is too old to support CUDA 11.7 (and later versions). Please run nvidia-smi and provide the result.

The nvidia-smi output is below. The driver looks new enough to support CUDA 11.7, so could something else be the cause?

Wed Apr 10 06:16:11 2024
NVIDIA-SMI 545.23.06    Driver Version: 545.23.06    CUDA Version: 12.3

GPU  Name                  Bus-Id            Temp  Perf  Pwr:Usage/Cap  Memory-Usage     GPU-Util  Compute M.
0    Tesla P100-PCIE-16GB  00000000:44:00.0  27C   P0    29W / 250W     0MiB / 16384MiB  0%        Default
1    Tesla P100-PCIE-16GB  00000000:87:00.0  27C   P0    28W / 250W     0MiB / 16384MiB  0%        Default
2    Tesla P100-PCIE-16GB  00000000:C1:00.0  26C   P0    30W / 250W     0MiB / 16384MiB  0%        Default
3    Tesla P100-PCIE-16GB  00000000:C4:00.0  26C   P0    29W / 250W     0MiB / 16384MiB  0%        Default
(All GPUs: Persistence-M Off, Disp.A Off, Volatile Uncorr. ECC 0, MIG M. N/A)

Processes: no running processes found.

jklj077 commented 5 months ago

Unfortunately, FlashAttention v2 does not support the P100 (nor the V100). You may need to uninstall the related packages in the image (pip uninstall flash_attn dropout_layer_norm) or build the image from scratch with the environment variable BUNDLE_FLASH_ATTENTION set to false.
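
A minimal sketch of the uninstall workaround, run inside the container; the compute-capability check is only an assumed sanity check (FlashAttention 2 generally targets compute capability 8.0 or newer, while the P100 reports 6.0):

    # Remove the bundled FlashAttention kernels so training falls back to the
    # plain PyTorch attention / layer-norm code paths
    pip uninstall -y flash_attn dropout_layer_norm

    # Sanity check: print the GPU's compute capability (a P100 prints (6, 0))
    python -c "import torch; print(torch.cuda.get_device_capability(0))"

How BUNDLE_FLASH_ATTENTION is passed when rebuilding the image depends on the repository's Docker build script and is not shown here.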

taoqinghua commented 5 months ago

> Unfortunately, FlashAttention v2 does not support the P100 (nor the V100). You may need to uninstall the related packages in the image (pip uninstall flash_attn dropout_layer_norm) or build the image from scratch with the environment variable BUNDLE_FLASH_ATTENTION set to false.

Thank you.