eosphoros-ai / DB-GPT

AI Native Data App Development framework with AWEL(Agentic Workflow Expression Language) and Agents
http://docs.dbgpt.cn
MIT License

Question: RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable #189

Open alex198208 opened 1 year ago

alex198208 commented 1 year ago

```
python pilot/server/llmserver.py

playsound is relying on another python subprocess. Please use pip install pygobject if you want playsound to run more efficiently.
localhost:19530 None None dbgpt False 127.0.0.1
2023-06-12 16:40:21,178 INFO sqlalchemy.engine.Engine SELECT DATABASE()
2023-06-12 16:40:21,178 INFO sqlalchemy.engine.Engine [raw sql] {}
2023-06-12 16:40:21,179 INFO sqlalchemy.engine.Engine SELECT @@sql_mode
2023-06-12 16:40:21,179 INFO sqlalchemy.engine.Engine [raw sql] {}
2023-06-12 16:40:21,180 INFO sqlalchemy.engine.Engine SELECT @@lower_case_table_names
2023-06-12 16:40:21,180 INFO sqlalchemy.engine.Engine [raw sql] {}
/data/DB-GPT/models/vicuna-13b cuda
Loading checkpoint shards: 100%|██████████| 3/3 [00:12<00:00, 4.33s/it]

Traceback (most recent call last):

/data/DB-GPT/pilot/server/llmserver.py:163 in <module>
  160   model_path = LLM_MODEL_CONFIG[CFG.LLM_MODEL]
  161   print(model_path, DEVICE)
  162
❱ 163   worker = ModelWorker(
  164       model_path=model_path, model_name=CFG.LLM_MODEL, device=DEVICE, num_gpus=1
  165   )
  166

/data/DB-GPT/pilot/server/llmserver.py:37 in __init__
   34       self.device = device
   35
   36       self.ml = ModelLoader(model_path=model_path)
❱  37       self.model, self.tokenizer = self.ml.loader(
   38           num_gpus, load_8bit=ISLOAD_8BIT, debug=ISDEBUG
   39       )
   40

/data/DB-GPT/pilot/model/loader.py:109 in loader
  106               "8-bit quantization is not supported for multi-gpu inference"
  107           )
  108       else:
❱ 109           compress_module(model, self.device)
  110
  111   if (
  112       (self.device == "cuda" and num_gpus == 1 and not cpu_offloading)

/data/DB-GPT/pilot/model/compression.py:48 in compress_module
   45           setattr(
   46               module,
   47               attr_str,
❱  48               CLinear(target_attr.weight, target_attr.bias, target_device),
   49           )
   50   for name, child in module.named_children():
   51       compress_module(child, target_device)

/data/DB-GPT/pilot/model/compression.py:33 in __init__
   30   def __init__(self, weight, bias, device):
   31       super().__init__()
   32
❱  33       self.weight = compress(weight.data.to(device), default_compression_config)
   34       self.bias = bias
   35
   36   def forward(self, input: Tensor) -> Tensor:

RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Has anyone else run into the error above?
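This error usually means PyTorch cannot create a context on the GPU at all; a common cause is the card being held by another process, possibly while it is in an exclusive compute mode. As a first step, a minimal sanity check run in the same environment (independent of DB-GPT, assuming only that PyTorch is installed) shows whether the device can be used at all:

```python
# Minimal CUDA sanity check, run in the same Python environment as llmserver.py.
# If this already fails, the problem is in the CUDA/PyTorch setup, not in DB-GPT.
import torch

print("CUDA available:", torch.cuda.is_available())

if torch.cuda.is_available():
    print("Device 0:", torch.cuda.get_device_name(0))
    # Allocating a tensor forces context creation and should reproduce the
    # error if the device really is busy or unavailable.
    x = torch.randn(4, 4, device="cuda:0")
    print("Tensor on GPU OK:", x.sum().item())
```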

alex198208 commented 1 year ago

[screenshot]

csunny commented 1 year ago

Which version of torch are you using? Maybe you can try upgrading your torch version.
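For reference, something like the following (run in the environment used to start `llmserver.py`) prints the installed torch version and the CUDA runtime it was built against, so it can be compared with the driver on the machine:

```python
# Print torch version info to check for a mismatch between the wheel's
# CUDA build and the installed NVIDIA driver.
import torch

print("torch version :", torch.__version__)
print("built for CUDA:", torch.version.cuda)
print("cuDNN version :", torch.backends.cudnn.version())
```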

alex198208 commented 1 year ago
[screenshot]

```
(base) [root@gpu ~]# nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Mon_Oct_24_19:12:58_PDT_2022
Cuda compilation tools, release 12.0, V12.0.76
Build cuda_12.0.r12.0/compiler.31968024_0
```
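Note that `nvcc --version` only reports the installed CUDA toolkit (12.0 here); what matters at runtime is the driver version and the GPU's compute mode. A sketch for checking both from Python (it simply shells out to `nvidia-smi`, assumed to be on PATH):

```python
# Query driver version and compute mode per GPU. A T4 in "Exclusive_Process"
# mode that is already owned by another process is a common cause of
# "CUDA-capable device(s) is/are busy or unavailable".
import subprocess

result = subprocess.run(
    ["nvidia-smi", "--query-gpu=index,name,driver_version,compute_mode", "--format=csv"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```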

alex198208 commented 1 year ago

`./deviceQuery` passes, but `./bandwidthTest` fails:

```
(base) [root@gpu demo_suite]# ./bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...

 Device 0: Tesla T4
 Quick Mode

CUDA error at /dvs/p4/build/sw/rel/gpgpu/toolkit/r12.0/demo_suite/bandwidthTest/bandwidthTest.cu:756 code=801(cudaErrorNotSupported) "cudaEventCreate(&start)"
```
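Since even the stock `bandwidthTest` sample cannot create a CUDA event, the problem appears to sit below PyTorch, at the driver level. A driver-API probe such as the sketch below (using ctypes, assuming Linux with `libcuda.so.1` on the loader path) can confirm whether a context can be created on the device at all:

```python
# Probe the CUDA driver API directly: if cuInit() or primary-context creation
# fails here, the issue is in the driver/GPU environment, not in torch or DB-GPT.
import ctypes

cuda = ctypes.CDLL("libcuda.so.1")

def check(code, what):
    if code != 0:
        msg = ctypes.c_char_p()
        cuda.cuGetErrorString(code, ctypes.byref(msg))
        text = msg.value.decode() if msg.value else "unknown error"
        raise SystemExit(f"{what} failed with CUDA error {code}: {text}")

check(cuda.cuInit(0), "cuInit")

count = ctypes.c_int()
check(cuda.cuDeviceGetCount(ctypes.byref(count)), "cuDeviceGetCount")
print("Devices visible to the driver:", count.value)

# Retaining the primary context is roughly what PyTorch does on first use of cuda:0.
dev = ctypes.c_int(0)
check(cuda.cuDeviceGet(ctypes.byref(dev), 0), "cuDeviceGet")
ctx = ctypes.c_void_p()
check(cuda.cuDevicePrimaryCtxRetain(ctypes.byref(ctx), dev), "cuDevicePrimaryCtxRetain")
print("Primary context created OK")
cuda.cuDevicePrimaryCtxRelease(dev)
```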

> Which version of torch are you using? Maybe you can try upgrading your torch version.

alex198208 commented 1 year ago

Maybe something is wrong with CUDA, but I have no idea.

yongzheJIN commented 4 months ago

Have you solved this? I'm hitting this bug too.