/home/jake/anaconda3/lib/python3.12/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
from vllm.version import version as VLLM_VERSION
INFO 10-14 22:51:57 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='/home/jake/LLaMA-Factory/finetunes', speculative_config=None, tokenizer='/home/jake/LLaMA-Factory/finetunes', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/home/jake/LLaMA-Factory/finetunes, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
INFO 10-14 22:51:59 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-14 22:51:59 selector.py:115] Using XFormers backend.
/home/jake/anaconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_fwd")
/home/jake/anaconda3/lib/python3.12/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: torch.library.impl_abstract was renamed to torch.library.register_fake. Please use that instead; we will remove torch.library.impl_abstract in a future version of PyTorch.
@torch.library.impl_abstract("xformers_flash::flash_bwd")
INFO 10-14 22:52:00 model_runner.py:1060] Starting to load model /home/jake/LLaMA-Factory/finetunes...
INFO 10-14 22:52:00 selector.py:224] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 10-14 22:52:00 selector.py:115] Using XFormers backend.
Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.95s/it]
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:01<00:00, 1.95s/it]
INFO 10-14 22:52:02 model_runner.py:1071] Loading model weights took 2.3185 GB
INFO 10-14 22:52:02 model_runner_base.py:120] Writing input of failed execution to /tmp/err_execute_model_input_20241014-225202.pkl...
INFO 10-14 22:52:02 model_runner_base.py:149] Completed writing input of failed execution to /tmp/err_execute_model_input_20241014-225202.pkl.
rank0: Traceback (most recent call last):
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 116, in _wrapper
rank0: return func(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1665, in execute_model
rank0: hidden_or_intermediate_states = model_executable(
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 556, in forward
rank0: model_output = self.model(input_ids, positions, kv_caches,
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 345, in forward
rank0: hidden_states, residual = layer(positions, hidden_states,
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 257, in forward
rank0: hidden_states = self.self_attn(positions=positions,
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/modelexecutor/models/llama.py", line 184, in forward
rank0: qkv, = self.qkv_proj(hidden_states)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
rank0: return self._call_impl(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
rank0: return forward_call(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 371, in forward
rank0: output_parallel = self.quant_method.apply(self, input_, bias)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/model_executor/layers/linear.py", line 135, in apply
rank0: return F.linear(x, layer.weight, bias)
rank0: RuntimeError: CUDA error: no kernel image is available for execution on the device
rank0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
rank0: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
rank0: The above exception was the direct cause of the following exception:
rank0: Traceback (most recent call last):
rank0: File "/home/jake/Downloads/MMLU-Pro-main/evaluate_from_local.py", line 284, in
rank0: File "/home/jake/Downloads/MMLU-Pro-main/evaluate_from_local.py", line 200, in main
rank0: model, tokenizer = load_model()
rank0: File "/home/jake/Downloads/MMLU-Pro-main/evaluate_from_local.py", line 30, in load_model
rank0: llm = LLM(model=args.model, gpu_memory_utilization=float(args.gpu_util),
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/entrypoints/llm.py", line 177, in initrank0: self.llm_engine = LLMEngine.from_engine_args(
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 574, in from_engine_args
rank0: engine = cls(
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 349, in init
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 484, in _initialize_kv_caches
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 114, in determine_num_available_blocks
rank0: return self.driver_worker.determine_num_available_blocks()
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/worker.py", line 223, in determine_num_available_blocks
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1309, in profile_run
rank0: self.execute_model(model_input, kv_caches, intermediate_tensors)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
rank0: return func(*args, **kwargs)
rank0: File "/home/jake/anaconda3/lib/python3.12/site-packages/vllm/worker/model_runner_base.py", line 152, in _wrapper
rank0: raise type(err)(
rank0: RuntimeError: Error in model execution (input dumped to /tmp/err_execute_model_input_20241014-225202.pkl): CUDA error: no kernel image is available for execution on the device
rank0: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
rank0: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
rank0: Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
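For what it's worth, the decisive line in the traceback above is `RuntimeError: CUDA error: no kernel image is available for execution on the device`, raised from a plain `F.linear` call. That error typically means the installed PyTorch and/or vLLM binaries were not compiled for this GPU's compute capability, and the `selector.py` messages earlier already point at a Volta or Turing card. Here is a minimal diagnostic sketch using standard `torch` introspection (illustrative, not taken from the log):

```python
import torch

# Compute capability of GPU 0, e.g. (7, 0) for Volta or (7, 5) for Turing.
major, minor = torch.cuda.get_device_capability(0)
print(f"GPU: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")

# Architectures the installed PyTorch binary ships kernels for.
print("Compiled arch list:", torch.cuda.get_arch_list())

# If f"sm_{major}{minor}" is missing from that list, every kernel launch on
# this device, including the F.linear call in the traceback, fails with
# "no kernel image is available for execution on the device".
```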
Running on Debian Linux. CUDA 12.6 is installed.
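On the system-CUDA point: pip/conda PyTorch wheels bundle their own CUDA runtime, so the system-wide 12.6 toolkit mostly matters for building extensions (e.g. vLLM or xformers) from source, not for running prebuilt wheels. A quick check of what the Python environment actually uses (again illustrative, not from the log):

```python
import torch

print("torch:", torch.__version__)                # wheel version, e.g. "2.4.0+cu121"
print("bundled CUDA runtime:", torch.version.cuda)
print("CUDA available:", torch.cuda.is_available())
```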