yanchenmochen opened 2 months ago
The command I use is simple, as follows:
/usr/local/bin/lm-eval --model vllm --model_args pretrained=/mnt/self-define/songquanheng/model/opt-6.7b,tensor_parallel_size=1,gpu_memory_utilization=0.8 --tasks lambada_openai,arc_easy,piqa --device cuda:0
The code related to this error is:
@staticmethod
def _parse_logprobs(tokens: List, outputs, ctxlen: int) -> Tuple[float, bool]:
    """Process logprobs and tokens.

    :param tokens: list
        Input tokens (potentially left-truncated)
    :param outputs: RequestOutput
        Contains prompt_logprobs
    :param ctxlen: int
        Length of context (so we can slice them away and only keep the predictions)
    :return:
        continuation_logprobs: float
            Log probabilities of continuation tokens
        is_greedy: bool
            Whether argmax matches given continuation exactly
    """
    # The first entry of prompt_logprobs is None because the model has no previous tokens to condition on.
    continuation_logprobs_dicts = outputs.prompt_logprobs

    def coerce_logprob_to_num(logprob):
        # vLLM changed the return type of logprobs from float
        # to a Logprob object storing the float value + extra data
        # (https://github.com/vllm-project/vllm/pull/3065).
        # If we are dealing with vllm's Logprob object, return
        # the logprob value stored as an attribute. Otherwise,
        # return the object itself (which should be a float
        # for older versions of vLLM).
        return getattr(logprob, "logprob", logprob)

    continuation_logprobs_dicts = [
        {
            token: coerce_logprob_to_num(logprob)
            for token, logprob in logprob_dict.items()
        }
        if logprob_dict is not None
        else None
        for logprob_dict in continuation_logprobs_dicts
    ]

    # Calculate continuation_logprobs
    # assume ctxlen always >= 1
    continuation_logprobs = sum(
        logprob_dict.get(token)
        for token, logprob_dict in zip(
            tokens[ctxlen:], continuation_logprobs_dicts[ctxlen:]
        )
    )
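For context on where this fails: the TypeError reported below is raised at the list comprehension above, because outputs.prompt_logprobs itself is None, not just its first entry. A minimal defensive sketch, reusing the variable names from the snippet above (this is not the harness's actual code), that would surface the problem more clearly:

# Hypothetical guard, not part of lm-eval: fail with a descriptive error
# instead of a bare TypeError when vLLM does not return prompt logprobs.
continuation_logprobs_dicts = outputs.prompt_logprobs
if continuation_logprobs_dicts is None:
    raise ValueError(
        "vLLM returned prompt_logprobs=None for this request; "
        "loglikelihood scoring requires prompt logprobs to be populated."
    )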
So what is wrong with this execution, and what can I do to solve this problem?
Hi, is there any more to the error output than what you've shared?
Does your locally saved OPT model differ at all from the one downloaded directly from HF?
I ran the command you provided on my machine and did not replicate the error you were getting.
Yes, I saved the OPT model to a local directory. When I use the lm-eval command, "lm-eval --tasks list" does not produce valid output.
I also tried to install lm-evaluation-harness from source: after cloning the repository, I ran "pip install -e .", but it installed the package as "UNKNOWN 0.0.0", and at the same time the executables "lm-eval" and "lm_eval" were not generated. I do not know the reason.
When using the model value of "hf", it works well.
root@145206f3e691:/mnt/self-define/sunning/lmdeploy/vllm_test# lm_eval --model vllm --model_args pretrained=/mnt/self-define/songquanheng/model/opt-6.7b --tasks arc_easy --device cuda:0
INFO 08-06 02:41:19 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='/mnt/self-define/songquanheng/model/opt-6.7b', speculative_config=None, tokenizer='/mnt/self-define/songquanheng/model/opt-6.7b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=1234, served_model_name=/mnt/self-define/songquanheng/model/opt-6.7b)
INFO 08-06 02:41:19 selector.py:37] Using FlashAttention-2 backend.
INFO 08-06 02:41:28 model_runner.py:145] Loading model weights took 12.4036 GB
INFO 08-06 02:41:29 gpu_executor.py:83] # GPU blocks: 2816, # CPU blocks: 512
INFO 08-06 02:41:34 model_runner.py:824] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-06 02:41:34 model_runner.py:828] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-06 02:41:55 model_runner.py:894] Graph capturing finished in 21 secs.
100%|█████████████████████████████████████████████████████████████████████████████████████████████| 2376/2376 [00:01<00:00, 1403.64it/s]
Running loglikelihood requests: 0%| | 0/9501 [00:00<?, ?it/s][rank0]: Traceback (most recent call last):
[rank0]: File "/usr/local/bin/lm_eval", line 8, in <module>
[rank0]: sys.exit(cli_evaluate())
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/__main__.py", line 375, in cli_evaluate
[rank0]: results = evaluator.simple_evaluate(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/utils.py", line 395, in _wrapper
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 277, in simple_evaluate
[rank0]: results = evaluate(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/utils.py", line 395, in _wrapper
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/evaluator.py", line 449, in evaluate
[rank0]: resps = getattr(lm, reqtype)(cloned_reqs)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/api/model.py", line 371, in loglikelihood
[rank0]: return self._loglikelihood_tokens(new_reqs, disable_tqdm=disable_tqdm)
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/vllm_causallms.py", line 448, in _loglikelihood_tokens
[rank0]: answer = self._parse_logprobs(
[rank0]: File "/usr/local/lib/python3.10/dist-packages/lm_eval/models/vllm_causallms.py", line 493, in _parse_logprobs
[rank0]: continuation_logprobs_dicts = [
[rank0]: TypeError: 'NoneType' object is not iterable
Running loglikelihood requests: 0%| | 0/9501 [00:00<?, ?it/s]
root@145206f3e691:/mnt/self-define/sunning/lmdeploy/vllm_test#
This is the running output. I do not know why this happens.
@haileyschoelkopf This problem has blocked me for several days. Do you know what is wrong?
answer = self._parse_logprobs(
    tokens=inp,
    outputs=output,
    ctxlen=ctxlen,
)
def _parse_logprobs(tokens: List, outputs, ctxlen: int) -> Tuple[float, bool]:
    """Process logprobs and tokens.

    :param tokens: list
        Input tokens (potentially left-truncated)
    :param outputs: RequestOutput
        Contains prompt_logprobs
    :param ctxlen: int
        Length of context (so we can slice them away and only keep the predictions)
    :return:
        continuation_logprobs: float
            Log probabilities of continuation tokens
        is_greedy: bool
            Whether argmax matches given continuation exactly
    """
    # The first entry of prompt_logprobs is None because the model has no previous tokens to condition on.
    continuation_logprobs_dicts = outputs.prompt_logprobs
What does outputs.prompt_logprobs represent?
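For reference, RequestOutput.prompt_logprobs in vLLM holds the log probabilities of the prompt tokens themselves, and it is only populated when the request's SamplingParams sets prompt_logprobs; otherwise it stays None. When populated, it is a list with one entry per prompt token: the first entry is None (there is no context to condition on), and each later entry is a dict mapping token ids to their log probabilities. A minimal standalone sketch of what to expect (the model path is a placeholder; assumes vLLM 0.4.x):

from vllm import LLM, SamplingParams

# Hypothetical standalone check, independent of lm-eval: request prompt
# logprobs explicitly and inspect what vLLM returns.
llm = LLM(model="/path/to/opt-6.7b")  # placeholder local model path
params = SamplingParams(max_tokens=1, temperature=0, prompt_logprobs=1)
out = llm.generate(["The quick brown fox jumps over the lazy dog"], params)[0]

print(out.prompt_logprobs is None)     # should be False when prompt_logprobs=1 is set
if out.prompt_logprobs is not None:
    print(out.prompt_logprobs[0])      # None: the first token has no context
    print(out.prompt_logprobs[1])      # e.g. {token_id: Logprob(logprob=-3.2, ...)}

If this standalone call also returns None, the problem is on the vLLM side rather than in lm-evaluation-harness, since the harness's vLLM backend requests prompt logprobs for its loglikelihood requests.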
ns/ms-python.debugpy-2024.0.0-linux-x64/bundled/libs/debugpy/adapter/../../debugpy/launcher 32781 -- /usr/local/bin/lm-eval --model vllm --model_args pretrained=/mnt/self-define/zhangweixing/model/llama2-7b-hf,gpu_memory_utilization=0.8 --tasks arc_easy --device cuda:0
INFO 08-06 09:29:33 llm_engine.py:103] Initializing an LLM engine (v0.4.2) with config: model='/mnt/self-define/zhangweixing/model/llama2-7b-hf', speculative_config=None, tokenizer='/mnt/self-define/zhangweixing/model/llama2-7b-hf', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=1234, served_model_name=/mnt/self-define/zhangweixing/model/llama2-7b-hf)
INFO 08-06 09:29:33 selector.py:37] Using FlashAttention-2 backend.
INFO 08-06 09:29:41 model_runner.py:145] Loading model weights took 12.5523 GB
INFO 08-06 09:29:42 gpu_executor.py:83] # GPU blocks: 2321, # CPU blocks: 512
INFO 08-06 09:29:44 model_runner.py:824] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 08-06 09:29:44 model_runner.py:828] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
INFO 08-06 09:29:51 model_runner.py:894] Graph capturing finished in 8 secs.
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2376/2376 [00:07<00:00, 327.01it/s]
Running loglikelihood requests: 0%| | 0/9501 [00:00<?, ?it/s]
When running the loglikelihood requests, outputs.prompt_logprobs is None. I tried Llama 2 7B and ran into the same problem.
The vLLM version is 0.4.2. I also tried in another environment that is not a container, but encountered the same problem.
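Since the same failure shows up with both OPT-6.7B and Llama-2-7B, it may be worth ruling out an installation mismatch rather than a model problem. A small sanity check per environment (hypothetical, not part of the harness), which also catches the "UNKNOWN 0.0.0" editable install mentioned above:

# Hypothetical environment check: confirm which vllm and lm_eval builds are
# actually imported, since a broken editable install can shadow the intended one.
import importlib.metadata as md
import vllm
import lm_eval

print("vllm:", vllm.__version__, vllm.__file__)
try:
    print("lm_eval:", md.version("lm_eval"), lm_eval.__file__)
except md.PackageNotFoundError:
    print("lm_eval metadata not found; the editable install may be broken")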