deepjavalibrary / djl-serving

A universal scalable machine learning model deployment solution
Apache License 2.0

djl-inference:0.29.0-tensorrtllm0.11.0-cu124 regression: has no attribute 'to_word_list_format' #2293

Open lxning opened 2 months ago

lxning commented 2 months ago

Description


The LMI TensorRT-LLM containers show two different behaviors when testing the gsm8k dataset via lm_eval_harness on the llama-2-7b model.

Expected Behavior

lm_eval_harness should be able to generate a report when the djl-inference:0.29.0-tensorrtllm0.11.0-cu124 image is used.

Error Message


Error log in djl-inference:0.29.0-tensorrtllm0.11.0-cu124

[INFO ] 2024-08-07 17:45:57 ModelServer - Initialize BOTH server with: EpollServerSocketChannel.
[INFO ] 2024-08-07 17:45:57 ModelServer - BOTH API bind to: http://0.0.0.0:8080
[WARN ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stderr: [1,0]<stderr>:No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:Rolling batch inference error
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:Traceback (most recent call last):
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:  File "/tmp/.djl.ai/python/0.29.0/djl_python/rolling_batch/rolling_batch.py", line 48, in try_catch_handling
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:    return func(self, *args, **kwargs)
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:  File "/tmp/.djl.ai/python/0.29.0/djl_python/rolling_batch/trtllm_rolling_batch.py", line 108, in inference
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:    response = self.model.generate(request.input_text, **param)
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmmodel/modelbuilder.py", line 268, in generate
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:    final_kwargs = self._prepare_inputs_for_generation(inputs, **parameters)
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm_toolkit/trtllmmodel/modelbuilder.py", line 341, in _prepare_inputs_for_generation
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:    parameters["stop_words_list"] = tensorrt_llm.runtime.to_word_list_format(stop_sequences, self.tokenizer)
[INFO ] 2024-08-07 18:01:48 PyProcess - W-20055-model-stdout: [1,0]<stdout>:AttributeError: module 'tensorrt_llm.runtime' has no attribute 'to_word_list_format'
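For context, the missing helper packs the request's stop sequences into the flat stop-word layout the TensorRT-LLM runtime consumes. A rough, self-contained sketch of that packing, with a made-up stub tokenizer standing in for the real one (the exact layout and padding here are assumptions for illustration, not the toolkit's actual API):

```python
class StubTokenizer:
    """Hypothetical stand-in for the real HF tokenizer (illustration only):
    encodes each character as its Unicode code point."""
    def encode(self, text, add_special_tokens=False):
        return [ord(c) for c in text]

def to_word_list_format(stop_sequences, tokenizer):
    """Pack stop sequences into a [2, max_len] layout (assumed): row 0 is the
    concatenation of every sequence's token ids, row 1 holds each sequence's
    cumulative end offset, padded to the same width with -1."""
    ids, offsets = [], []
    for seq in stop_sequences:
        ids.extend(tokenizer.encode(seq, add_special_tokens=False))
        offsets.append(len(ids))
    offsets += [-1] * (len(ids) - len(offsets))  # pad offsets row to match ids row
    return [ids, offsets]

packed = to_word_list_format(["Q:", "</s>"], StubTokenizer())
print(packed)  # → [[81, 58, 60, 47, 115, 62], [2, 6, -1, -1, -1, -1]]
```

The error above means the toolkit still calls this helper at its old location, `tensorrt_llm.runtime`, where the 0.11 wheel no longer exposes it.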

How to Reproduce?


Steps to reproduce


  1. aws s3 sync s3://djl-llm/llama-2-7b-hf/ llama-2-7b-hf/

  2. docker run -it --gpus all --shm-size 20g -v /home/ubuntu/trtllm/llama-2-7b:/opt/ml/model -p 8080:8080 763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:0.29.0-tensorrtllm0.11.0-cu124

  3. lm_eval --model local-chat-completions --tasks gsm8k_cot_zeroshot --model_args model=meta-llama/Meta-Llama-2-7B,base_url=http://localhost:8080/v1/chat/completions/model,tokenized_requests=True --limit 10 --apply_chat_template --write_out --log_samples --output_path ~/trtllm/lm_eval/output_llama-2-7b-gsm8k_cot_zeroshot_v11

    2024-08-07:18:01:48,442 INFO     [evaluator_utils.py:200] Request: Instance(request_type='generate_until', doc={'question': "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?", 'answer': 'Janet sells 16 - 3 - 4 = <<16-3-4=9>>9 duck eggs a day.\nShe makes 9 * 2 = $<<9*2=18>>18 every day at the farmer’s market.\n#### 18'}, arguments=(JsonChatStr(prompt='[{"role": "user", "content": "Q: Janet\\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers\' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers\' market?\\nA: Let\'s think step by step."}]'), {'until': ['Q:', '</s>', '<|im_end|>'], 'do_sample': False}), idx=0, metadata=('gsm8k_cot_zeroshot', 0, 1), resps=[], filtered_resps={}, task_name='gsm8k_cot_zeroshot', doc_id=0, repeats=1)
    2024-08-07:18:01:48,442 INFO     [evaluator.py:457] Running generate_until requests
    Requesting API:   0%|                                                           | 0/10 [00:00<?, ?it/s]2024-08-07:18:01:48,539 WARNING  [api_models.py:342] API request failed with error message: {"error":"module \u0027tensorrt_llm.runtime\u0027 has no attribute \u0027to_word_list_format\u0027","code":424}. Retrying...
    2024-08-07:18:01:49,550 WARNING  [api_models.py:342] API request failed with error message: {"error":"module \u0027tensorrt_llm.runtime\u0027 has no attribute \u0027to_word_list_format\u0027","code":424}. Retrying...
    2024-08-07:18:01:50,558 WARNING  [api_models.py:342] API request failed with error message: {"error":"module \u0027tensorrt_llm.runtime\u0027 has no attribute \u0027to_word_list_format\u0027","code":424}. Retrying...
    Traceback (most recent call last):
      File "/opt/conda/envs/py310/bin/lm_eval", line 8, in <module>
        sys.exit(cli_evaluate())
      File "/home/ubuntu/lm-evaluation-harness/lm_eval/__main__.py", line 382, in cli_evaluate
        results = evaluator.simple_evaluate(
      File "/home/ubuntu/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
        return fn(*args, **kwargs)
      File "/home/ubuntu/lm-evaluation-harness/lm_eval/evaluator.py", line 296, in simple_evaluate
        results = evaluate(
      File "/home/ubuntu/lm-evaluation-harness/lm_eval/utils.py", line 397, in _wrapper
        return fn(*args, **kwargs)
      File "/home/ubuntu/lm-evaluation-harness/lm_eval/evaluator.py", line 468, in evaluate
        resps = getattr(lm, reqtype)(cloned_reqs)
      File "/home/ubuntu/lm-evaluation-harness/lm_eval/models/api_models.py", line 562, in generate_until
        outputs = retry(
      File "/opt/conda/envs/py310/lib/python3.10/site-packages/tenacity/__init__.py", line 336, in wrapped_f
        return copy(f, *args, **kw)
      File "/opt/conda/envs/py310/lib/python3.10/site-packages/tenacity/__init__.py", line 475, in __call__
        do = self.iter(retry_state=retry_state)
      File "/opt/conda/envs/py310/lib/python3.10/site-packages/tenacity/__init__.py", line 376, in iter
        result = action(retry_state)
      File "/opt/conda/envs/py310/lib/python3.10/site-packages/tenacity/__init__.py", line 418, in exc_check
        raise retry_exc.reraise()
      File "/opt/conda/envs/py310/lib/python3.10/site-packages/tenacity/__init__.py", line 185, in reraise
        raise self.last_attempt.result()
      File "/opt/conda/envs/py310/lib/python3.10/concurrent/futures/_base.py", line 451, in result
        return self.__get_result()
      File "/opt/conda/envs/py310/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
        raise self._exception
      File "/opt/conda/envs/py310/lib/python3.10/site-packages/tenacity/__init__.py", line 478, in __call__
        result = fn(*args, **kwargs)
      File "/home/ubuntu/lm-evaluation-harness/lm_eval/models/api_models.py", line 345, in model_call
        response.raise_for_status()
      File "/opt/conda/envs/py310/lib/python3.10/site-packages/requests/models.py", line 1024, in raise_for_status
        raise HTTPError(http_error_msg, response=self)
    requests.exceptions.HTTPError: 424 Client Error: module 'tensorrt_llm.runtime' has no attribute 'to_word_list_format' for url: http://localhost:8080/v1/chat/completions/model
    Requesting API:   0%|
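The failure can also be reproduced without the harness by sending a single request with stop sequences to the running container (a minimal sketch; the endpoint path comes from the trace above, while the payload fields are assumptions based on the OpenAI-style chat schema):

```python
import json
import urllib.request
import urllib.error

# Same endpoint lm_eval hits in the trace above.
URL = "http://localhost:8080/v1/chat/completions/model"

payload = {
    "messages": [{"role": "user", "content": "Q: What is 2 + 2?\nA: Let's think step by step."}],
    "stop": ["Q:", "</s>"],  # stop sequences are what trigger the to_word_list_format call
    "max_tokens": 32,
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print(resp.status, resp.read().decode())
except urllib.error.HTTPError as e:
    # The broken image answers 424 with the AttributeError message.
    print(e.code, e.read().decode())
except urllib.error.URLError as e:
    print("server not reachable:", e.reason)
```

With the broken image running, this prints the same 424 error body the harness sees; omitting "stop" makes the request succeed.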


sindhuvahinis commented 2 months ago

Thanks for reporting this. Will take a look at it today.

pdtgct commented 2 months ago

I can confirm seeing this issue in djl-inference:0.29.0-tensorrtllm0.11.0-cu124.

Steps to reproduce:

Send a POST request with the stop parameter:

{
  "inputs": "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\nYou are rolling a 12-sided dice twice.\n\nQuestion: Can I win more than once?\n<|eot_id|>\n\n<|start_header_id|>assistant<|end_header_id|> Answer:",
  "parameters": {
    "do_sample": false,
    "details": false,
    "temperature": 0.7,
    "top_p": 0.92,
    "max_new_tokens": 220,
    "stop": ["<|eot_id|>"]
  }
}

Note: the model does not stop on "<|eot_id|>" so the stop parameter is needed.
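Until a fixed image is available, one client-side workaround is to omit the server-side stop parameter and trim the completion yourself (a sketch; trim_at_stop is a hypothetical helper, not part of DJL):

```python
def trim_at_stop(text, stop_sequences):
    """Cut `text` at the earliest occurrence of any stop sequence
    (hypothetical client-side stand-in for the broken server-side `stop`)."""
    cut = len(text)
    for seq in stop_sequences:
        idx = text.find(seq)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

raw = "The answer is 42.<|eot_id|>\n\n<|start_header_id|>user<|end_header_id|>"
print(trim_at_stop(raw, ["<|eot_id|>"]))  # → The answer is 42.
```

This wastes tokens (the model keeps generating past the stop string) but produces clean output until the server-side fix lands.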

sindhuvahinis commented 1 month ago

We fixed the issue and released a patched image. @lxning, please try it now.

@pdtgct Could you try with stop_sequences instead of just stop?
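For example, the same parameters block with just that field renamed (a sketch; whether the container accepts stop_sequences is exactly what is being asked here):

```python
import json

params = {
    "do_sample": False,
    "max_new_tokens": 220,
    "stop": ["<|eot_id|>"],
}
# Suggested change: send "stop_sequences" instead of "stop".
params["stop_sequences"] = params.pop("stop")
print(json.dumps({"parameters": params}, indent=2))
```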

pdtgct commented 1 month ago

Thanks, @sindhuvahinis - will try to find some time to confirm.