captify-sivakhno opened 4 weeks ago
I actually ran into a similar issue when trying to add CFG support to https://github.com/huggingface/text-generation-inference. Same error message, with the same code path in the last three function calls (see trace below). Any hints would be appreciated.
2024-10-31T06:56:33.558057Z ERROR text_generation_launcher: Method Decode encountered an error. Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 116, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 218, in Decode
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1968, in generate_token
) = batch.next_token_chooser(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/tokens.py", line 364, in __call__
_scores = self.grammar_processor(_scores, self.fsm_grammar_states)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/logits_process.py", line 597, in __call__
allowed_tokens = fsm.get_next_instruction(fsm_grammar_states[i]).tokens
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 154, in get_next_instruction
valid_tokens = list(
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 189, in iter_valid_token_ids
self._get_parser_state_token_applied(state, int(token_id))
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 241, in _get_parser_state_token_applied
prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3999, in decode
return self._decode(
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
It's hard to know whether it's an issue on their end or ours. Running the same thing in outlines directly should tell us.
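For anyone who wants to try that cross-check, here is a minimal sketch of what it could look like (the model choice is arbitrary; this assumes the transformers backend and the arithmetic grammar bundled with outlines, and is not a verified repro):
from outlines import generate, grammars, models

# Small HF model chosen arbitrarily for the check; any causal LM should do
model = models.transformers("gpt2")

# Bundled arithmetic grammar, i.e. the same kind of Lark grammar used in the vLLM repro below
generator = generate.cfg(model, grammars.arithmetic)
print(generator("Write an arithmetic expression for 4 minus 2:", max_tokens=20))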
I found the same issue. I think it goes wrong because of this line (outlines/fsm/guide.py:241):
prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
The tokenizer does not expect a 2d list. Changing it to:
prev_token_str = self.tokenizer.decode([state.prev_token])[0]
This fixes it for me, but I run into another issue afterwards (which could be unrelated).
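For what it's worth, the nested-list behaviour is easy to see in isolation with any fast tokenizer (a minimal sketch; the model name is chosen arbitrarily):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer
token_id = tokenizer.encode("hello")[0]

# Flat list of ints: works as expected
print(tokenizer.decode([token_id]))

# Nested list: raises
# TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
tokenizer.decode([[token_id]])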
Hi everyone,
I encountered an issue when attempting to use the generate.cfg function with a VLLM model. The code throws a NotImplementedError, indicating that the CFG Logits processor is not available for the VLLM class.
Exception has occurred: NotImplementedError
The CFG Logits processor is not available for <class 'outlines.models.vllm.VLLM'>.
File "/home/lepagnol/Documents/These/format-constrained-for-slu/vllm_test.py", line 30, in <module>
generator = generate.cfg(model, arithmetic_grammar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The CFG Logits processor is not available for <class 'outlines.models.vllm.VLLM'>.
from vllm import LLM, SamplingParams
llm = LLM(
"neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",
enable_prefix_caching=True,
block_size=64,
max_num_batched_tokens=15000,
gpu_memory_utilization=0.96,
max_model_len=15000,
use_v2_block_manager=True,
)
arithmetic_grammar = """
?start: expression
?expression: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER
| "-" factor
| "(" expression ")"
%import common.NUMBER
"""
from outlines import generate, models
model = models.VLLM(llm)
generator = generate.cfg(model, arithmetic_grammar)
sampling_params = SamplingParams(temperature=0.3, top_p=0.2, max_tokens=20)
sequence = generator(
"Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:",
sampling_params=sampling_params,
)
I expected the code to generate a sequence based on the defined grammar using the VLLM model.
The code throws a NotImplementedError, suggesting that the CFG Logits processor is not implemented for the VLLM model.
Python: 3.12
Outlines: 0.0.46
vLLM: 0.6.4.post2.dev67+g63f1fde2.cpu
Model: neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8
Is the CFG Logits processor not yet supported for VLLM, or is there a workaround for this issue? If it's a known limitation, are there any plans to support it in the future?
Thank you!
Describe the issue as clearly as possible:
When running the provided arithmetic grammar example with vLLM, I get an error:
TypeError: Error in model execution: argument 'ids': 'list' object cannot be interpreted as an integer
I presume this comes from de-tokenization, but I'm still not sure how to fix it. Any suggestions would be welcome, as we have used outlines with vLLM successfully on a number of other use cases and really like the tool!
Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Context for the issue:
No response