captify-sivakhno opened 4 weeks ago
I actually ran into a similar issue when trying to add CFG support to https://github.com/huggingface/text-generation-inference. Same error message, with the same code path in the last three function calls (see trace below). Any hints would be appreciated.
2024-10-31T06:56:33.558057Z ERROR text_generation_launcher: Method Decode encountered an error. Traceback (most recent call last):
File "/opt/conda/bin/text-generation-server", line 8, in <module>
sys.exit(app())
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 311, in __call__
return get_command(self)(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1157, in __call__
return self.main(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 778, in main
return _main(
File "/opt/conda/lib/python3.11/site-packages/typer/core.py", line 216, in _main
rv = self.invoke(ctx)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/opt/conda/lib/python3.11/site-packages/click/core.py", line 783, in invoke
return __callback(*args, **kwargs)
File "/opt/conda/lib/python3.11/site-packages/typer/main.py", line 683, in wrapper
return callback(**use_params) # type: ignore
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/cli.py", line 116, in serve
server.serve(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 315, in serve
asyncio.run(
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
File "/opt/conda/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 641, in run_until_complete
self.run_forever()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 608, in run_forever
self._run_once()
File "/opt/conda/lib/python3.11/asyncio/base_events.py", line 1936, in _run_once
handle._run()
File "/opt/conda/lib/python3.11/asyncio/events.py", line 84, in _run
self._context.run(self._callback, *self._args)
File "/opt/conda/lib/python3.11/site-packages/grpc_interceptor/server.py", line 165, in invoke_intercept_method
return await self.intercept(
> File "/opt/conda/lib/python3.11/site-packages/text_generation_server/interceptor.py", line 24, in intercept
return await response
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 120, in _unary_interceptor
raise error
File "/opt/conda/lib/python3.11/site-packages/opentelemetry/instrumentation/grpc/_aio_server.py", line 111, in _unary_interceptor
return await behavior(request_or_iterator, context)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/server.py", line 218, in Decode
generations, next_batch, timings = self.model.generate_token(batch)
File "/opt/conda/lib/python3.11/contextlib.py", line 81, in inner
return func(*args, **kwds)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/models/flash_causal_lm.py", line 1968, in generate_token
) = batch.next_token_chooser(
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/tokens.py", line 364, in __call__
_scores = self.grammar_processor(_scores, self.fsm_grammar_states)
File "/opt/conda/lib/python3.11/site-packages/text_generation_server/utils/logits_process.py", line 597, in __call__
allowed_tokens = fsm.get_next_instruction(fsm_grammar_states[i]).tokens
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 154, in get_next_instruction
valid_tokens = list(
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 189, in iter_valid_token_ids
self._get_parser_state_token_applied(state, int(token_id))
File "/opt/conda/lib/python3.11/site-packages/outlines/fsm/guide.py", line 241, in _get_parser_state_token_applied
prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 3999, in decode
return self._decode(
File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 654, in _decode
text = self._tokenizer.decode(token_ids, skip_special_tokens=skip_special_tokens)
TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
It's hard to know whether it's an issue on their end or ours. Running the same thing in outlines directly should tell us.
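For anyone who wants to try that cross-check, here is a minimal sketch of what it could look like (the model choice is arbitrary; this assumes the transformers backend and the arithmetic grammar bundled with outlines, and is not a verified repro):
from outlines import generate, grammars, models

# Small HF model chosen arbitrarily for the check; any causal LM should do
model = models.transformers("gpt2")

# Bundled arithmetic grammar, i.e. the same kind of Lark grammar used in the vLLM repro below
generator = generate.cfg(model, grammars.arithmetic)
print(generator("Write an arithmetic expression for 4 minus 2:", max_tokens=20))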
I found the same issue. I think it goes wrong because of this line (outlines/fsm/guide.py:241):
prev_token_str = self.tokenizer.decode([[state.prev_token]])[0]
The tokenizer does not expect a 2d list. Changing it to:
prev_token_str = self.tokenizer.decode([state.prev_token])[0]
This fixes it for me, but I run into another issue afterwards (which could be unrelated).
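For what it's worth, the nested-list behaviour is easy to see in isolation with any fast tokenizer (a minimal sketch; the model name is chosen arbitrarily):
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any fast tokenizer
token_id = tokenizer.encode("hello")[0]

# Flat list of ints: works as expected
print(tokenizer.decode([token_id]))

# Nested list: raises
# TypeError: argument 'ids': 'list' object cannot be interpreted as an integer
tokenizer.decode([[token_id]])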
Hi everyone,
I encountered an issue when attempting to use the generate.cfg function with a VLLM model. The code throws a NotImplementedError, indicating that the CFG Logits processor is not available for the VLLM class.
Exception has occurred: NotImplementedError
The CFG Logits processor is not available for <class 'outlines.models.vllm.VLLM'>.
File "/home/lepagnol/Documents/These/format-constrained-for-slu/vllm_test.py", line 30, in <module>
generator = generate.cfg(model, arithmetic_grammar)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NotImplementedError: The CFG Logits processor is not available for <class 'outlines.models.vllm.VLLM'>.
from vllm import LLM, SamplingParams
llm = LLM(
"neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8",
enable_prefix_caching=True,
block_size=64,
max_num_batched_tokens=15000,
gpu_memory_utilization=0.96,
max_model_len=15000,
use_v2_block_manager=True,
)
arithmetic_grammar = """
?start: expression
?expression: term (("+" | "-") term)*
?term: factor (("*" | "/") factor)*
?factor: NUMBER
| "-" factor
| "(" expression ")"
%import common.NUMBER
"""
from outlines import generate, models
model = models.VLLM(llm)
generator = generate.cfg(model, arithmetic_grammar)
sampling_params = SamplingParams(temperature=0.3, top_p=0.2, max_tokens=20)
sequence = generator(
"Alice had 4 apples and Bob ate 2. Write an expression for Alice's apples:",
sampling_params=sampling_params,
)
I expected the code to generate a sequence based on the defined grammar using the VLLM model.
The code throws a NotImplementedError, suggesting that the CFG Logits processor is not implemented for the VLLM model.
Python: 3.12
Outlines: 0.0.46
vLLM: 0.6.4.post2.dev67+g63f1fde2.cpu
Model: neuralmagic/Llama-3.2-1B-Instruct-quantized.w8a8
Is the CFG Logits processor not yet supported for VLLM, or is there a workaround for this issue? If it's a known limitation, are there any plans to support it in the future?
Thank you!
Describe the issue as clearly as possible:
When running the provided arithmetic grammar example with vLLM, I get an error:
TypeError: Error in model execution: argument 'ids': 'list' object cannot be interpreted as an integer
I presume this comes from de-tokenization, but I'm still not sure how to fix it. Any suggestions would be welcome, as we have used outlines with vLLM successfully on a number of other use cases and really like the tool!
Steps/code to reproduce the bug:
Expected result:
Error message:
Outlines/Python version information:
Context for the issue:
No response