dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

JSON generation example fails using vLLM: JSONDecodeError #1214

Closed · lemig closed this issue 1 month ago

lemig commented 1 month ago

Describe the issue as clearly as possible:

I have tried your Pydantic example from: https://dottxt-ai.github.io/outlines/latest/reference/generation/json/

It works OK, as is, with: model = models.transformers("microsoft/Phi-3-mini-4k-instruct")

Also OK: model = models.transformers("microsoft/Phi-3-mini-4k-instruct", device="cuda")

But I get a JSONDecodeError with: model = models.vllm("microsoft/Phi-3-mini-4k-instruct", tensor_parallel_size=4)

Steps/code to reproduce the bug:

from pydantic import BaseModel
from outlines import models, generate

class User(BaseModel):
    name: str
    last_name: str
    id: int

model = models.vllm(
    "microsoft/Phi-3-mini-4k-instruct", 
    tensor_parallel_size=4
)

generator = generate.json(model, User)
print("generator OK")
result = generator(
    "Create a user profile with the fields name, last_name and id"
)
print(result)

Expected result:

# User(name="John", last_name="Doe", id=11)

Error message:

/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/vllm/connections.py:8: RuntimeWarning: Failed to read commit hash:
No module named 'vllm._version'
  from vllm.version import __version__ as VLLM_VERSION
INFO 10-18 17:20:08 config.py:887] Defaulting to use mp for distributed inference
INFO 10-18 17:20:08 llm_engine.py:237] Initializing an LLM engine (vdev) with config: model='microsoft/Phi-3-mini-4k-instruct', speculative_config=None, tokenizer='microsoft/Phi-3-mini-4k-instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=4, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=microsoft/Phi-3-mini-4k-instruct, use_v2_block_manager=True, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=False, mm_processor_kwargs=None)
WARNING 10-18 17:20:08 multiproc_gpu_executor.py:53] Reducing Torch parallelism from 16 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
INFO 10-18 17:20:08 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 10-18 17:20:09 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
INFO 10-18 17:20:09 selector.py:115] Using XFormers backend.
/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:09 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:09 selector.py:115] Using XFormers backend.
/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:09 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:09 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:09 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:09 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=2652844) /home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2652844)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2652843) /home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2652845) /home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2652843)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2652845)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=2652844) /home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2652844)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2652845) /home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2652845)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2652843) /home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=2652843)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:09 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:09 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:09 multiproc_worker_utils.py:216] Worker ready; awaiting tasks
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:10 utils.py:1008] Found nccl from library libnccl.so.2
INFO 10-18 17:20:10 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:10 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:10 utils.py:1008] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:10 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 10-18 17:20:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:10 pynccl.py:63] vLLM is using nccl==2.20.5
(VllmWorkerProcess pid=2652845) WARNING 10-18 17:20:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
WARNING 10-18 17:20:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2652843) WARNING 10-18 17:20:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
(VllmWorkerProcess pid=2652844) WARNING 10-18 17:20:10 custom_all_reduce.py:132] Custom allreduce is disabled because it's not supported on more than two PCIe-only GPUs. To silence this warning, specify disable_custom_all_reduce=True explicitly.
INFO 10-18 17:20:10 shm_broadcast.py:241] vLLM message queue communication handle: Handle(connect_ip='127.0.0.1', local_reader_ranks=[1, 2, 3], buffer=<vllm.distributed.device_communicators.shm_broadcast.ShmRingBuffer object at 0x7f756a461310>, local_subscribe_port=36793, remote_subscribe_port=None)
INFO 10-18 17:20:10 model_runner.py:1060] Starting to load model microsoft/Phi-3-mini-4k-instruct...
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:10 model_runner.py:1060] Starting to load model microsoft/Phi-3-mini-4k-instruct...
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:10 model_runner.py:1060] Starting to load model microsoft/Phi-3-mini-4k-instruct...
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:10 model_runner.py:1060] Starting to load model microsoft/Phi-3-mini-4k-instruct...
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:10 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:10 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:10 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:10 selector.py:115] Using XFormers backend.
INFO 10-18 17:20:10 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
INFO 10-18 17:20:10 selector.py:115] Using XFormers backend.
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:10 selector.py:247] Cannot use FlashAttention-2 backend due to sliding window.
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:10 selector.py:115] Using XFormers backend.
INFO 10-18 17:20:10 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:10 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:10 weight_utils.py:243] Using model weights format ['*.safetensors']
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:11 weight_utils.py:243] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  50% Completed | 1/2 [00:00<00:00,  6.51it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  4.04it/s]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:00<00:00,  4.28it/s]

INFO 10-18 17:20:11 model_runner.py:1071] Loading model weights took 1.7960 GB
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:12 model_runner.py:1071] Loading model weights took 1.7960 GB
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:12 model_runner.py:1071] Loading model weights took 1.7960 GB
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:12 model_runner.py:1071] Loading model weights took 1.7960 GB
INFO 10-18 17:20:13 distributed_gpu_executor.py:57] # GPU blocks: 25474, # CPU blocks: 2730
INFO 10-18 17:20:13 distributed_gpu_executor.py:61] Maximum concurrency for 4096 tokens per request: 99.51x
INFO 10-18 17:20:15 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
INFO 10-18 17:20:15 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:15 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:15 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:15 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:15 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:15 model_runner.py:1402] Capturing the model for CUDA graphs. This may lead to unexpected consequences if the model is not static. To run the model in eager mode, set 'enforce_eager=True' or use '--enforce-eager' in the CLI.
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:15 model_runner.py:1406] CUDA graphs can take additional 1~3 GiB memory per GPU. If you are running out of memory, consider decreasing `gpu_memory_utilization` or enforcing eager mode. You can also reduce the `max_num_seqs` as needed to decrease memory usage.
(VllmWorkerProcess pid=2652845) INFO 10-18 17:20:33 model_runner.py:1530] Graph capturing finished in 17 secs.
INFO 10-18 17:20:33 model_runner.py:1530] Graph capturing finished in 18 secs.
(VllmWorkerProcess pid=2652844) INFO 10-18 17:20:33 model_runner.py:1530] Graph capturing finished in 18 secs.
(VllmWorkerProcess pid=2652843) INFO 10-18 17:20:33 model_runner.py:1530] Graph capturing finished in 18 secs.
Compiling FSM index for all state transitions: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:00<00:00, 107.74it/s]
generator OK
Processed prompts: 100%|███████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.83it/s, est. speed input: 109.80 toks/s, output: 125.47 toks/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/pydantic/main.py", line 1187, in parse_raw
[rank0]:     obj = parse.load_str_bytes(
[rank0]:           ^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
[rank0]:     return json_loads(b)  # type: ignore
[rank0]:            ^^^^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/json/__init__.py", line 346, in loads
[rank0]:     return _default_decoder.decode(s)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/json/decoder.py", line 337, in decode
[rank0]:     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/json/decoder.py", line 353, in raw_decode
[rank0]:     obj, end = self.scan_once(s, idx)
[rank0]:                ^^^^^^^^^^^^^^^^^^^^^^
[rank0]: json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 46 (char 45)

[rank0]: During handling of the above exception, another exception occurred:

[rank0]: Traceback (most recent call last):
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/projects/tedscrap/debug.py", line 16, in <module>
[rank0]:     result = generator(
[rank0]:              ^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/outlines/generate/api.py", line 511, in __call__
[rank0]:     return format(completions)
[rank0]:            ^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/outlines/generate/api.py", line 497, in format
[rank0]:     return self.format_sequence(sequences)
[rank0]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/outlines/generate/json.py", line 50, in <lambda>
[rank0]:     generator.format_sequence = lambda x: schema_object.parse_raw(x)
[rank0]:                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^
[rank0]:   File "/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/site-packages/pydantic/main.py", line 1214, in parse_raw
[rank0]:     raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
[rank0]: pydantic_core._pydantic_core.ValidationError: 1 validation error for User
[rank0]: __root__
[rank0]:   Expecting ',' delimiter: line 1 column 46 (char 45) [type=value_error.jsondecode, input_value='{"name":"John Smith","last_name":"Doe","id":5', input_type=str]
ERROR 10-18 17:20:38 multiproc_worker_utils.py:117] Worker VllmWorkerProcess pid 2652845 died, exit code: -15
INFO 10-18 17:20:38 multiproc_worker_utils.py:121] Killing local vLLM worker processes
/home/NET1.CEC.EU.INT/cabermi/anaconda3/envs/outlines/lib/python3.11/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

Outlines/Python version information:

python -c "from outlines import _version; print(_version.version)" python -c "import sys; print('Python', sys.version)" pip freeze 0.0.46 Python 3.11.10 (main, Oct 3 2024, 07:29:13) [GCC 11.2.0] accelerate==1.0.1 aiohappyeyeballs==2.4.3 aiohttp==3.10.10 aiosignal==1.3.1 airportsdata==20241001 annotated-types==0.7.0 anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1728935693959/work argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1692818318753/work argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1725356582126/work arrow @ file:///home/conda/feedstock_root/build_artifacts/arrow_1696128962909/work asttokens @ file:///home/conda/feedstock_root/build_artifacts/asttokens_1698341106958/work async-lru @ file:///home/conda/feedstock_root/build_artifacts/async-lru_1690563019058/work attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1722977137225/work Babel @ file:///home/conda/feedstock_root/build_artifacts/babel_1702422572539/work beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1705564648255/work bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1696630167146/work Brotli @ file:///home/conda/feedstock_root/build_artifacts/brotli-split_1725267488082/work cached-property @ file:///home/conda/feedstock_root/build_artifacts/cached_property_1615209429212/work certifi @ file:///home/conda/feedstock_root/build_artifacts/certifi_1725278078093/work/certifi cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1725560564262/work charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1728479282467/work click==8.1.7 cloudpickle==3.1.0 comm @ file:///home/conda/feedstock_root/build_artifacts/comm_1710320294760/work datasets==3.0.1 debugpy @ file:///home/conda/feedstock_root/build_artifacts/debugpy_1728594126643/work decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work defusedxml @ file:///home/conda/feedstock_root/build_artifacts/defusedxml_1615232257335/work dill==0.3.8 diskcache==5.6.3 distro==1.9.0 dnspython==2.7.0 einops==0.8.0 email_validator==2.2.0 entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work exceptiongroup @ file:///home/conda/feedstock_root/build_artifacts/exceptiongroup_1720869315914/work executing @ file:///home/conda/feedstock_root/build_artifacts/executing_1725214404607/work fastapi==0.115.2 fastjsonschema @ file:///home/conda/feedstock_root/build_artifacts/python-fastjsonschema_1718477020893/work/dist filelock==3.16.1 fqdn @ file:///home/conda/feedstock_root/build_artifacts/fqdn_1638810296540/work/dist frozenlist==1.4.1 fsspec==2024.6.1 gguf==0.10.0 guidance==0.1.16 h11 @ file:///home/conda/feedstock_root/build_artifacts/h11_1664132893548/work h2 @ file:///home/conda/feedstock_root/build_artifacts/h2_1634280454336/work hpack==4.0.0 httpcore @ file:///home/conda/feedstock_root/build_artifacts/httpcore_1727820890233/work httptools==0.6.2 httpx @ file:///home/conda/feedstock_root/build_artifacts/httpx_1724778349782/work huggingface-hub==0.25.2 hyperframe @ file:///home/conda/feedstock_root/build_artifacts/hyperframe_1619110129307/work idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1726459485162/work importlib_metadata @ file:///home/conda/feedstock_root/build_artifacts/importlib-metadata_1726082825846/work importlib_resources @ 
file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1725921340658/work interegular==0.3.3 ipykernel @ file:///croot/ipykernel_1728665589812/work ipython @ file:///home/conda/feedstock_root/build_artifacts/ipython_1727944696411/work ipython_genutils @ file:///home/conda/feedstock_root/build_artifacts/ipython_genutils_1716278396992/work ipywidgets @ file:///home/conda/feedstock_root/build_artifacts/ipywidgets_1724334859652/work isoduration @ file:///home/conda/feedstock_root/build_artifacts/isoduration_1638811571363/work/dist jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1696326070614/work Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1715127149914/work jiter==0.6.1 json5 @ file:///home/conda/feedstock_root/build_artifacts/json5_1712986206667/work jsonpointer @ file:///home/conda/feedstock_root/build_artifacts/jsonpointer_1725302941992/work jsonschema @ file:///home/conda/feedstock_root/build_artifacts/jsonschema_1720529478715/work jsonschema-specifications @ file:///tmp/tmpvslgxhz5/src jupyter==1.1.1 jupyter-console==6.6.3 jupyter-contrib-core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_contrib_core_1657548529421/work jupyter-contrib-nbextensions @ file:///home/conda/feedstock_root/build_artifacts/jupyter_contrib_nbextensions_1670068802953/work jupyter-events @ file:///home/conda/feedstock_root/build_artifacts/jupyter_events_1710805637316/work jupyter-highlight-selected-word @ file:///home/conda/feedstock_root/build_artifacts/jupyter_highlight_selected_word_1695322379939/work jupyter-latex-envs @ file:///home/conda/feedstock_root/build_artifacts/jupyter_latex_envs_1614852190293/work jupyter-lsp @ file:///home/conda/feedstock_root/build_artifacts/jupyter-lsp-meta_1712707420468/work/jupyter-lsp jupyter-nbextensions-configurator @ file:///home/conda/feedstock_root/build_artifacts/jupyter_nbextensions_configurator_1670793770953/work jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1673615989977/work jupyter_core @ file:///home/conda/feedstock_root/build_artifacts/jupyter_core_1727163409502/work jupyter_server @ file:///home/conda/feedstock_root/build_artifacts/jupyter_server_1720816649297/work jupyter_server_terminals @ file:///home/conda/feedstock_root/build_artifacts/jupyter_server_terminals_1710262634903/work jupyterlab @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_1724745148804/work jupyterlab_pygments @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_pygments_1707149102966/work jupyterlab_server @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_server-split_1721163288448/work jupyterlab_widgets @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_widgets_1724331334887/work lark==1.2.2 llvmlite==0.43.0 lm-format-enforcer==0.10.6 lxml @ file:///croot/lxml_1722882187815/work MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1728489060918/work matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1713250518406/work mistral_common==1.4.4 mistune @ file:///home/conda/feedstock_root/build_artifacts/mistune_1698947099619/work mpmath==1.3.0 msgpack==1.1.0 msgspec==0.18.6 multidict==6.1.0 multiprocess==0.70.16 nbclassic @ file:///home/conda/feedstock_root/build_artifacts/nbclassic_1716838762700/work nbclient @ file:///home/conda/feedstock_root/build_artifacts/nbclient_1710317608672/work nbconvert @ file:///home/conda/feedstock_root/build_artifacts/nbconvert-meta_1718135430380/work nbformat 
@ file:///home/conda/feedstock_root/build_artifacts/nbformat_1712238998817/work nest_asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1705850609492/work networkx==3.4.1 notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1715848908871/work notebook_shim @ file:///home/conda/feedstock_root/build_artifacts/notebook-shim_1707957777232/work numba==0.60.0 numpy==1.26.4 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==12.560.30 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.6.77 nvidia-nvtx-cu12==12.1.105 openai==1.51.2 opencv-python-headless==4.10.0.84 ordered-set==4.1.0 outlines==0.0.46 outlines_core==0.1.14 overrides @ file:///home/conda/feedstock_root/build_artifacts/overrides_1706394519472/work packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1718189413536/work pandas==2.2.3 pandocfilters @ file:///home/conda/feedstock_root/build_artifacts/pandocfilters_1631603243851/work parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1712320355065/work partial-json-parser==0.2.1.1.post4 pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1706113125309/work pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work pillow==10.4.0 pkgutil_resolve_name @ file:///home/conda/feedstock_root/build_artifacts/pkgutil-resolve-name_1694617248815/work platformdirs @ file:///home/conda/feedstock_root/build_artifacts/platformdirs_1726613481435/work prometheus-fastapi-instrumentator==7.0.0 prometheus_client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1726901976720/work prompt_toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1727341649933/work propcache==0.2.0 protobuf==5.28.2 psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1728965152023/work ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl pure_eval @ file:///home/conda/feedstock_root/build_artifacts/pure_eval_1721585709575/work py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==17.0.0 pycountry==24.6.1 pycparser @ file:///home/conda/feedstock_root/build_artifacts/pycparser_1711811537435/work pydantic==2.9.2 pydantic_core==2.23.4 Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1714846767233/work PySocks @ file:///home/conda/feedstock_root/build_artifacts/pysocks_1661604839144/work python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-json-logger @ file:///home/conda/feedstock_root/build_artifacts/python-json-logger_1677079630776/work pytz @ file:///home/conda/feedstock_root/build_artifacts/pytz_1726055524169/work PyYAML @ file:///home/conda/feedstock_root/build_artifacts/pyyaml_1725456139051/work pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1728642222605/work ray==2.37.0 referencing @ file:///home/conda/feedstock_root/build_artifacts/referencing_1714619483868/work regex==2024.9.11 requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1717057054362/work rfc3339-validator @ file:///home/conda/feedstock_root/build_artifacts/rfc3339-validator_1638811747357/work rfc3986-validator @ file:///home/conda/feedstock_root/build_artifacts/rfc3986-validator_1598024191506/work rpds-py @ 
file:///home/conda/feedstock_root/build_artifacts/rpds-py_1725327039958/work safetensors==0.4.5 Send2Trash @ file:///home/conda/feedstock_root/build_artifacts/send2trash_1712584999685/work sentencepiece==0.2.0 six @ file:///home/conda/feedstock_root/build_artifacts/six_1620240208055/work sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1708952932303/work soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1693929250441/work stack-data @ file:///home/conda/feedstock_root/build_artifacts/stack_data_1669632077133/work starlette==0.40.0 sympy==1.13.3 terminado @ file:///home/conda/feedstock_root/build_artifacts/terminado_1710262609923/work tiktoken==0.7.0 tinycss2 @ file:///home/conda/feedstock_root/build_artifacts/tinycss2_1713974937325/work tokenizers==0.20.1 tomli @ file:///home/conda/feedstock_root/build_artifacts/tomli_1727974628237/work torch==2.4.0 torchvision==0.19.0 tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1724956126282/work tqdm==4.66.5 traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1713535121073/work transformers==4.45.2 triton==3.0.0 types-python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/types-python-dateutil_1727940235703/work typing-utils @ file:///home/conda/feedstock_root/build_artifacts/typing_utils_1622899189314/work typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1717802530399/work tzdata==2024.2 uri-template @ file:///home/conda/feedstock_root/build_artifacts/uri-template_1688655812972/work/dist urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1726496430923/work uvicorn==0.32.0 uvloop==0.21.0 vllm==0.6.3 watchfiles==0.24.0 wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1704731205417/work webcolors @ file:///home/conda/feedstock_root/build_artifacts/webcolors_1723294704277/work webencodings @ file:///home/conda/feedstock_root/build_artifacts/webencodings_1694681268211/work websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1713923384721/work websockets==13.1 widgetsnbextension @ file:///home/conda/feedstock_root/build_artifacts/widgetsnbextension_1724331337528/work xformers==0.0.27.post2 xxhash==3.5.0 yarl==1.15.3 zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1726248574750/work zstandard==0.23.0

Context for the issue:

No response

cpfiffer commented 1 month ago

I believe this is #1173. vLLM has a relatively low max_tokens default, which can cause decode errors due to early termination.

You should be able to run:

from pydantic import BaseModel
from outlines import models, generate

class User(BaseModel):
    name: str
    last_name: str
    id: int

model = models.vllm(
    "microsoft/Phi-3-mini-4k-instruct", 
    tensor_parallel_size=4
)

generator = generate.json(model, User)
print("generator OK")
result = generator(
    "Create a user profile with the fields name, last_name and id",
    max_tokens=30000  # raise the token limit so generation is not cut off mid-JSON
)
print(result)
lemig commented 1 month ago

Thanks, that works!

lemig commented 1 month ago

The max_tokens parameter solves the issue.

EricLee8 commented 3 weeks ago

FYI, when using the vLLM server, you can add {"max_tokens": 1024} to the body of the cURL request or of a Python requests call.
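
For reference, a minimal sketch of that server-side approach, assuming a vLLM OpenAI-compatible server is already running (started with, for example, vllm serve microsoft/Phi-3-mini-4k-instruct --port 8000). The host, port, and the guided_json field are assumptions about your deployment; treat this as an illustration rather than the exact request used in this thread.

import json
import requests

# JSON Schema equivalent of the User Pydantic model from the report above.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "last_name": {"type": "string"},
        "id": {"type": "integer"},
    },
    "required": ["name", "last_name", "id"],
}

payload = {
    "model": "microsoft/Phi-3-mini-4k-instruct",
    "prompt": "Create a user profile with the fields name, last_name and id",
    "max_tokens": 1024,     # explicit limit so the JSON is not truncated mid-object
    "guided_json": schema,  # vLLM's guided-decoding extension for JSON-constrained output
}

response = requests.post("http://localhost:8000/v1/completions", json=payload)
print(json.loads(response.json()["choices"][0]["text"]))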