br3no opened 6 months ago
Hi @br3no, I have a couple of questions on this issue. Can you please share more detail on these?
ff-tokens are fast-forward tokens. When you are generating guided output, e.g. a json object, there are moments when you don't really need an LLM to generate the next tokens, because the next tokens are specified by the guide. This reduces the load on the GPU and is generally much faster, as you only need to traverse the state-machine.
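As a rough sketch of the idea (the FSM representation here is hypothetical, not outlines' actual data structures): whenever the guide's current state has exactly one outgoing transition, the next token is forced and can be emitted without calling the model.

```python
def fast_forward(fsm, state):
    """Collect tokens forced by the guide.

    `fsm` is a hypothetical mapping: state -> {token_id: next_state}.
    While the current state has exactly one outgoing transition, the
    next token is fully determined, so no LLM forward pass is needed.
    """
    tokens = []
    while len(fsm.get(state, {})) == 1:
        ((token, next_state),) = fsm[state].items()
        tokens.append(token)
        state = next_state
    return tokens, state
```

For example, with `fsm = {0: {10: 1}, 1: {11: 2}, 2: {12: 3, 13: 4}}`, starting from state 0 the guide fast-forwards tokens 10 and 11 and stops at state 2, where the LLM must choose between 12 and 13.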
`Write` and `Generate` are instructions. A `Generate` instruction signals that the next step in the sequence requires an LLM generation; its `tokens` member variable contains the valid next tokens in the sequence, according to the guide (the state machine). A `Write` instruction signals that the next step(s) in the sequence does not require an LLM generation; its `tokens` member variable then contains the next tokens in the sequence.
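A minimal sketch of how a consumer of the guide might act on the two instruction types; the class and function names here are illustrative stand-ins, not outlines' actual API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Write:
    tokens: List[int]  # the exact next tokens, fixed by the guide


@dataclass
class Generate:
    tokens: List[int]  # the tokens the LLM may choose from


def advance(instruction, sequence: List[int],
            sample: Callable[[List[int]], int]) -> List[int]:
    """Apply one guide instruction to the running token sequence."""
    if isinstance(instruction, Write):
        # Fast-forward: append all guide-determined tokens, no LLM call.
        return sequence + instruction.tokens
    # Generate: the LLM must pick exactly one of the allowed tokens.
    return sequence + [sample(instruction.tokens)]
```

So `advance(Write([7, 8, 9]), [1], sample)` appends all three tokens without touching the model, while `advance(Generate([5, 6]), [1], sample)` appends only the single token the model samples from the allowed set.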
Thank you @br3no ! Much appreciated!
@br3no are there any new developments for fast-forward / accelerate?
also curious about the state of things here.
@simon-mo @rlouf do you know the latest on this?
Describe the issue as clearly as possible:
See: https://github.com/outlines-dev/outlines/blob/d6a2b7908065d420456118723f69908c4094c1f8/outlines/integrations/vllm.py#L110
Here the `tokens` field of the next instruction is treated equally regardless of whether it is of type `Generate` or `Write`. If a `Write` instruction has a `tokens` field with length > 1, this means we will accept any of the next ff-tokens as the token in the next step. This is incorrect.

Steps/code to reproduce the bug:
Expected result:
Error message:
No response
Outlines/Python version information:
Context for the issue:
Bug was discussed in a call with @rlouf.