dottxt-ai / outlines

Structured Text Generation
https://dottxt-ai.github.io/outlines/
Apache License 2.0

Validation Error during pydantic validation for Llama3 GGUF #952

Closed polplop closed 5 months ago

polplop commented 5 months ago

Describe the issue as clearly as possible:

I'm currently attempting to summarize an article and classify its relevancy. This worked fine on outlines 0.0.36; however, upgrading to outlines 0.0.43 produces a validation error that did not occur before.

I have tried:

The model seems unable to generate valid JSON: an "Invalid control character at" error is raised during pydantic validation.
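For reference, this is the standard json module error for an unescaped control character (such as a raw newline) inside a JSON string value. A minimal standalone snippet, separate from my reproduction below and with made-up article text, that triggers the same message:

import json

# A raw newline inside a JSON string is an unescaped control character,
# which Python's JSON decoder rejects.
json.loads('{"relevant_summary":"\nVeriSilicon announced ..."}')
# json.decoder.JSONDecodeError: Invalid control character at: line 1 column 22 (char 21)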

Notes: running on Ubuntu 22.04.1 (kernel build #20~22.04.1-Ubuntu), an AWS instance with an A10G GPU, CUDA 12.1, llama_cpp_python==0.2.77, outlines==0.0.43.

Steps/code to reproduce the bug:

from outlines import models, generate
import llama_cpp

from pydantic import BaseModel

# 0.0.36
# model = models.llamacpp(
#                         "./models/bartowski_Meta-Llama-3-8B-Instruct-Q8_0.gguf",
#                         n_ctx=8000,
#                         n_gpu_layers=-1,  # to use GPU acceleration
#                         )

# 0.0.43
model = models.llamacpp("bartowski/Meta-Llama-3-8B-Instruct-GGUF",
                        "Meta-Llama-3-8B-Instruct-Q8_0.gguf",
                        tokenizer=llama_cpp.llama_tokenizer.LlamaHFTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
                        n_ctx=8000,
                        n_gpu_layers=-1,  # to use GPU acceleration
                        )

class User(BaseModel):
    name: str
    last_name: str
    id: int

class RelevantSummary(BaseModel):
    relevant_summary: str

generator = generate.json(model, RelevantSummary)

result = generator(
"""
<|start_header_id|>system<|end_header_id|>
<|eot_id|>
## OBJECTIVE
1. Write a detailed summary related to Product Announcements.
2. Output your answer in JSON

<|eot_id|>
<|start_header_id|>user<|end_header_id|>

## ARTICLE
VeriSilicon’s 2nd generation automotive ISP series IP passed ISO 26262 ASIL B and ASIL D certifications

Las Vegas, USA, January 8, 2024--VeriSilicon (688521.SH) today announced its Image Signal Processor (ISP) IP ISP8200-FS and ISP8200L-FS, designed for high-performance automotive applications, have been certified compliant with the ISO 26262 automotive functional safety standard, achieving ASIL B certification for random failures and ASIL D certification for systematic failures, respectively. The certifications were granted by ResilTech, a leading safety consultancy company. Building upon the 1st generation of ISO 26262 certified ISP IP, the ISP8200-FS series is updated with advanced ISP technologies and several crucial enhancements for automotive applications after multiple automotive customers’ engagements on the 1st generation version.

ISP8200-FS series automotive ISP IP delivers high pixel throughputs from 1.6Giga to 2Giga pixel per second under different process technologies, supports up to 8 real-time or 16 camera streams from DDR with low latency technology based on multi-camera scheduling mechanism, and supplements the raw pixel processing pipelines for efficient AI processing. In addition, ISP8200-FS has a built-in FLEXA AI interface to capture automotive related ROI objects from AI processor for pedestrians, vehicles, traffic lights and signs detecting and processing.

Since its launch, multiple global major automotive SoC vendors have adopted ISP8200-FS series IP in their products for in-cabin ADAS, the next generation autonomous driving, and unified autonomous driving applications.

“ISP plays a pivotal role in the realm of autonomous driving. To meet the rapidly evolving demands of this industry, VeriSilicon is dedicated to providing our automotive customers with cutting-edge capabilities through our functional safety certified IPs,” said Wei-Jin Dai, Executive VP and GM of IP Division of VeriSilicon. “With adoption by multiple customers worldwide, our certified ISP8200-FS and ISP8200L-FS are specifically designed to cater to both primary application processor and the sensor fusion SoC requirements, including image, radar, and LiDAR capabilities. Minimizing latency from sensing to action is crucial in automotive applications. VeriSilicon offers a comprehensive solution with its Glass-to-Glass intelligent pixel processing functional safety IPs.”

To explore our rich IP portfolios, we invite you to visit VeriSilicon’s booth at the Venetian Expo (Booth No.: Bassano 2701 & Bassano 2702) during the Consumer Electronics Show (CES) 2024, taking place from January 9 to January 12 in Las Vegas.

## SUMMARY
<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
"""
, max_tokens=5000)
print(result)

Expected result:

### Results from outlines==0.0.36
relevant_summary="Verisilicon's second generation automotive ISP series IP has passed ISO26262 ASIL B and ASIL D certifications. The ISP8200-FS and ISP8200L-FS IPs are designed for high-performance automotive applications, achieving ASIL B certification for random failures and ASIL D certification for systematic failures respectively. They deliver high pixel throughputs, support multiple camera streams with low latency, and have a built-in FLEXA AI interface. Multiple major automotive SoC vendors have adopted these IPs in their products for in-cabin ADAS, autonomous driving, and unified autonomous driving applications."

Error message:

$ python3 test_outlines.py 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Compiling FSM index for all state transitions:  76%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▍                                                      | 25/33 [00:03<00:01,  7.24it/s]
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/pydantic/main.py", line 1143, in parse_raw
    obj = parse.load_str_bytes(
  File "/usr/local/lib/python3.10/dist-packages/pydantic/deprecated/parse.py", line 49, in load_str_bytes
    return json_loads(b)  # type: ignore
  File "/usr/lib/python3.10/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.10/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.10/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Invalid control character at: line 1 column 22 (char 21)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/test_outlines.py", line 32, in <module>
    result = generator(
  File "/home/ubuntu/.local/lib/python3.10/site-packages/outlines/generate/api.py", line 511, in __call__
    return format(completions)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/outlines/generate/api.py", line 497, in format
    return self.format_sequence(sequences)
  File "/home/ubuntu/.local/lib/python3.10/site-packages/outlines/generate/json.py", line 50, in <lambda>
    generator.format_sequence = lambda x: schema_object.parse_raw(x)
  File "/usr/local/lib/python3.10/dist-packages/pydantic/main.py", line 1170, in parse_raw
    raise pydantic_core.ValidationError.from_exception_data(cls.__name__, [error])
pydantic_core._pydantic_core.ValidationError: 1 validation error for RelevantSummary
__root__
  Invalid control character at: line 1 column 22 (char 21) [type=value_error.jsondecode, input_value='{"relevant_summary":"\nV...ion SoC requirements."}', input_type=str]

Outlines/Python version information:

Version information

``` ubuntu@ip-:~$ python3 -c "from outlines import _version; print(_version.version)". 0.0.43 ubuntu@ip-:~$ python3 -c "import sys; print('Python', sys.version)". Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] pip freeze aiohttp==3.9.5 aiosignal==1.3.1 amqp==5.2.0 annotated-types==0.6.0 anyio==4.3.0 astroid==3.2.2 asttokens==2.4.1 async-timeout==4.0.3 attrs==23.2.0 Automat==22.10.0 awscli==1.32.108 Babel==2.15.0 backports.tarfile==1.1.1 bcrypt==3.2.0 billiard==4.2.0 black==24.4.2 blessed==1.20.0 blinker==1.8.2 boto3==1.34.108 botocore==1.34.108 build==1.2.1 celery==5.4.0 certifi==2024.2.2 cffi==1.16.0 chalice==1.31.0 chardet==4.0.0 charset-normalizer==3.3.2 click==8.1.7 click-didyoumean==0.3.1 click-plugins==1.1.1 click-repl==0.3.0 cloud-init==24.1.3 cloudpickle==3.0.0 cmake==3.29.3 colorama==0.4.6 command-not-found==0.3 configobj==5.0.6 constantly==23.10.4 cryptography==42.0.7 cssselect==1.2.0 dask==2024.5.1 datasets==2.19.1 dbus-python==1.2.18 decorator==5.1.1 defusedxml==0.7.1 devscripts===2.22.1ubuntu1 dill==0.3.8 diskcache==5.6.3 distlib==0.3.8 distro==1.7.0 distro-info==1.1+ubuntu0.2 dnspython==2.6.1 docutils==0.16 dparse==0.6.3 ec2-hibinit-agent==1.0.0 email_validator==2.1.1 exceptiongroup==1.2.1 executing==2.0.1 fastapi==0.111.0 fastapi-cli==0.0.4 filelock==3.14.0 Flask==3.0.3 frozenlist==1.4.1 fsspec==2024.3.1 gpg==1.16.0 greenlet==3.0.3 h11==0.14.0 hibagent==1.0.1 httpcore==1.0.5 httpie==3.2.2 httplib2==0.22.0 httptools==0.6.1 httpx==0.27.0 huggingface-hub==0.23.2 hyperlink==21.0.0 idna==3.7 importlib_metadata==7.1.0 incremental==22.10.0 iniconfig==2.0.0 inquirer==2.10.1 inquirerpy==0.3.4 interegular==0.3.3 ipython==8.24.0 isort==5.13.2 itemadapter==0.9.0 itemloaders==1.2.0 itsdangerous==2.2.0 jaraco.classes==3.4.0 jaraco.context==5.3.0 jaraco.functools==4.0.1 jedi==0.19.1 jeepney==0.8.0 Jinja2==3.1.4 jmespath==1.0.1 joblib==1.4.2 jsonpatch==1.32 jsonpointer==2.0 jsonschema==4.21.1 jsonschema-specifications==2023.12.1 keyring==25.2.1 kombu==5.3.7 lark==1.1.9 launchpadlib==1.10.16 lazr.restfulclient==0.14.4 lazr.uri==1.0.6 llama_cpp_python==0.2.77 llvmlite==0.42.0 lm-format-enforcer==0.10.1 locket==1.0.0 lxml==5.2.2 markdown-it-py==3.0.0 MarkupSafe==2.1.5 matplotlib-inline==0.1.7 mccabe==0.7.0 mdurl==0.1.2 more-itertools==10.2.0 mpmath==1.3.0 msgpack==1.0.8 multidict==6.0.5 multiprocess==0.70.16 mypy==1.10.0 mypy-extensions==1.0.0 nest-asyncio==1.6.0 netifaces==0.11.0 networkx==3.3 nh3==0.2.17 ninja==1.11.1.1 numba==0.59.1 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu12==12.1.105 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu12==12.1.0.106 nvidia-ml-py==12.550.52 nvidia-nccl-cu12==2.20.5 nvidia-nvjitlink-cu12==12.5.40 nvidia-nvtx-cu12==12.1.105 oauthlib==3.2.2 olefile==0.46 openai==1.31.1 orjson==3.10.3 outlines==0.0.43 packaging==21.3 pandas==2.2.2 parsel==1.9.1 parso==0.8.4 partd==1.4.2 pathspec==0.12.1 pbr==6.0.0 pexpect==4.9.0 pfzy==0.3.4 pillow==10.3.0 pip-tools==7.4.1 pipdeptree==2.20.0 pipenv==2023.12.1 pkginfo==1.10.0 platformdirs==4.2.2 pluggy==1.5.0 prometheus-fastapi-instrumentator==7.0.0 prometheus_client==0.20.0 prompt-toolkit==3.0.43 Protego==0.3.1 protobuf==5.27.0 psutil==5.9.8 psycopg2-binary==2.9.9 ptyprocess==0.7.0 pure-eval==0.2.2 py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==16.1.0 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1_modules==0.4.0 pycairo==1.26.0 pycountry==24.6.1 pycparser==2.22 
pydantic==2.7.1 pydantic_core==2.18.2 PyDispatcher==2.0.7 Pygments==2.18.0 PyGObject==3.42.1 PyHamcrest==2.0.2 PyJWT==2.8.0 pylint==3.2.1 pyOpenSSL==24.1.0 pyparsing==3.1.2 pyproject_hooks==1.1.0 pyrsistent==0.18.1 pyserial==3.5 PySocks==1.7.1 pytest==8.2.1 python-apt==2.4.0+ubuntu3 python-dateutil==2.9.0.post0 python-debian==0.1.43+ubuntu1.1 python-dotenv==1.0.1 python-editor==1.0.4 python-magic==0.4.24 python-multipart==0.0.9 pytz==2022.1 pyxdg==0.27 PyYAML==6.0.1 queuelib==1.7.0 ray==2.23.0 readchar==4.1.0 readme_renderer==43.0 redis==5.0.4 referencing==0.35.1 regex==2024.5.15 requests==2.31.0 requests-file==2.0.0 requests-toolbelt==1.0.0 rfc3986==2.0.0 rich==13.7.1 roman==3.3 rpds-py==0.18.1 rsa==4.7.2 ruamel.yaml==0.18.6 ruamel.yaml.clib==0.2.8 s3transfer==0.10.1 safetensors==0.4.3 scikit-learn==1.5.0 scipy==1.13.1 Scrapy==2.11.2 SecretStorage==3.3.3 sentence-transformers==3.0.0 sentencepiece==0.2.0 service-identity==24.1.0 shellingham==1.5.4 six==1.16.0 sniffio==1.3.1 sos==4.5.6 SQLAlchemy==2.0.30 ssh-import-id==5.11 stack-data==0.6.3 starlette==0.37.2 sympy==1.12.1 systemd-python==234 testresources==2.0.1 threadpoolctl==3.5.0 tiktoken==0.7.0 tldextract==5.1.2 tokenizers==0.19.1 tomli==2.0.1 tomlkit==0.12.5 toolz==0.12.1 torch==2.3.0 tqdm==4.66.4 traitlets==5.14.3 transformers==4.41.2 triton==2.3.0 twine==5.1.0 Twisted==24.3.0 typer==0.12.3 typing_extensions==4.11.0 tzdata==2024.1 ubuntu-pro-client==8001 ufw==0.36.1 ujson==5.10.0 unattended-upgrades==0.1 unidiff==0.5.5 urllib3==2.2.1 uvicorn==0.29.0 uvloop==0.19.0 vine==5.1.0 virtualenv==20.26.2 vllm-flash-attn==2.5.8.post2 w3lib==2.1.2 wadllib==1.3.6 watchfiles==0.21.0 wcwidth==0.2.13 websockets==12.0 Werkzeug==3.0.3 xdg==5 xformers==0.0.26.post1 xxhash==3.4.1 yapf==0.40.2 yarl==1.9.4 zipp==3.18.2 zope.interface==6.4 ```

Context for the issue:

I would like to improve the performance of my summarization and classification pipeline with the newer Llama 3 GGUF models. The current setup on the older outlines 0.0.36 also has some number-formatting issues.

No other issue has brought up problems with Llama 3 GGUFs, but all of the finetunes I have tried have the same issue. Either I'm doing something wrong or there is a significant Llama 3 GGUF issue that should be discussed. Thank you!

lapp0 commented 5 months ago

Investigating a solution.

Related: https://github.com/huggingface/transformers/issues/31030

Problem:

It appears the tokenizer represents token ID 198 differently between tokenizer.vocabulary() and tokenizer.decode():

>>> tokenizer.decode([198])
['\n']
>>> [(k, v) for k, v in tokenizer.vocabulary().items() if v == 198][0][0]
'Ċ'

This isn't the case for other tokens:

>>> tokenizer.decode([10])
['+']
>>> [(k, v) for k, v in tokenizer.vocabulary().items() if v == 10][0][0]
'+'
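(Aside, for context: 'Ċ' is the byte-level BPE surrogate that GPT-2-style vocabularies use for the newline byte 0x0A, which can be checked against the bytes_to_unicode table in transformers.)

    from transformers.models.gpt2.tokenization_gpt2 import bytes_to_unicode

    # bytes_to_unicode() maps raw byte values to the printable characters used in
    # byte-level BPE vocab entries; inverting it recovers the underlying bytes.
    byte_decoder = {char: byte for byte, char in bytes_to_unicode().items()}
    print(bytes(byte_decoder[c] for c in "Ċ"))  # b'\n'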

Inconsistent Tokens

    from transformers import AutoTokenizer
    from outlines.models.transformers import TransformerTokenizer

    tokenizer = TransformerTokenizer(
        AutoTokenizer.from_pretrained("failspy/Meta-Llama-3-8B-Instruct-abliterated-v3")
    )
    bad_tokens = []
    for vocab_token_str, token_id in tokenizer.vocabulary.items():
        decoded_token_str = tokenizer.decode([token_id])[0]
        if decoded_token_str != vocab_token_str:
            bad_tokens.append((decoded_token_str, vocab_token_str))

    if bad_tokens:
        bad_tok_output = '\n'.join(map(repr, bad_tokens))
        raise Exception(f"Found {len(bad_tokens)} bad tokens: {bad_tok_output}")

Found these inconsistent tokens:

E           Exception: Found 78029 bad tokens: (' ROOM', 'ĠROOM')
E           (' 않는', 'ĠìķĬëĬĶ')
E           (' Overse', 'ĠOverse')
E           (' slov', 'Ġslov')
E           ('�', 'æ¦')
E           (' Infragistics', 'ĠInfragistics')
E           ('�', 'çĻ')
E           (' DIFF', 'ĠDIFF')
E           (' 武', 'ĠæѦ')
E           (' eighth', 'Ġeighth')
...

I'm looking into whether we should be constructing a "true vocabulary" by decoding each token.
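A rough sketch of what that could look like (illustrative only; the helper name is hypothetical and this is not how outlines currently builds its vocabulary):

    from transformers import AutoTokenizer

    def build_decoded_vocabulary(hf_tokenizer) -> dict:
        # Decode every token ID individually so the vocabulary contains the strings
        # the model actually emits (e.g. '\n') rather than byte-level symbols ('Ċ').
        return {
            token_id: hf_tokenizer.decode([token_id])
            for token_id in hf_tokenizer.get_vocab().values()
        }

    hf_tokenizer = AutoTokenizer.from_pretrained("failspy/Meta-Llama-3-8B-Instruct-abliterated-v3")
    decoded_vocab = build_decoded_vocabulary(hf_tokenizer)
    print(repr(decoded_vocab[198]))  # '\n' instead of 'Ċ'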

Edit:

It appears we already have a method to normalize:

class TransformerTokenizer(Tokenizer):
    ...
    def convert_token_to_string(self, token: str) -> str:
        from transformers.file_utils import SPIECE_UNDERLINE

        string = self.tokenizer.convert_tokens_to_string([token])

Investigating the reason this failed to prevent a \n during generation.
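As a quick check of that path (a sketch, not output from the actual debugging session), one can compare what convert_token_to_string returns for the newline token against its decoded form:

    from transformers import AutoTokenizer
    from outlines.models.transformers import TransformerTokenizer

    tokenizer = TransformerTokenizer(
        AutoTokenizer.from_pretrained("failspy/Meta-Llama-3-8B-Instruct-abliterated-v3")
    )

    # The byte-level vocab entry for token ID 198 is 'Ċ'; if normalization works
    # as intended, converting it should yield a real newline.
    vocab_token = [k for k, v in tokenizer.vocabulary.items() if v == 198][0]
    print(repr(vocab_token))                                     # 'Ċ'
    print(repr(tokenizer.convert_token_to_string(vocab_token)))  # expected: '\n'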