DNGros / lmwrapper

An object-oriented wrapper around language models (like openai endpoints or huggingface)

Mistral fixes #28

Closed DNGros closed 8 months ago

DNGros commented 8 months ago

Progress towards #26

There are two issues identified in main.

  1. Mistral is ending up with two <s> beginning-of-sentence tokens. We redefine the behaviour to check whether we actually need to add a BOS token by default, detecting whether the tokenizer is already doing it for us (a rough sketch of this kind of check is shown after this list). This is already complete with f71c302a89.
  2. There are issues with not having proper whitespace in the first token. This is reproduced in the test test_hello_world_prompt. We note that self._tokenizer.convert_ids_to_tokens appears correct for Mistral, showing
    ['<s>', '<s>', '▁def', '▁hell', 'o', '_', 'world', '():', '<0x0A>', '▁▁▁', '▁"""', 'print', "▁'", 'hello', '▁world', "'", '"""', '<0x0A>', '▁▁▁', '▁print', '▁"', 'hello', '_', 'world', '"', ...]

    However, output_tokens = [self._tokenizer.decode(t) for t in output_sequence] cuts out the spaces:

    ['<s>', '<s>', 'def', 'hell', 'o', '_', 'world', '():', '\n', '  ', '"""', 'print', "'", 'hello', 'world', "'", '"""', '\n', '  ', 'print', '"', 'hello', '_', 'world', '"', ...]

    I am still working on how to fix this.
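
A rough sketch of the kind of BOS check described in point 1, assuming a HuggingFace AutoTokenizer (the helper name tokenizer_adds_bos is only illustrative, not the actual lmwrapper code):

```python
# Sketch: decide whether we need to prepend <s> ourselves, so we never
# end up with two BOS tokens.
from transformers import AutoTokenizer


def tokenizer_adds_bos(tokenizer) -> bool:
    """Return True if encoding plain text already yields a leading BOS id."""
    ids = tokenizer("hello", add_special_tokens=True)["input_ids"]
    return len(ids) > 0 and ids[0] == tokenizer.bos_token_id


tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
prompt = "def hello_world():"

if tokenizer_adds_bos(tokenizer):
    # The tokenizer handles BOS itself; do not prepend another <s>.
    input_ids = tokenizer(prompt)["input_ids"]
else:
    input_ids = [tokenizer.bos_token_id] + tokenizer(
        prompt, add_special_tokens=False
    )["input_ids"]
```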

DNGros commented 8 months ago

With 35d399742d19e90c the second point is fixed, but getting back the right tokens is still broken (and thus stop tokens are still broken too). With @claudiosv, we isolated the likely source of the bug as https://github.com/huggingface/transformers/blob/609a1767e8ba367350abf3c553d40b68607987e5/src/transformers/models/llama/tokenization_llama.py#L287
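
For reference, a small repro sketch of the symptom (assuming the Mistral tokenizer from the HuggingFace hub): decoding each id on its own drops the leading space, while convert_ids_to_tokens keeps the ▁ marker, which can be mapped back to a space by hand. This only illustrates the observed behaviour, not a proposed fix.

```python
# Repro sketch of the whitespace loss when decoding token ids one at a time.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
ids = tokenizer("def hello_world():", add_special_tokens=False)["input_ids"]

# Decoding ids individually: leading spaces disappear ('def', 'hell', ...).
per_id_decode = [tokenizer.decode(i) for i in ids]

# convert_ids_to_tokens keeps the SentencePiece space marker ('▁def', '▁hell', ...),
# which can be turned back into a real space manually.
raw_tokens = tokenizer.convert_ids_to_tokens(ids)
with_spaces = [t.replace("▁", " ") for t in raw_tokens]

print(per_id_decode)
print(with_spaces)
```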