DNGros / lmwrapper

An object-oriented wrapper around language models (like openai endpoints or huggingface)

Mistral fixes #28

Closed DNGros closed 8 months ago

DNGros commented 8 months ago

Progress towards #26

There are two issues identified in main.

  1. Mistral is ending up with two <s> beginning-of-sentence tokens. We redefine the behaviour to check whether we actually need to add a BOS token by default, detecting whether the tokenizer is already doing it for us (a rough sketch of this kind of check is shown after this list). This is already complete with f71c302a89.
  2. There are issues with not having proper whitespace in the first token. This is reproduced in the test test_hello_world_prompt. We note that self._tokenizer.convert_ids_to_tokens appears correct for Mistral, showing
    ['<s>', '<s>', '▁def', '▁hell', 'o', '_', 'world', '():', '<0x0A>', '▁▁▁', '▁"""', 'print', "▁'", 'hello', '▁world', "'", '"""', '<0x0A>', '▁▁▁', '▁print', '▁"', 'hello', '_', 'world', '"', ...]

    However, output_tokens = [self._tokenizer.decode(t) for t in output_sequence] cuts out the spaces:

    ['<s>', '<s>', 'def', 'hell', 'o', '_', 'world', '():', '\n', '  ', '"""', 'print', "'", 'hello', 'world', "'", '"""', '\n', '  ', 'print', '"', 'hello', '_', 'world', '"', ...]

    I am still working on how to fix this.
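
A rough sketch of the kind of BOS check described in point 1, assuming a HuggingFace AutoTokenizer (the helper name tokenizer_adds_bos is only illustrative, not the actual lmwrapper code):

```python
# Sketch: decide whether we need to prepend <s> ourselves, so we never
# end up with two BOS tokens.
from transformers import AutoTokenizer


def tokenizer_adds_bos(tokenizer) -> bool:
    """Return True if encoding plain text already yields a leading BOS id."""
    ids = tokenizer("hello", add_special_tokens=True)["input_ids"]
    return len(ids) > 0 and ids[0] == tokenizer.bos_token_id


tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
prompt = "def hello_world():"

if tokenizer_adds_bos(tokenizer):
    # The tokenizer handles BOS itself; do not prepend another <s>.
    input_ids = tokenizer(prompt)["input_ids"]
else:
    input_ids = [tokenizer.bos_token_id] + tokenizer(
        prompt, add_special_tokens=False
    )["input_ids"]
```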

DNGros commented 8 months ago

With 35d399742d19e90c the second point is fixed, but getting back the right tokens is still broken (and thus stop tokens are still broken too). With @claudiosv, we isolated the likely source of the bug as https://github.com/huggingface/transformers/blob/609a1767e8ba367350abf3c553d40b68607987e5/src/transformers/models/llama/tokenization_llama.py#L287
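
For reference, a small repro sketch of the symptom (assuming the Mistral tokenizer from the HuggingFace hub): decoding each id on its own drops the leading space, while convert_ids_to_tokens keeps the ▁ marker, which can be mapped back to a space by hand. This only illustrates the observed behaviour, not a proposed fix.

```python
# Repro sketch of the whitespace loss when decoding token ids one at a time.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
ids = tokenizer("def hello_world():", add_special_tokens=False)["input_ids"]

# Decoding ids individually: leading spaces disappear ('def', 'hell', ...).
per_id_decode = [tokenizer.decode(i) for i in ids]

# convert_ids_to_tokens keeps the SentencePiece space marker ('▁def', '▁hell', ...),
# which can be turned back into a real space manually.
raw_tokens = tokenizer.convert_ids_to_tokens(ids)
with_spaces = [t.replace("▁", " ") for t in raw_tokens]

print(per_id_decode)
print(with_spaces)
```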