With 35d399742d19e90c the 2nd point is fixed, but getting the right token is still broken (and thus stop tokens are still broken). With @claudiosv we isolated the likely source of the bug to https://github.com/huggingface/transformers/blob/609a1767e8ba367350abf3c553d40b68607987e5/src/transformers/models/llama/tokenization_llama.py#L287
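A minimal repro of the symptom (a sketch, assuming a SentencePiece-based checkpoint such as `mistralai/Mistral-7B-v0.1`; the exact token strings depend on the model and transformers version):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

ids = tok.encode("Hello world", add_special_tokens=False)
print(tok.convert_ids_to_tokens(ids))  # e.g. ['▁Hello', '▁world'] -- space markers intact
print([tok.decode(t) for t in ids])    # e.g. ['Hello', 'world'] -- per-token decode drops the spaces
print(tok.decode(ids))                 # e.g. 'Hello world' -- full-sequence decode is fine
```

The linked line is in the Llama tokenizer's `convert_tokens_to_string`, which in that revision appears to strip the leading `▁` space marker from the first token, so decoding one token at a time loses every leading space.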
Progress towards #26
There are two issues identified on `main`:

1. Extra `<s>` beginning-of-sentence tokens. We redefine the behaviour to check whether we actually need to add a BOS by default, or to detect whether the tokenizer is doing it for us. This is already complete with f71c302a89 (a sketch of this kind of check is after this list).
2. `test_hello_world_prompt`: we note that `self._tokenizer.convert_ids_to_tokens` appears correct for Mistral, but `output_tokens = [self._tokenizer.decode(t) for t in output_sequence]` cuts out the spaces. I am still working on how to fix this (one possible workaround is sketched after this list).
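For point 1, a minimal sketch of the kind of BOS detection involved (the helper name `tokenizer_adds_bos` is hypothetical, not necessarily what f71c302a89 implements):

```python
def tokenizer_adds_bos(tokenizer) -> bool:
    """Heuristic: encode a probe string and check whether the tokenizer
    already prepends its BOS id, so we avoid adding a second <s>."""
    ids = tokenizer("probe").input_ids
    return (
        tokenizer.bos_token_id is not None
        and len(ids) > 0
        and ids[0] == tokenizer.bos_token_id
    )

# Usage: only prepend <s> ourselves when the tokenizer does not.
# needs_manual_bos = not tokenizer_adds_bos(tok)
```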
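For point 2, one possible workaround (a sketch, not necessarily what this PR lands on; `decode_with_spaces` is a hypothetical helper): decode growing prefixes of the output and take the per-step text difference, so the spaces that full-sequence decoding preserves get attributed to the right token:

```python
def decode_with_spaces(tokenizer, output_sequence):
    """Per-token decode via prefix differences, preserving the leading
    spaces that tokenizer.decode(single_id) strips."""
    pieces, prev = [], ""
    for i in range(len(output_sequence)):
        text = tokenizer.decode(output_sequence[: i + 1])
        pieces.append(text[len(prev):])
        prev = text
    return pieces
```

This sidesteps the first-token special-casing linked above, at the cost of quadratic decoding; re-decoding only a short window of trailing tokens is the usual refinement for long sequences.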