dvmazur / mixtral-offloading

Run Mixtral-8x7B models in Colab or consumer desktops
MIT License

Mixtral Instruct tokenizer from Colab notebook doesn't work. #38

Open jmuntaner-smd opened 2 weeks ago

jmuntaner-smd commented 2 weeks ago

When running the Google Colab notebook, an error occurs while loading the Mixtral Instruct tokenizer:

/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_fast.py in __init__(self, *args, **kwargs)
    109         elif fast_tokenizer_file is not None and not from_slow:
    110             # We have a serialization from tokenizers which let us directly build the backend
--> 111             fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
    112         elif slow_tokenizer is not None:
    113             # We need to convert a slow tokenizer to build the backend

Exception: data did not match any variant of untagged enum PyPreTokenizerTypeWrapper at line 40 column 3

This appears to be a version mismatch between the transformers and tokenizers packages (see https://github.com/huggingface/transformers/issues/31789), so requirements.txt probably needs to be updated, but I haven't been able to fix it properly. I changed the tokenizer to the base Mixtral model, but that's not a proper solution.
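For reference, a minimal sketch of two possible workarounds, assuming (as the linked issue suggests) that the error comes from deserializing a tokenizer.json written by a newer tokenizers release than the one installed; neither is a confirmed fix for this repo:

```python
from transformers import AutoTokenizer

model_name = "mistralai/Mixtral-8x7B-Instruct-v0.1"

# Workaround 1: upgrade the packages so the fast backend can parse the
# newer tokenizer.json, e.g. `pip install -U transformers tokenizers`.

# Workaround 2: skip the fast (Rust) backend entirely. use_fast=False
# loads the SentencePiece tokenizer and never calls
# TokenizerFast.from_file, which is where the untagged-enum error is raised.
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
```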

kaushikacharya commented 2 weeks ago

> I changed the tokenizer to the base Mixtral model, but it's not the proper solution.

What tokenizer version are you using? I am also facing a similar issue.

The issue seems to be due to recent commits in https://huggingface.co/mistralai/Mixtral-8x7B-Instruct-v0.1/commits/main
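If those commits are indeed the trigger, pinning the tokenizer files to an earlier revision might sidestep the problem; a sketch, where the commit hash is a hypothetical placeholder to be replaced with one picked from the commit history linked above:

```python
from transformers import AutoTokenizer

# Pin the download to a commit that predates the breaking change.
# "<commit-sha>" is a placeholder, not a verified hash.
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1",
    revision="<commit-sha>",
)
```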

jmuntaner-smd commented 2 weeks ago

I just changed the Google Colab line to this: `tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")`
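A caveat worth checking before relying on that swap: it assumes the base and Instruct repos share the same vocabulary, and the base tokenizer may not ship the Instruct chat template. A quick sanity-check sketch, comparing against the slow Instruct tokenizer (since the fast one fails to load):

```python
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
instruct = AutoTokenizer.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", use_fast=False
)

# If this holds, token ids are interchangeable between the two tokenizers;
# the chat template used by apply_chat_template may still differ.
assert base.get_vocab() == instruct.get_vocab(), "vocabularies differ"
```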