[Bug]: Use huggingface tokenizer for mistral token counting #2754

Open · krrishdholakia opened this issue 3 months ago

krrishdholakia commented 3 months ago

What happened?

Request to litellm: litellm.completion(model='bedrock/mistral.mixtral-8x7b-instruct-v0:1', messages=[{'role': 'user', 'content': 'Are you here? Answer "Yes."'}], max_tokens=3, stream=True)

cc: @GlavitsBalazs

GlavitsBalazs commented 3 months ago

At the risk of greatly changing the scope of this bug report, I'd like to mention that the _select_tokenizer method only selects the correct tokenizer for a few hard-coded model families (namely Cohere, Anthropic, LLaMA2, and OpenAI); for every other model, including Mistral, it incorrectly defaults to tiktoken.
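For context, a minimal way to see the mismatch (assuming litellm.token_counter, which routes through _select_tokenizer, is the entry point callers hit):

```python
import litellm

# For a Bedrock-hosted Mistral model, token counting currently falls back to
# tiktoken (the generic default) rather than the Mistral/Mixtral tokenizer,
# so the count can drift from what the provider actually tokenizes.
count = litellm.token_counter(
    model="bedrock/mistral.mixtral-8x7b-instruct-v0:1",
    messages=[{"role": "user", "content": 'Are you here? Answer "Yes."'}],
)
print(count)
```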

krrishdholakia commented 3 months ago

Open to suggestions - how can we improve this, @GlavitsBalazs?

GlavitsBalazs commented 3 months ago

As a fallback, I would add a Tokenizer.from_pretrained call wrapped in a try/except to _select_tokenizer: check whether the model is available on the Hugging Face Hub, and use the HF tokenizer if nothing better is available. This would work for huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1 or togetherai/mistralai/Mixtral-8x7B-Instruct-v0.1, for example.
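A minimal sketch of that fallback (the function name and the prefix-stripping are illustrative, not litellm's actual internals; the returned dict roughly mirrors the {"type": ..., "tokenizer": ...} shape _select_tokenizer uses today):

```python
from tokenizers import Tokenizer
import tiktoken

def select_tokenizer_with_hf_fallback(model: str):
    """Try the Hugging Face Hub tokenizer for `model`, else keep the tiktoken default."""
    # Strip a provider prefix such as "huggingface/" or "togetherai/" to get a Hub repo id.
    hf_repo_id = model.split("/", 1)[1] if "/" in model else model
    try:
        # Fetches the tokenizer from the Hub on first use (needs network access).
        tokenizer = Tokenizer.from_pretrained(hf_repo_id)
        return {"type": "huggingface_tokenizer", "tokenizer": tokenizer}
    except Exception:
        # Nothing better available - fall back to the current tiktoken default.
        return {"type": "openai_tokenizer", "tokenizer": tiktoken.get_encoding("cl100k_base")}
```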

For models such as bedrock/mistral.mixtral-8x7b-instruct-v0:1 there is no way to automatically deduce the correct tokenizer, so I would set litellm.model_cost["bedrock/mistral.mixtral-8x7b-instruct-v0:1"]["huggingface_tokenizer"] = "mistralai/Mixtral-8x7B-Instruct-v0.1" via an entry in model_prices_and_context_window.json. The same could be done for many other models (Cohere, Mistral, LLaMA, etc.). Other options for the key name "huggingface_tokenizer" could be "huggingface_tokenizer_name", "huggingface_tokenizer_repo", or "huggingface_tokenizer_model_name_or_path".

Then _select_tokenizer would call litellm.get_model_info, and if the info contains a "huggingface_tokenizer" entry, we can fetch that tokenizer and be sure it's correct. Users could even customize this via litellm.register_model.
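From the user side, that customization could look something like this ("huggingface_tokenizer" is the proposed field, not one that exists today; this also assumes register_model merges the extra key into the existing entry and that get_model_info passes custom keys through):

```python
import litellm

# Proposed (hypothetical) field: map a provider-specific model id to the HF repo
# whose tokenizer should be used for token counting.
litellm.register_model({
    "bedrock/mistral.mixtral-8x7b-instruct-v0:1": {
        "huggingface_tokenizer": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    }
})

# _select_tokenizer could then look the field up and load exactly that tokenizer.
info = litellm.get_model_info("bedrock/mistral.mixtral-8x7b-instruct-v0:1")
repo = info.get("huggingface_tokenizer")  # "mistralai/Mixtral-8x7B-Instruct-v0.1"
```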

Finally, I would add a functools.lru_cache decorator to _select_tokenizer, so that we don't load the tokenizer from disk or make network requests every time someone wants to tokenize something.
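A sketch of the caching piece, wrapping the illustrative function from the earlier sketch (in litellm itself this would decorate _select_tokenizer):

```python
from functools import lru_cache

# Cache keyed on the model name, so repeated token-counting calls don't reload
# the tokenizer from disk or re-hit the Hugging Face Hub.
cached_select_tokenizer = lru_cache(maxsize=128)(select_tokenizer_with_hf_fallback)

tok = cached_select_tokenizer("huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1")
tok_again = cached_select_tokenizer("huggingface/mistralai/Mixtral-8x7B-Instruct-v0.1")  # cache hit
```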