cohere-ai / cohere-python

Python Library for Accessing the Cohere API
https://docs.cohere.ai
MIT License

Offline tokenization produces empty token_strings #493

Open yifanmai opened 2 months ago

yifanmai commented 2 months ago

When I run the example script from this doc page: https://docs.cohere.com/reference/tokenize

response = co.tokenize(text="tokenize me! :D", model="command")

I get:

tokens=[10002, 2261, 2012, 8, 2792, 43] token_strings=[] meta=None

where token_strings is an empty array, even though the docs suggest that it should be non-empty. However, if I run:

response = co.tokenize(text="tokenize me! :D", model="command", offline=False)

I get the token_strings as expected:

tokens=[10002, 2261, 2012, 8, 2792, 43] token_strings=['token', 'ize', ' me', '!', ' :', 'D'] meta=ApiMeta(api_version=ApiMetaApiVersion(version='1', is_deprecated=None, is_experimental=None), billed_units=None, tokens=None, warnings=None)

It would be nice if token_strings could be supported for offline tokenization, so that the online and offline behavior is identical. I'll attach a pull request for how this could be done.

elaineg commented 2 months ago

Hi, thanks for catching the discrepancy in the documentation. We have updated https://docs.cohere.com/docs/tokens-and-tokenizers#tokenization-in-python-sdk and the release note https://docs.cohere.com/changelog/python-sdk-v520-release.

Do you use the token_strings? I wonder if it would be acceptable to remove them from the network call to achieve identical behaviour.

yifanmai commented 2 months ago

Yes, removing token_strings from the network call would also make things more uniform.

I have a use case that uses token_strings. However, I can work around this issue by getting the token strings from the Hugging Face tokenizers library directly, using the downloaded tokenizer.json files.
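
A minimal sketch of that workaround, assuming the `tokenizers` package is installed and a tokenizer.json for the model has already been downloaded. The file path is a placeholder, and `load_tokenizer` and `token_strings_for` are illustrative helper names, not part of the Cohere SDK. Each ID is decoded on its own, so a token that covers only part of a multi-byte character may not decode cleanly:

```python
from typing import List, Sequence


def load_tokenizer(path: str):
    """Load a Hugging Face tokenizer from a local tokenizer.json file."""
    # Imported lazily so token_strings_for() works with any object
    # that exposes a decode(ids) method.
    from tokenizers import Tokenizer  # pip install tokenizers

    return Tokenizer.from_file(path)


def token_strings_for(tokenizer, token_ids: Sequence[int]) -> List[str]:
    """Decode each token ID separately to get one string per token."""
    return [tokenizer.decode([token_id]) for token_id in token_ids]


def main() -> None:
    # Placeholder path: point this at the downloaded tokenizer.json.
    tokenizer = load_tokenizer("tokenizer.json")
    ids = tokenizer.encode("tokenize me! :D", add_special_tokens=False).ids
    print(ids)
    print(token_strings_for(tokenizer, ids))
```

Calling `main()` prints the token IDs followed by the per-token strings. Whether they match the server's token_strings exactly depends on the tokenizer file in use.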

yifanmai commented 2 months ago

Another alternative would be to add a parameter that controls whether token_strings are returned (in both the library and the server API).
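
A rough sketch of what such a flag could look like on the library side. Everything here is hypothetical: `tokenize`, `TokenizeResult`, `return_token_strings`, and the `encode` / `decode_one` callables are made-up names for illustration, not the Cohere SDK or server API:

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class TokenizeResult:
    tokens: List[int]
    token_strings: List[str] = field(default_factory=list)


def tokenize(
    text: str,
    encode: Callable[[str], List[int]],
    decode_one: Callable[[int], str],
    return_token_strings: bool = False,
) -> TokenizeResult:
    """Hypothetical wrapper: token strings are computed only on request."""
    tokens = encode(text)
    # Skip the per-token decoding work unless the caller asks for it.
    strings = [decode_one(t) for t in tokens] if return_token_strings else []
    return TokenizeResult(tokens=tokens, token_strings=strings)
```

With the flag defaulting to off, the online and offline paths would return the same shape, and callers who need token_strings would opt in explicitly.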