Open yifanmai opened 2 months ago
Hi, thanks for catching the discrepancies in the documentation. We have updated https://docs.cohere.com/docs/tokens-and-tokenizers#tokenization-in-python-sdk and the release note https://docs.cohere.com/changelog/python-sdk-v520-release.
Do you use the token_strings? I wonder if it would be acceptable to remove them from the network call to achieve identical behaviour.
Yes, removing token_strings from the network call would also make things more uniform.
I have a use case that uses token_strings. However, I can work around this issue: I can get the token strings by using the Hugging Face tokenizers library directly with the downloaded tokenizer.json files.
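For reference, a minimal sketch of that workaround, not the SDK's own implementation. It assumes the tokenizers package is installed and that a tokenizer.json file has already been downloaded. The function names and the path argument are hypothetical. Decoding each ID on its own is one way to get a string per token, and the result may differ from what the server returns (for example, byte-level tokens that are only part of a multi-byte character will not decode cleanly in isolation, and special tokens are skipped by default).

```python
from typing import List, Tuple


def token_strings_for_ids(tokenizer, token_ids: List[int]) -> List[str]:
    """Decode each token ID separately so one string comes back per token.

    `tokenizer` is anything with a `decode(list_of_ids) -> str` method,
    such as a `tokenizers.Tokenizer`.
    """
    return [tokenizer.decode([token_id]) for token_id in token_ids]


def tokenize_offline(tokenizer_json_path: str, text: str) -> Tuple[List[int], List[str]]:
    """Return (token IDs, token strings) using a local tokenizer.json file."""
    # Imported lazily so the helper above works without the dependency.
    from tokenizers import Tokenizer  # pip install tokenizers

    tokenizer = Tokenizer.from_file(tokenizer_json_path)
    # Whether special tokens should be added is an assumption here;
    # adjust it to match the behavior you are comparing against.
    encoding = tokenizer.encode(text, add_special_tokens=False)
    return encoding.ids, token_strings_for_ids(tokenizer, encoding.ids)
```

Usage would look like `ids, strings = tokenize_offline("tokenizer.json", "tokenize me")`, where the path points at whichever tokenizer.json was downloaded for the model.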
Another alternative would be to add a parameter that controls whether token_strings are returned (in both the library and the server API).
When I run the script on this doc: https://docs.cohere.com/reference/tokenize
I get:
where token_strings is an empty array, even though the docs suggest that it should be non-empty. However, if I run:
I get the token_strings as expected:
It would be nice if token_strings could be supported for offline tokenization, so that the online and offline behavior is identical. I'll attach a pull request showing how this could be done.