-
Is tokenization_rwkv5.py equivalent to tokenization_rwkv_world.py from https://huggingface.co/RWKV/v5-Eagle-7B-HF/tree/main? I saw that WordpieceTokenizer from tokenization_rwkv5.py uses whitespace_to…
-
I'm working with a corpus that primarily consists of longer documents. I'm seeking recommendations for the most effective approach to semantically tokenize them.
Examples:
```
Original Text: "I…
-
### Clear and concise description of the problem
Allow `codeToTokens` to return the grammar state after tokenization.
### Suggested solution
This could be beneficial for tokenization in the editor I'm cu…
-
The output of the tokenization code is too long. We don't have to show the full output. @ananyaanand0501
-
Hi,
I want to use both Chinese search and vector search with BM25. How can I set the tokenization properties?
It doesn't seem to work when I set tokenization: "word"
-
Hello! I found your work to be exceptionally insightful and engaging.
I noticed that there are three pkl files in your project, namely char_voc.pkl, code_voc.pkl and nl_voc.pkl, so which file is used fo…
-
What are people's thoughts on adding preprocessing scripts to allow BPE-like tokenization of characters? Technically we already support this (just tokenize your input and use a delineation function). Bu…
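For context, here is a rough sketch of what such a BPE-like preprocessing step does: start from individual characters and greedily merge the most frequent adjacent pair. This is a generic illustration, not this project's actual delineation function; all names below are hypothetical.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Count adjacent symbol pairs and return the most frequent one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def bpe(text, num_merges):
    """Character-level tokens, then `num_merges` greedy pair merges."""
    tokens = list(text)
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
    return tokens
```

For example, `bpe("abababc", 2)` first merges the most frequent pair `("a", "b")` into `"ab"`, then merges `("ab", "ab")`, yielding `["abab", "ab", "c"]`.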
-
**Is your feature request related to a problem? Please describe.**
For generative models, many are limited by a maximum number of tokens. In some workflows, the prompts are generated dynamically t…
-
When I run the script on this doc: https://docs.cohere.com/reference/tokenize
```
response = co.tokenize(text="tokenize me! :D", model="command")
```
I get:
```
tokens=[10002, 2261, 2012, …
-
### Feature request
On the `/tokenize` endpoint of TGI, add an option to apply the chat template from the model's tokenizer, if one exists, before tokenizing.
### Motivation
The `/tokenize` e…