-
The new tokenization rules in Appendix A.3 are, I believe, a great improvement on what went before. But I think one thing is missing: the rules are claimed to allow you to identify boundarie…
-
Just a reminder to self to revisit the `yield from` tokenization and add more extensive tests for it.
Things to take a critical look at, checking whether and how they are tokenized correctly…
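For illustration only, a few `yield from` forms that such tests could cover, dumped through Python's stdlib `tokenize` module as a stand-in for the project's own tokenizer (the sample inputs are mine, not the project's test corpus):
```python
# Sketch: inspect how Python's stdlib tokenizer sees a few `yield from`
# forms; sample inputs are illustrative, not the project's test cases.
import io
import tokenize

samples = [
    "def g():\n    yield from range(3)\n",
    "def g():\n    x = yield from inner()\n",
    "def g():\n    yield from (a for a in src)\n",
]
for src in samples:
    print(src.splitlines()[-1].strip())
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        if tok.string in ("yield", "from"):
            print("  ", tokenize.tok_name[tok.type], tok.string)
```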
-
I tried finetuning my model after stage 1. Apparently there are tokenization mismatches, and the loss is 0.
Do you have any idea what the problem might be?
Thanks!
sh finetune_full.sh
```
WARNIN…
```
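For context, a common cause of these warnings is that the conversation tokenized as a whole differs in length from the sum of its per-turn tokenizations (BPE can merge differently across turn boundaries), so the label mask can swallow every target token and the loss collapses to 0. A minimal sketch of that consistency check, assuming the `transformers` library and using GPT-2 only as a stand-in tokenizer:
```python
# Minimal sketch: compare the conversation tokenized as one string against
# the sum of per-turn lengths, the way label-masking code implicitly does.
# Assumes `transformers`; GPT-2 is a stand-in tokenizer for illustration.
from transformers import AutoTokenizer

def check_tokenization(tokenizer, turns):
    whole = tokenizer("".join(turns), add_special_tokens=False).input_ids
    parts = sum(
        len(tokenizer(t, add_special_tokens=False).input_ids) for t in turns
    )
    if len(whole) != parts:
        print(f"WARNING: tokenization mismatch: {len(whole)} vs. {parts}")
    return len(whole) == parts

tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in checkpoint
check_tokenization(tok, ["USER: hi\n", "ASSISTANT: hello\n"])
```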
-
We should have a batch endpoint for tokenization that we can use here:
https://github.com/dust-tt/dust/blob/656576bff4220b84a68099c5e926ec42949c83e6/front/lib/api/assistant/generation.ts#L172-L194
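Sketched in Python for illustration only (the route and JSON field names are hypothetical, not dust's actual API), the shape such a batch call could take, replacing N single-text requests with one:
```python
# Hypothetical client for a batch tokenization endpoint; the route and
# field names are invented for illustration.
import requests

def tokenize_batch(base_url: str, texts: list[str]) -> list[list[int]]:
    resp = requests.post(f"{base_url}/tokenize/batch", json={"texts": texts})
    resp.raise_for_status()
    return [r["tokens"] for r in resp.json()["results"]]
```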
-
While running this code (based on [pretokenization code](https://github.com/databio/scripts/blob/master/model-training/region2vec-encode/pretokenize.py)):
```python
import os
import multiprocessing
…
```
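The snippet is cut off above; for orientation, a minimal sketch of the pool-based pattern that pretokenization scripts like the linked one commonly follow (the worker here is a whitespace stand-in, not the real region tokenizer):
```python
import multiprocessing

def pretokenize(doc: str) -> list[str]:
    # Stand-in worker: whitespace splitting in place of the real tokenizer.
    return doc.split()

if __name__ == "__main__":
    docs = ["chr1 100 200", "chr2 300 400"]  # placeholder inputs
    with multiprocessing.Pool(processes=2) as pool:
        print(pool.map(pretokenize, docs))
```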
-
Currently, tok2spans.iob2spans accepts parallel lists of tokens and IOB-style labels. Since there is no single text, it constructs that text by concatenating the tokens with a single space as a delimiter…
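A simplified sketch of that behavior (not the actual tok2spans implementation), with character offsets computed over the space-joined text:
```python
def iob2spans(tokens: list[str], labels: list[str]):
    """Return (start, end, type) character spans, end exclusive, over the
    text " ".join(tokens). Simplified sketch, not tok2spans.iob2spans."""
    spans, offset = [], 0
    start = ent_type = None
    for tok, lab in zip(tokens, labels):
        begins = lab.startswith("B-") or (lab.startswith("I-") and start is None)
        if begins or lab == "O":
            if start is not None:          # close any open entity span
                spans.append((start, offset - 1, ent_type))
                start = ent_type = None
        if begins:
            start, ent_type = offset, lab[2:]
        offset += len(tok) + 1             # +1 for the single-space delimiter
    if start is not None:
        spans.append((start, offset - 1, ent_type))
    return spans

tokens = ["Alice", "Smith", "works", "at", "Acme"]
labels = ["B-PER", "I-PER", "O", "O", "B-ORG"]
print(iob2spans(tokens, labels))  # [(0, 11, 'PER'), (21, 25, 'ORG')]
```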
-
When I trained llava-llama3 using your code, the log printed tokenization mismatches as shown below.
How can I fix this?
Thanks!
WARNING: tokenization mismatch: 55 vs. 54. (ignored)
WARNING: tokenization m…
-
### System Info
Sample Docker Compose file:
```yaml
embedding:
  image: ghcr.io/huggingface/text-embeddings-inference:cpu-1.0
  platform: linux/amd64
  volumes:
    - embed_data:/data
…
```
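Once that container is up, a quick way to exercise it is text-embeddings-inference's `/embed` route. This assumes a `ports:` mapping to localhost:8080, which the truncated file above may or may not include:
```python
# Minimal smoke test for the embedding service above. Assumes the compose
# file maps the container to localhost:8080; adjust to your port mapping.
import requests

resp = requests.post(
    "http://localhost:8080/embed",
    json={"inputs": "What is Deep Learning?"},
)
resp.raise_for_status()
print(resp.json())  # a list of embedding vectors
```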
-
Currently, the tokenization method for processing text defaults to the `RecursiveTextSplitter`; this should instead be given as a parameter that depends on the type of document uploaded.
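A minimal sketch of the proposed shape (the registry and splitter functions below are placeholders, not the project's actual classes):
```python
# Sketch: select the splitter from a parameter keyed on document type
# instead of hard-coding one. The splitter functions are placeholders;
# the real registry would point at the existing splitter classes,
# RecursiveTextSplitter among them.
from typing import Callable

def split_paragraphs(text: str) -> list[str]:
    return [p for p in text.split("\n\n") if p.strip()]

def split_lines(text: str) -> list[str]:
    return [l for l in text.splitlines() if l.strip()]

SPLITTERS: dict[str, Callable[[str], list[str]]] = {
    "pdf": split_paragraphs,
    "markdown": split_lines,
}

def split_document(text: str, doc_type: str) -> list[str]:
    # Fall back to the current default when the type is unknown.
    splitter = SPLITTERS.get(doc_type, split_paragraphs)
    return splitter(text)

print(split_document("first\n\nsecond", "pdf"))  # ['first', 'second']
```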