-
४,३२,००० gets tokenized as ४ , ३२ , ०००. This should not happen.
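A minimal sketch (not the project's actual tokenizer) of a pattern that keeps comma-grouped digit runs together as one token; note that in Python, `\d` matches any Unicode decimal digit, Devanagari ०–९ included:

```python
import re

# Hypothetical illustration: treat a comma-grouped number such as
# ४,३२,००० as a single token instead of splitting at each comma.
# \d matches any Unicode decimal digit in Python, including Devanagari.
NUMBER = re.compile(r"\d+(?:,\d+)*")

print(NUMBER.findall("४,३२,०००"))  # one token, not five
```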
-
1. I have searched the related issues but could not find the help I expected.
**Describe the bug**
Running the default workflow pops up the following error:
The relevant directory already contains the file in question, but it still reports that it cannot be found.
!!! Exception during processing!!! Unable to load vocabulary from file. Please check that the provided vocabulary is…
-
1. `12.6` should not split
2. `22थी` should split
3. `थी22` should split
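A hedged sketch of a pattern matching these expectations (hypothetical, not the library's actual rule set): a number with an optional decimal part is one token, and any other run of non-digit, non-space characters is another, so decimals stay whole while digit/letter boundaries split:

```python
import re

# Hypothetical tokenizer rule: keep "12.6" intact, but split at the
# boundary between digits and letters, as in "22थी" and "थी22".
TOKEN = re.compile(r"\d+(?:\.\d+)?|[^\d\s]+")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("12.6"))   # stays one token
print(tokenize("22थी"))   # digits, then letters
print(tokenize("थी22"))   # letters, then digits
```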
-
Sphinx and Manticore have never offered per-field tokenization settings (apart from `morphology_skip_fields` and `infix/prefix_fields`), and it seems that there hasn't been much con…
-
бассейну реки -0.12846 -0.0077064 0.049087 -0.059458 ...
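If the point is that an embedding key may itself contain a space (as in the sample line above, where "бассейну реки" is a two-word key), one way to parse such a line is to consume the floats from the right and treat the remainder as the key. A sketch, assuming a plain-text word2vec-style format:

```python
def parse_line(line):
    """Split an embeddings line into (key, vector), allowing the key
    to contain spaces: floats are consumed from the right, the rest
    of the line is the key."""
    parts = line.split()
    vec = []
    while parts:
        try:
            value = float(parts[-1])
        except ValueError:
            break  # reached the (possibly multi-word) key
        vec.insert(0, value)
        parts.pop()
    return " ".join(parts), vec

key, vec = parse_line("бассейну реки -0.12846 -0.0077064 0.049087 -0.059458")
print(key)      # "бассейну реки"
print(len(vec)) # 4
```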
-
### Preliminary Remark
The observations presented here are also relevant for the _polmineR_ repository.
### Some Background
The _Bundestag Protokolle_ often employ spacing to enhance readability …
-
When the original text is missing a space after the end of a sentence, the last word of the previous sentence and the first word of the next sentence are treated as a single token.
Equally wrong is when two …
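For the first case, a minimal normalization sketch (hypothetical, not the project's code) that restores the missing space after sentence-final punctuation whenever an uppercase letter follows immediately:

```python
import re

# Hypothetical fix-up: insert a space after ".", "!" or "?" when the
# next character is an uppercase letter, so adjacent sentences are no
# longer glued together into one token.
def restore_spaces(text):
    return re.sub(r"([.!?])(?=[A-Z])", r"\1 ", text)

print(restore_spaces("End of sentence.Next one starts."))
```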
-
### System Info
Hello developer,
The Llama-3 model was released today.
I want to convert this model to a Hugging Face model, but when I follow the README, the following issue occurs.
` File "/workspace/…
-
Hello,
I believe the corpus and the `word_freqs` output used in the [BPE](https://github.com/huggingface/course/blob/main/chapters/en/chapter6/5.mdx#implementing-bpe) / [WordPiece](https://github.c…
-
I'm hoping that we can get to the point where we fully support the following languages.
- English
- Spanish
- German
- French
- Russian
- Japanese
- Hindi
- Farsi
- Chinese
- Arabic
I s…