-
- [ ] What sort of tokenization will be done?
- [ ] Scripts/tutorials that can do the tokenization?
- [ ] Modified/newly created tokenization script to feed into the rest of the pipeline
-
In the ED IIIb data from Girsu, the tokenization is not consistent. Examples include:
* udu nita (P221436) vs. udu-nita (P010556)
* ugula ki-siki-ka (P221485) vs. ugula ki siki-ka (P221319)
* ziz…
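A minimal sketch of how such pairs could be surfaced automatically, assuming the transliterated forms are available as plain strings. The names `variant_key` and `group_variants` are hypothetical, and the approach (comparing forms after stripping hyphens and spaces) is only a detection aid; it does not decide which tokenization is correct.

```python
import re
from collections import defaultdict

# Hyphens and whitespace are the separators that vary between the examples above.
_SEPARATORS = re.compile(r"[\s\-]+")

def variant_key(form):
    """Return the form with all hyphens and whitespace removed."""
    return _SEPARATORS.sub("", form)

def group_variants(forms):
    """Group forms that differ only in hyphen/space placement.

    Returns a dict mapping each separator-free key to the sorted list of
    distinct spellings, keeping only keys with more than one spelling.
    """
    groups = defaultdict(set)
    for form in forms:
        groups[variant_key(form)].add(form)
    return {key: sorted(spellings)
            for key, spellings in groups.items()
            if len(spellings) > 1}
```

Called on `["udu nita", "udu-nita", "ugula ki-siki-ka", "ugula ki siki-ka"]`, this groups the first two forms under one key and the last two under another, which gives a list of candidates to review by hand.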
-
Hi @tomerm @semion1956
As discussed before, I would like another boolean parameter in the tokenization process that defines whether tokenization is performed at all. Now I have to …
-
At least the following appear in the data:
* `Gam esIndustry-julkaisu`
* `ki rjoitettu`
* `myytyynYounitediin`
* `tal lennustilaa`
* `jaNokia`
* `televisionkatselu un`
* `Lumia-puheli mia`
*…
-
spaCy models should be adapted to the medical corpus. For example:
`tokens['train'][0:10]: [['EMEA', '/', 'H', '/', 'C', '/', '551', 'PRIALT']...`
-
The user input is tokenized again and again for each intent. This needs a complete revision of the structure ...
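One possible direction, sketched under assumptions: tokenize the input once and hand the same token list to every intent. The names `tokenize`, `Intent`, and `match_intents` are hypothetical and not taken from the project's code, and the keyword-count scoring is only a stand-in for whatever matching the intents really do.

```python
import re
from dataclasses import dataclass

_TOKEN_RE = re.compile(r"\w+")

def tokenize(text):
    """Illustrative tokenizer: lowercase the text and extract word tokens."""
    return _TOKEN_RE.findall(text.lower())

@dataclass(frozen=True)
class Intent:
    name: str
    keywords: frozenset

    def score(self, tokens):
        """Count how many of the pre-computed tokens are keywords of this intent."""
        return sum(1 for token in tokens if token in self.keywords)

def match_intents(text, intents):
    """Tokenize the input once, then score every intent on the same tokens."""
    tokens = tokenize(text)  # single tokenization pass, shared by all intents
    return {intent.name: intent.score(tokens) for intent in intents}
```

The point of the design is that `Intent.score` accepts tokens rather than raw text, so no intent can trigger its own tokenization pass.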
-
Because who wants to type in %HERPDERP
-
The current tokenisation story of VS Code is based on TM grammars, which are pretty powerful, but we are running into their limits if we want to do something more than a top-down scanner can do. Also,…
-
### System Info
Hello developer,
The Llama-3 model was released today.
I want to convert this model to an HF model, but when I follow the README, the following issue occurs.
` File "/workspace/…
-
When the original text is missing a space after the end of a sentence, the last word of that sentence and the first word of the next are treated as a single token.
Equally wrong is when two …
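For the missing-space case, a possible pre-processing workaround is sketched below. It is an assumption, not part of the tokenizer: the function name `repair_sentence_spacing` is hypothetical, and the heuristic (a lowercase letter, then `.`, `!` or `?`, then an uppercase letter) only handles ASCII letters and will also split strings such as `file.Name`.

```python
import re

# A sentence-ending mark glued between a lowercase and an uppercase ASCII letter.
_MISSING_SPACE = re.compile(r"(?<=[a-z])([.!?])(?=[A-Z])")

def repair_sentence_spacing(text):
    """Insert a space after sentence-ending punctuation that lacks one."""
    return _MISSING_SPACE.sub(r"\1 ", text)
```

Running this before tokenization turns `It rained.Then it stopped.` into `It rained. Then it stopped.`, while leaving numbers such as `3.5` and abbreviations such as `U.S.A` untouched.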