-
As the original iteration of this core showed, most of what we have here can be achieved by using a custom tokeniser.
We initially avoided doing this, since we wanted to avoid duplicati…
-
This is what I had expected:
```
Standing Rock-vuosttaldemiin @U.Cap.Obl@Standing Rock+CmpNP/First+N+Prop+Sem/Plc@U.Cap.Obl@+Cmp/SgNom@P.CmpFrst.FALSE@@P.CmpPref.FALSE@@D.CmpLast.TRUE@@D.CmpNone.T…
-
(sorry for the long post)
When developing spelling dictionaries intended for use in a graphical environment like OOo and others, it would be very helpful to be able to spell check texts in exactly…
-
For example here:
![image](https://user-images.githubusercontent.com/449545/109909263-b4cd5f80-7c9d-11eb-8d2c-68d52ece0e35.png)
The original text is:
```
common_voice_th_23657260.mp3 สองอันเท…
-
When tokenization is disabled on long lines, such as when I am reading a minified file (or, yes, the line really is very long for a reason), I am plagued with messages telling me that Token…
-
I'm really looking forward to your paper on FT-TabPFN and the release of the source code! The novel Feature Tokenisation layer sounds like a significant enhancement for handling categorical featur…
-
Closed issue #45 indicates that udpipe was used, and `__main__.py` suggests that you use the expanded form for CoNLL-U multiword tokens, e.g. the 2 tokens "de le" instead of "du" in French. The readme should…
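To make the distinction concrete, here is a minimal sketch of how a CoNLL-U multiword token expands (the example sentence and helper names are illustrative assumptions, not code from this repo). In CoNLL-U, a range ID like `2-3` carries the surface form ("du"), while the single-ID lines that follow carry the syntactic words ("de", "le"); the expanded form keeps only the latter.

```python
# Illustrative (ID, FORM) pairs for French "Merci du cadeau",
# where the multiword token "du" expands to "de" + "le".
conllu_ids_and_forms = [
    ("1", "Merci"),
    ("2-3", "du"),   # multiword token: surface form, range ID
    ("2", "de"),     # first syntactic word
    ("3", "le"),     # second syntactic word
    ("4", "cadeau"),
]

def expanded_tokens(rows):
    """Return the syntactic words, skipping multiword-token range lines."""
    return [form for tok_id, form in rows if "-" not in tok_id]

def surface_tokens(rows):
    """Return surface tokens, preferring range lines over their parts."""
    out, skip = [], set()
    for tok_id, form in rows:
        if "-" in tok_id:
            lo, hi = tok_id.split("-")
            skip.update(str(i) for i in range(int(lo), int(hi) + 1))
            out.append(form)
        elif tok_id not in skip:
            out.append(form)
    return out

print(expanded_tokens(conllu_ids_and_forms))  # ['Merci', 'de', 'le', 'cadeau']
print(surface_tokens(conllu_ids_and_forms))   # ['Merci', 'du', 'cadeau']
```

The readme could simply state which of these two views the tool emits, since downstream token counts differ between them.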
-
Azure DevOps reports the following warning when running against a hosted agent.
##[warning]Task 'Tokenization' (2.10.0) is using deprecated task execution handler. The task should use the supported…
-
For non-Network Division BERT benchmarks, the dataset is tokenised outside of the benchmark run, but in the inference rules, the phrase "Text in. No compression allowed." implies that for Network Di…
-
Add functionality to the existing tokenisation routines (#2) so that tokens can be split into subtokens and adjacent subtokens can be merged.
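A minimal list-based sketch of what such an API could look like (the function names and token representation are assumptions, not the existing routines from #2):

```python
def split_token(tokens, index, parts):
    """Replace tokens[index] with the given subtokens.

    The subtokens must concatenate back to the original token text,
    so splitting is lossless.
    """
    assert "".join(parts) == tokens[index]
    return tokens[:index] + list(parts) + tokens[index + 1:]

def merge_tokens(tokens, start, count):
    """Merge `count` adjacent (sub)tokens starting at `start` into one."""
    merged = "".join(tokens[start:start + count])
    return tokens[:start] + [merged] + tokens[start + count:]

tokens = ["auto-tokeniser", "works"]
tokens = split_token(tokens, 0, ["auto", "-", "tokeniser"])
# tokens == ["auto", "-", "tokeniser", "works"]
tokens = merge_tokens(tokens, 0, 3)
# tokens == ["auto-tokeniser", "works"]
```

The invariant that splitting then merging round-trips to the original token is what makes the two operations safe to compose.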