Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation and splits sentences. It also offers other basic preprocessing steps, such as changing case, that you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
I came across an edge case where ucto ran as user www-data (which has no $HOME), but the variable $HOME was still set to /root. As a result, ucto tried to find config files in /root/.config (localConfigDir), which was inaccessible (permission denied; in fact, the directory didn't exist in the first place). Ucto exited with an error rather than falling back to defaultConfigDir (e.g. /usr/share/ucto), which did exist.
Proposed solution: no hard failure on such filesystem errors during configuration discovery if there are fallback files to try.
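To make the proposal concrete, here is a minimal sketch of the fallback behaviour in Python. It is not ucto's actual code (ucto itself is written in C++). The function names, the `.config/ucto` layout under `$HOME`, and the file name used in the usage note are all assumptions made for illustration. The idea is simply to treat any filesystem error on one candidate directory as "not found here" and move on to the next.

```python
import os


def candidate_config_dirs(default_dir, env=None):
    """Return the directories to search, most specific first.

    Hypothetical helper: the per-user layout ($HOME/.config/ucto) is an
    assumption for this sketch, not taken from ucto's source.
    """
    env = os.environ if env is None else env
    dirs = []
    home = env.get("HOME", "")
    if home:
        # A blank or unset HOME simply skips the per-user directory.
        dirs.append(os.path.join(home, ".config", "ucto"))
    dirs.append(default_dir)
    return dirs


def find_config(filename, dirs):
    """Return the first path that can actually be opened, or None."""
    for directory in dirs:
        path = os.path.join(directory, filename)
        try:
            with open(path, "rb"):
                return path
        except OSError:
            # Covers a missing directory, permission denied, and similar
            # filesystem errors: do not abort, try the next candidate.
            continue
    return None
```

With this shape, a call such as `find_config("tokconfig-example", candidate_config_dirs("/usr/share/ucto"))` would only report a failure (here, `None`) once every candidate has been tried, so a stale `$HOME` pointing at an unreadable directory no longer blocks the system-wide defaults.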
Workaround: make sure $HOME is either valid or simply blank; ucto then works fine.