LanguageMachines / ucto

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It offers several other basic preprocessing steps, such as changing case, that you can use to make your text suited for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto
https://languagemachines.github.io/ucto
GNU General Public License v3.0

Byte-order mark followed by space or tab results in Folia error #79

Closed marijnschraagen closed 4 years ago

marijnschraagen commented 4 years ago

If a text file starts with a byte-order mark directly followed by a space or a tab, then Folia gives an error. Column-based output is not affected. See the attached file bom.txt as an example. This file starts with a byte-order mark (EF BB BF), then a space (20), then the text "Jan loopt.":

$ xxd bom.txt
efbb bf20 4a61 6e20 6c6f 6f70 742e 0a0a  ... Jan loopt...

This results in the following error:

$ frog -t bom.txt -X bom.xml
[...]
frog-:Wed Oct  7 21:34:35 2020 Frogging bom.txt
1           [] SPEC(symb)  1.000000    O   B-NP    0   ROOT
2   Jan Jan [Jan]   SPEC(deeleigen) 1.000000    B-PER   B-NP    3   su
3   loopt   lopen   [loop][t]   WW(pv,tgw,met-t)    0.998612    O   B-VP    0   ROOT
4   .   .   [.] LET()   1.000000    O   O   3   punct

terminate called after throwing an instance of 'folia::ValueError'
  what():  TextContent: 'value' attribute may not be empty.
Afgebroken (geheugendump gemaakt)

Either removing the BOM or removing the space at the start (but leaving the BOM) results in successful parsing of the file. A tab character instead of a space also triggers the issue.
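For anyone stuck on a version without the fix, the first workaround (removing the BOM) can be sketched as a small preprocessing step. This is only an illustration of the condition described above, not the fix that went into Ucto; the helper names `triggers_issue` and `strip_bom` are made up for this example, and `sample` simply reproduces the bytes of bom.txt from the hex dump.

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF


def triggers_issue(data: bytes) -> bool:
    """Return True if data starts with a UTF-8 BOM directly followed by a space or tab.

    Hypothetical helper: it only encodes the condition reported in this issue.
    """
    return data.startswith(BOM) and data[len(BOM):len(BOM) + 1] in (b" ", b"\t")


def strip_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM; other input is returned unchanged."""
    if data.startswith(BOM):
        return data[len(BOM):]
    return data


# Same bytes as in the xxd dump of bom.txt: BOM, space, "Jan loopt.", two newlines.
sample = BOM + b" Jan loopt.\n\n"

if __name__ == "__main__":
    print(triggers_issue(sample))             # BOM followed by a space
    print(triggers_issue(strip_bom(sample)))  # BOM removed
```

Writing `strip_bom(...)` of the file contents to a new file before running frog corresponds to the "removing the BOM" case; the leading space is left in place.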

kosloot commented 4 years ago

Interesting bug :) Frog relies on Ucto to handle BOM markers (that those are evil goes without saying), so I assume the bug is more of an Ucto issue. Will move the issue to Ucto.

kosloot commented 4 years ago

Interestingly, this bug is hard to reproduce in Ucto itself, as Frog uses the Ucto API slightly differently from Ucto itself. Still, it is an Ucto problem.

kosloot commented 4 years ago

I committed a fix in Ucto. This should solve the problem. Please test.

marijnschraagen commented 4 years ago

I tried to update lamachine but I got an error about installing Aptitude (even though I installed from scratch yesterday and everything worked smoothly). Maybe I can try updating just ucto using lamachine-update --only. Which packages should I specify to get the new ucto only?

kosloot commented 4 years ago

@marijnschraagen you have to wait until @proycon updates LaMachine (in this case the development version, until the bug fix is approved and officially released). Hope @proycon reacts soon...

proycon commented 4 years ago

Sorry for the delay! Thanks for the fix @kosloot! I'm testing it right now and will do a release straight away if this indeed fixes it.

> I tried to update lamachine but I got an error about installing Aptitude (even though I installed from scratch yesterday and everything worked smoothly).

That is strange, can you create an issue if it persists?

> Maybe I can try updating just ucto using lamachine-update --only, which packages should I specify to get the new ucto only?

You can do lamachine-update --only languagemachines-basic, which includes frog and ucto. But it will only work if you're on the development version. Or just hold on until I publish the release and then it'll work in the stable LaMachine too.

proycon commented 4 years ago

The fix works and I have now released ucto v0.22. It should be available in LaMachine after a lamachine-update (or a fresh installation).