LanguageMachines / ucto

Unicode tokeniser. Ucto tokenizes text files: it separates words from punctuation, and splits sentences. It also offers other basic preprocessing steps, such as changing case, all of which you can use to prepare your text for further processing such as indexing, part-of-speech tagging, or machine translation. Ucto comes with tokenisation rules for several languages and can be easily extended to suit other languages. It has been incorporated for tokenizing Dutch text in Frog, our Dutch morpho-syntactic processor. http://ilk.uvt.nl/ucto --
https://languagemachines.github.io/ucto
GNU General Public License v3.0

Tokenization of t-style element that has font_typeface Feature #82

Closed pirolen closed 3 years ago

pirolen commented 3 years ago

I have just come across this: when a t-style has a font_typeface Feature (e.g. with superscript as class), it seems that in certain contexts the styled item is not split from its left neighbor, despite the t-style tag="token".

Two examples for correct split:

`

zen:b

`

—>

<w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_001.text.div1.p4.s.3.w.36" class="WORD" set="tokconfig-deu" space="no" textclass="OCR">
            <t class="OCR">zen</t>
          </w>
          <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_001.text.div1.p4.s.3.w.37" class="PUNCTUATION" set="tokconfig-deu" space="no" textclass="OCR">
            <t class="OCR">:</t>
          </w>
          <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_001.text.div1.p4.s.3.w.38" class="WORD" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">b</t>
          </w>

and

`

träge über „die Landarbeiter in Knechtschaft und Freiheit“d

`

—>

          <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_007.text.div1.p4.s.3.w.46" class="WORD" set="tokconfig-deu" space="no" textclass="OCR">
            <t class="OCR">Freiheit</t>
          </w>
          <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_007.text.div1.p4.s.3.w.47" class="PUNCTUATION" set="tokconfig-deu" space="no" textclass="OCR">
            <t class="OCR">“</t>
          </w>
          <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_007.text.div1.p4.s.3.w.48" class="WORD" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">d</t>
          </w>
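
The correct splits above can be checked mechanically by rebuilding the surface string from the `<w>` elements. The following is a minimal sketch using only the Python standard library, not foliapy. It assumes a simplified snippet (shortened ids, no FoLiA namespace, a hypothetical `<s>` wrapper) and assumes that a missing `space` attribute means a space follows the token.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the first example: ids are shortened and the FoLiA
# namespace is omitted, so this is not a complete FoLiA document.
SNIPPET = """
<s>
  <w xml:id="w.36" class="WORD" space="no"><t class="OCR">zen</t></w>
  <w xml:id="w.37" class="PUNCTUATION" space="no"><t class="OCR">:</t></w>
  <w xml:id="w.38" class="WORD"><t class="OCR">b</t></w>
</s>
"""


def tokens(xml_text):
    """Return (text, token_class, followed_by_space) for every <w> element."""
    root = ET.fromstring(xml_text.strip())
    result = []
    for w in root.iter("w"):
        text = w.findtext("t", default="")
        # Assumption: an absent space attribute means a space follows.
        followed_by_space = w.get("space", "yes") != "no"
        result.append((text, w.get("class"), followed_by_space))
    return result


def surface(toks):
    """Join token texts, adding a space only after tokens not marked space="no"."""
    parts = []
    for text, _token_class, followed_by_space in toks:
        parts.append(text)
        if followed_by_space:
            parts.append(" ")
    return "".join(parts).rstrip()


toks = tokens(SNIPPET)
print([text for text, _, _ in toks])  # ['zen', ':', 'b']
print(surface(toks))                  # zen:b
```

Three tokens that join back to `zen:b` indicate that the superscript was split off as its own token. In the unsplit cases reported in this issue, the same check would show the superscript letter still attached to the preceding word.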

Two examples for no split:

`

tümern Mecklenburg und dem wegen der Verwandtschaft seinera

`

—>

        <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_001.text.div1.p4.s.2.w.33" class="WORD" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">Verwandtschaft</t>
          </w>
          <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_001.text.div1.p4.s.2.w.34" class="WORD" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">seinera</t>
          </w>

and

`

hoch die Einnahme des Arbeiters sich thatsächlich beläuftc

`

—>

<w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_003.text.div1.p3.s.4.w.15" class="WORD" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">thatsächlich</t>
          </w>
          <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_oxed_003.text.div1.p3.s.4.w.16" class="WORD" set="tokconfig-deu" space="no" textclass="OCR">
            <t class="OCR">beläuftc</t>
          </w>

kosloot commented 3 years ago

As far as I can see this is fixed now in the master branch

pirolen commented 3 years ago

I updated my LaMachine dev, and the fix works, thanks!

@proycon: After updating, calling python-ucto from a .py script gives an error:

    ucto: textcat configured from: /home/ubuntu/piro/projects/lamadev/lmdev/share/ucto/textcat.cfg
    free(): invalid next size (normal)
    Aborted

In the script I have:

    configurationfile = "tokconfig-deu"
    tokenizer = ucto.Tokenizer(configurationfile)

In the python interpreter this works fine.
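
To tell a native crash like the `free()` abort above apart from an ordinary Python exception, one option is to run the failing code in a fresh child interpreter and inspect how it ended. The following is a minimal sketch using only the standard library. The helper name `run_isolated` is hypothetical, and `os.abort()` is used as a stand-in for the crashing script so that the example runs without python-ucto installed. The signal interpretation applies to POSIX systems only.

```python
import signal
import subprocess
import sys


def run_isolated(code):
    """Run Python source in a fresh interpreter and describe how it ended."""
    proc = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
    )
    if proc.returncode < 0:
        # On POSIX a negative return code means the child was killed by a signal.
        return "killed by " + signal.Signals(-proc.returncode).name
    if proc.returncode > 0:
        return "exited with status " + str(proc.returncode)
    return "ok"


# Stand-in for the failing script: os.abort() raises SIGABRT, the signal
# behind the "Aborted" message in the output above.
print(run_isolated("import os; os.abort()"))
print(run_isolated("raise RuntimeError('plain Python error')"))
print(run_isolated("print('fine')"))
```

Passing the two lines from the script (preceded by `import ucto`) as the `code` argument would show whether the abort also happens in a clean process, independent of any state in an interactive session.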

proycon commented 3 years ago

Does it also work fine when you close the interpreter? I encountered such free() issues before.

Usually a forced recompilation of python-ucto helps, you may want to explicitly delete $LM_PREFIX/src/python-ucto and $LM_PREFIX/lib/python3.8/site-packages/*ucto*so (the python version number may differ) and run lamachine-update again. Running it with the force=1 option should do it automatically but takes longer.

pirolen commented 3 years ago

> Does it also work fine when you close the interpreter?

If I simply run the script from the command line (it contains nothing but the call to ucto), I get the error.

> you may want to explicitly delete $LM_PREFIX/src/python-ucto and $LM_PREFIX/lib/python3.8/site-packages/*ucto*so (the python version number may differ) and run lamachine-update again.

OK, am updating after having deleted those two.

Previously I did only a partial lamachine-update (since the full update fails), i.e. lamachine-update --only languagemachines-python languagemachines-basic, and I will try a partial update again. I am not sure whether the partial update is what causes the issue.

If it does not help, will try the full update with force=1 option.

pirolen commented 3 years ago

> lamachine-update --only languagemachines-python languagemachines-basic

OK, probably not a surprise to you, this did not work out (got a No module named 'ucto' afterwards). Trying the full update.

proycon commented 3 years ago

Sorry, you need languagemachines-python-bindings for this one.

pirolen commented 3 years ago

Thanks! Updating that fails unfortunately, likely for the same reason as for the full update (python-frog; https://github.com/proycon/LaMachine/issues/196). Please find the log attached.

lamachine-lmdev-20210421_142624.log

kosloot commented 3 years ago

Ok, python-frog fails on:

    no matching function for call to ‘FrogAPI::FrogAPI(FrogOptions&, TiCC::Configuration&, TiCC::LogStream*, TiCC::LogStream*)’

This has indeed changed in the more recent Frog. I didn't realize python-frog uses this constructor. But it does, so some adjustments are needed. @proycon sorry about this. @pirolen I hope @proycon will fix this soon.

proycon commented 3 years ago

This affects the development Frog only right? We'll have to update the development python-frog then but no need to rush a release?

proycon commented 3 years ago

@kosloot In the new situation there no longer seems to be a way to pass TiCC::Configuration&? We need that for python-frog.

pirolen commented 3 years ago

(I'm only using the dev LaMachine.)

kosloot commented 3 years ago

I extended the FrogAPI with a constructor using TiCC::Configuration. This should help you fix this.

kosloot commented 3 years ago

Ok, that was a bad idea. Reverted it

proycon commented 3 years ago

@pirolen Issue proycon/python-frog#20 should be fixed now, the development lamachine should compile again.

pirolen commented 3 years ago

You rock! Update and ucto work fine now. Will test the entire pipeline further.

pirolen commented 3 years ago

Would there be a way with foliapy to access the text in a t-style element without the t-hspace? In the .text() method I did not find a corresponding parameter.

kosloot commented 3 years ago

Why would you want to do this? The <t-hspace> is inserted because a space is present in the original Abbyy file (at least that is intended) and that implies we must assume it was present in the original text too. So why ditch it then? Side note: <t-hspace> is only used for trailing or leading spaces. So I assume that a simple String.trim() would do what you want?

pirolen commented 3 years ago

I came across a case when the superscripted item ends in such a trailing space. The superscripts denote footnote reference indices, which I link to a corresponding note item in the footnote area, by string matching. But the trailing space breaks the matching. I was just wondering if I should find a foliapy-ic solution or just strip the space.

Sure, the space is present in the original (pls see screenshots).

[Two screenshots attached: "Screenshot 2021-04-27 at 12 50 20" and "Screenshot 2021-04-27 at 12 52 43"]

kosloot commented 3 years ago

So the FoLiA reflects the original document. That is what we want. It's up to the user for more or alternative interpretations.
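
For the footnote matching described above, one such user-side interpretation is to strip surrounding whitespace from the marker text before comparing, leaving the FoLiA document itself untouched. The following is a minimal sketch in plain Python. It does not use foliapy, and the function names, ids, and labels are invented for illustration. It assumes the marker texts have already been extracted as strings.

```python
def normalise_marker(text):
    """Strip leading and trailing whitespace (e.g. from a t-hspace) from a marker."""
    return text.strip()


def link_references(references, notes):
    """Map each reference id to the note id whose label matches after stripping.

    references: dict of reference id -> marker text as found in the running text
    notes:      dict of note id -> label text as found in the footnote area
    Unmatched references map to None.
    """
    by_label = {normalise_marker(label): note_id for note_id, label in notes.items()}
    links = {}
    for ref_id, marker in references.items():
        links[ref_id] = by_label.get(normalise_marker(marker))
    return links


# Hypothetical data: "a " carries the trailing space that broke the matching.
references = {"ref.1": "a ", "ref.2": "c", "ref.3": "x"}
notes = {"note.1": "a", "note.2": "c"}
print(link_references(references, notes))
# {'ref.1': 'note.1', 'ref.2': 'note.2', 'ref.3': None}
```

Normalising only at comparison time keeps the space in the document, as it was in the original, while still letting the superscript `a ` match the note labelled `a`.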

I assume we can close this discussion now.