LanguageMachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)
https://proycon.github.io/folia
GNU General Public License v3.0

FoLiA-2text. How to handle ``<t-str>`` and ``<t-hbr/>`` correctly. Is it even possible? #56

Closed kosloot closed 3 years ago

kosloot commented 3 years ago

Given the following FoLiA (a simplified version of the FoLiA-abby output, with all features and metrics removed):

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="FA-morse_OCR_perletter" generator="libfolia-v2.8" version="2.4.2">
  <metadata type="native">
    <annotations>
      <paragraph-annotation/>
      <style-annotation/>
      <hyphenation-annotation/>
      <string-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="text">
    <p xml:id="text.p1">
      <t>
        <t-str xml:id="text.p2.t-str.1">
          <t-style>Da in diesem Kapitel hauptsächlich eine Erschei<t-hbr/>
      </t-style>
    </t-str>
    <t-str xml:id="text.p2.t-str.2">
      <t-style>nung der Kaiserzeit behandelt werden soll,</t-style>
    </t-str>
      </t>
    </p>
  </text>
</FoLiA>

The output from FoLiA-2text (and also its counterpart folia2txt) is:

Da in diesem Kapitel hauptsächlich eine Erschei
    nung der Kaiserzeit behandelt werden soll,

There are two closely related issues here:

  1. The <t-hbr/> is NOT reflected in the output at all.
  2. The formatting spaces before the second t-str are output.

In fact, this may be just one problem. A small modification of the input fixes it:

<?xml version="1.0" encoding="UTF-8"?>
<FoLiA xmlns:xlink="http://www.w3.org/1999/xlink" xmlns="http://ilk.uvt.nl/folia" xml:id="FA-morse_OCR_perletter" generator="libfolia-v2.8" version="2.4.2">
  <metadata type="native">
    <annotations>
      <paragraph-annotation/>
      <style-annotation/>
      <hyphenation-annotation/>
      <string-annotation/>
      <text-annotation set="https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl"/>
    </annotations>
  </metadata>
  <text xml:id="text">
    <p xml:id="text.p1">
      <t>
        <t-str xml:id="text.p2.t-str.1">
          <t-style>Da in diesem Kapitel hauptsächlich eine Erschei<t-hbr/>
      </t-style>
    </t-str><t-str xml:id="text.p2.t-str.2">
      <t-style>nung der Kaiserzeit behandelt werden soll,</t-style>
    </t-str>
      </t>
    </p>
  </text>
</FoLiA>

That is, the second <t-str> is appended directly after the first, with no whitespace in between.

This gives the correct output: Da in diesem Kapitel hauptsächlich eine Erscheinung der Kaiserzeit behandelt werden soll,
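
To make the difference between the two inputs concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree (not libfolia or libxml2). The XML is a stripped-down stand-in for the documents above, without namespace, IDs or metadata, and the helper name tail_after_first_str is made up for this illustration. It only shows where the whitespace ends up in the parsed tree: as the "tail" text that follows the first </t-str>.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the two inputs above (no namespace, no xml:id).
INDENTED = """<t>
  <t-str><t-style>Erschei<t-hbr/>
  </t-style></t-str>
  <t-str><t-style>nung</t-style></t-str>
</t>"""

ADJACENT = """<t>
  <t-str><t-style>Erschei<t-hbr/>
  </t-style></t-str><t-str><t-style>nung</t-style></t-str>
</t>"""


def tail_after_first_str(xml_text):
    """Return the text that follows the first </t-str> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    first_str = root[0]          # the first <t-str> child of <t>
    return first_str.tail        # None when nothing follows the end tag


print(repr(tail_after_first_str(INDENTED)))   # newline plus indentation
print(repr(tail_after_first_str(ADJACENT)))   # nothing between the elements
```

In the indented variant the newline and indentation are an ordinary text node between the two <t-str> elements, which is why a text extractor that does not strip it shows them in the output.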

This behavior might surprise a naive user and, as far as I know, this kind of formatting can only be produced by hand in an editor. At the moment I don't see a way to get this result using the libxml2 API (but I may stand corrected). NB: I know we could output the whole FoLiA on one flat line, but this is hardly desirable.

Comments welcome

proycon commented 3 years ago

This is a good point indeed. Perhaps this was an oversight when we handled proycon/folia#88. I agree that the behaviour is not entirely intuitive here now. The spacing and newlines between the two <t-str> elements are not stripped (because they are neither leading nor trailing whitespace). It's also essential that we allow spacing there because we'll often see things like <t-str>a</t-str> <t-str>b</t-str>.

But in the case of a newline plus indentation directly after </t-str>, we may decide to strip it after all. I'll have to give this a bit more thought, as it might cause backward compatibility issues again.
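
As an illustration of the idea being weighed here (and explicitly not the rule FoLiA or any of its libraries implements), the sketch below drops whitespace-only text between <t-str> elements when it contains a newline, and keeps a plain space. The function names keep_tail and joined_strings and the namespace-free XML are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def keep_tail(tail):
    """Hypothetical rule: drop whitespace-only tails that contain a newline."""
    if not tail:
        return ""
    if tail.strip() == "" and "\n" in tail:
        return ""        # looks like indentation from pretty-printing
    return tail          # a deliberate space, or real text


def joined_strings(t_elem):
    """Concatenate the text of all <t-str> children, filtering the tails."""
    parts = []
    for t_str in t_elem:
        parts.append("".join(t_str.itertext()))
        parts.append(keep_tail(t_str.tail))
    return "".join(parts)


spaced = ET.fromstring("<t><t-str>a</t-str> <t-str>b</t-str></t>")
broken = ET.fromstring("<t><t-str>Erschei</t-str>\n    <t-str>nung</t-str></t>")

print(joined_strings(spaced))   # the single space is kept
print(joined_strings(broken))   # the newline plus indentation is dropped
```

Note that such a rule would also remove a genuine word separator whenever two words happen to be wrapped onto separate lines, which is one way the backward compatibility concern could show up.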

proycon commented 3 years ago

As for <t-hbr/> not being reflected in the output, there is a specific Unicode character we could output here (I'm not sure if we don't already use it actually; it's a non-spacing codepoint if I remember correctly). Perhaps it should be an opt-in feature, as I don't expect many people will want it in the output.

kosloot commented 3 years ago

OK, I discovered that libxml2's output formatting is sensitive to spaces inside text nodes. To demonstrate this, I added a test to foliatests, text_test18(). This test produces two variants of FoLiA that at first sight are equivalent, but aren't! Excerpts:

      <t>
        <t-str xml:id="text.p.1.t-str.1">
          <t-style>deel<feat class="some" subset="things"/><t-hbr/></t-style>
        </t-str>
        <t-str xml:id="text.p.1.t-str.2">
          <t-style>woord<feat class="other" subset="things"/></t-style>
        </t-str>
        <t-str> extra</t-str>
      </t>

and

      <t><t-str xml:id="text.p.1.t-str.1"><t-style>deel<feat class="some" subset="things"/><t-hbr/></t-style></t-str><t-str xml:id="text.p.1.t-str.2"><t-style>woord<feat class="other" subset="things"/></t-style></t-str> <t-str>extra</t-str></t>

The only difference is that in the first case the space is part of the text " extra", while in the second case it is a separate XmlText node, which triggers NO formatting (as we would like).

I can probably use this in FoLiA-abby to "do the right thing".

I wonder whether FoLiA-PY is also able to get this result.

pirolen commented 3 years ago

Hopefully not off-topic (otherwise please delete...): I have just sent some possibly related whitespace questions to proycon (by email), in the context of folia2html. It removes <t-hbr/> and shows a whitespace instead. When converting superscript elements, i.e. <t-style><feat class="superscript" subset="font_typeface"....>, a space appears around the corresponding HTML spans that is not there in the FoLiA.

kosloot commented 3 years ago

Apart from all the text representation issues, there is a more fundamental one concerning FoLiA-abby. @pirolen would like the FoLiA to clearly mirror the original (very convoluted) Abbyy files, including formatting. But FoLiA is more or less geared towards a nice representation of the TEXT in the files, suited for Frog/Ucto etc.

For this 'textview', items like <br/> (or just a newline) and a <t-hbr/> are opaque. Frog would be very happy with just:

<text xml:id="text">
  <p xml:id="text.p1">
    <t>Da in diesem Kapitel hauptsächlich eine Erscheinung der Kaiserzeit behandelt werden soll,</t>
      <metric class="first_char_top" value="3316"/>
      <metric class="first_char_left" value="379"/>
      <metric class="first_char_right" value="406"/>
      <metric class="first_char_bottom" value="3358"/>
      <metric class="last_char_top" value="3351"/>
      <metric class="last_char_left" value="905"/>
      <metric class="last_char_right" value="911"/>
      <metric class="last_char_bottom" value="3357"/>
  </p>
</text>

Over this:

<text xml:id="text">
  <p xml:id="text.p1">
    <t>
      <t-str xml:id="text.p2.t-str.1">
        <t-style><feat class="Arial" subset="font_family"/><feat class="6." subset="font_size"/>Da in diesem Kapitel hauptsächlich eine Erschei<t-hbr/>
    </t-style>
      </t-str><t-str xml:id="text.p2.t-str.2">
    <t-style><feat class="Arial" subset="font_family"/><feat class="6." subset="font_size"/>nung der Kaiserzeit behandelt werden soll,</t-style>
      </t-str>
    </t>
    <metric class="first_char_top" value="3316"/>
    <metric class="first_char_left" value="379"/>
    <metric class="first_char_right" value="406"/>
    <metric class="first_char_bottom" value="3358"/>
    <metric class="last_char_top" value="3351"/>
    <metric class="last_char_left" value="905"/>
    <metric class="last_char_right" value="911"/>
    <metric class="last_char_bottom" value="3357"/>
  </p>
</text>

Our challenge is to accommodate both. Maybe the best solution is to create more flavors (steered by options in FoLiA-abby): one with all the formatting info needed, and one with a simpler representation, suited for Ucto/Frog. This can also be realized in one file, by using different text classes, like 'OCR' for 'normal' text and 'LAYOUT' for the full-blown representation. There is a strong relation between the two, but the OCR variant will lack a lot of spaces/tabs and hyphens.

@proycon what do you think?

proycon commented 3 years ago

I'm not sure if we need more options and flavours. I agree that the challenge is to accommodate both in a way. Frog, ucto and other FoLiA tools should be as happy with the first fragment as with the second and be able to process both. From the perspective of Frog and ucto, the more convoluted example is functionally equivalent to the simpler one. Of course you can always add extra text layers, but I'm not sure that doesn't add more confusion than it alleviates.

I'm more concerned about what you addressed earlier in your text_test18 case, that really is something I need to dive into and we need to get straight.

proycon commented 3 years ago

I think that text_test18 issue relates to what we did in https://github.com/proycon/folia/issues/92 and earlier in https://github.com/proycon/folia/issues/88

kosloot commented 3 years ago

I agree that we should come up with a solution that is as minimal and simple as possible. But supporting both 'worlds' can be cumbersome, I fear.

An original text of:

Chapter 1
    A hyphened sen-
    tence

can be presented in a lot of ways in FoLiA, but the preferred text representation for ucto/frog should be something like:

<t>Chapter 1</t>
<t>A hyphened sentence</t>

So the task is, whatever FoLiA representation is used, to get the above 'basic' form. The main task is to avoid 'disturbing' spaces, which can be done with libfolia, as I showed earlier, but is not really part of the FoLiA paradigm, I suppose.
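
To illustrate the 'basic' form on plain text only, here is a naive sketch that joins a line ending in a hyphen with the line that follows it. It is not what ucto, Frog or libfolia do internally; the function name join_hyphenated is made up, and the sketch cannot tell a line-break hyphen from a genuine hyphen (as in "well-known") that happens to fall at the end of a line.

```python
def join_hyphenated(lines):
    """Join each line ending in '-' with the next non-empty line (naive sketch)."""
    result = []
    carry = ""                      # first half of a word broken across lines
    for raw in lines:
        line = raw.strip()          # drop layout indentation
        if not line:
            continue
        line = carry + line
        carry = ""
        if line.endswith("-"):
            carry = line[:-1]       # remember the text without the hyphen
        else:
            result.append(line)
    if carry:                       # input ended on a hyphen: keep it as is
        result.append(carry + "-")
    return result


original = ["Chapter 1", "    A hyphened sen-", "    tence"]
print(join_hyphenated(original))
```

For the three-line example above this yields the two text units shown in the preferred representation.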

kosloot commented 3 years ago

In https://github.com/proycon/folia/issues/88 @pirolen says:

> After tokenization with ucto, the t-hbr is gone/turned into a token boundary. In my ideal workflow, the soft break would stay recoverable (and propagatable to FLAT and folia2html), if possible at all.

This is intentional. The idea is that hyphenated words are to be 'corrected' into their 'normal' form like in the example above:

A hyphened sen-
tence

into

A hyphened sentence

I agree that we lose some information then.

The simplest solution, imho, is to introduce a 'soft hyphen' there (Unicode U+00AD, see https://en.wikipedia.org/wiki/Soft_hyphen). These are opaque to Ucto and therefore Frog, but I assume other tools can pick them up and do whatever they like.
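
A minimal sketch of this soft-hyphen idea, assuming a namespace-free <t> element and Python's xml.etree.ElementTree rather than libfolia: the hypothetical helper text_with_hbr emits U+00AD for each <t-hbr/> only when asked to (opt-in), and otherwise emits nothing. Any text content inside <t-hbr> is ignored here for simplicity.

```python
import xml.etree.ElementTree as ET

SOFT_HYPHEN = "\u00ad"   # U+00AD SOFT HYPHEN


def text_with_hbr(elem, mark_hbr=False):
    """Collect the text of elem; optionally mark each <t-hbr/> with U+00AD."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "t-hbr":
            parts.append(SOFT_HYPHEN if mark_hbr else "")
        else:
            parts.append(text_with_hbr(child, mark_hbr))
        parts.append(child.tail or "")   # text following the child element
    return "".join(parts)


t = ET.fromstring("<t>A hyphened sen<t-hbr/>tence</t>")
plain = text_with_hbr(t)                    # no trace of the break
marked = text_with_hbr(t, mark_hbr=True)    # break recoverable via U+00AD

print(plain)
print(marked.replace(SOFT_HYPHEN, "|"))     # make the marker visible
```

A tool that does not care about the break can strip U+00AD to get back the 'normal' form, while a renderer could use it to restore the original line break.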

Note that it is also possible to add a text string to the <t-hbr/> element, like this: <t-hbr>-</t-hbr> or <t-hbr>whats the point</t-hbr>

That might also be helpful, but it is specific to a single use case. A more generic solution is desirable.

pirolen commented 3 years ago

I see, cool. For a bit more context:

In my scenario, both t-hbr and br are needed for provenance reasons, so that one can maintain line break information from the original OCR-ed documents. But it would be good if ucto would yield the 'normal' form, so that in FLAT the hyphenated word would be renderable/editable as a single token.

It might be that (in my use case) there is some similarity in the way I'd like to make use of the breaks resp. the font style information: both need to be kept and reassigned after using ucto + FLAT... Would this be feasible with the post-processing @proycon mentioned here? https://github.com/proycon/foliatools/issues/19#issuecomment-789958696

pirolen commented 3 years ago

More details: in my use case the t-hbr is added by FoLiA-abby when there is a hyphen at the end of a line (in Abbyy XML), as in the example below: "... Erwartungen der-(line ends)jenigen...".

Seems like the converter puts each line into a separate t-str:

      <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.2">
        <t-style><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}" subset="font_style"/>wertlos halten. Es entspricht einerseits nicht den Erwartungen der<t-hbr/></t-style>
      </t-str>
      <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.3">
        <t-style><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}" subset="font_style"/>jenigen, welche in betreff der Lage der Landarbeiter nur solche</t-style>
      </t-str>

and then each t-str boundary is interpreted by ucto as a token boundary:

      <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.s.2.w.6" class="WORD" set="tokconfig-deu" textclass="OCR">
        <t class="OCR">Erwartungen</t>
      </w>
      <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.s.2.w.7" class="WORD" set="tokconfig-deu" textclass="OCR">
        <t class="OCR">der</t>
      </w>
      <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.s.2.w.8" class="WORD" set="tokconfig-deu" space="no" textclass="OCR">
        <t class="OCR">jenigen</t>
      </w>
      <w xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.s.2.w.9" class="PUNCTUATION" set="tokconfig-deu" textclass="OCR">
        <t class="OCR">,</t>
kosloot commented 3 years ago

That is not a problem with ucto as such. It has everything to do with the interpretation of whitespace in FoLiA, to be resolved in v2.5; see https://github.com/proycon/folia/issues/88

The indentation between 'der' and 'jenigen' is kept (as can also be seen by using folia2txt or FoLiA-2text on this file): problem.xml.txt

pirolen commented 3 years ago

Sure, just meant to provide some (hopefully relevant) details, as @proycon was asking:

> Do you have an example? I have a test document https://github.com/proycon/folia/blob/master/examples/whitespace-linebreaks.2.0.0.folia.xml and ucto does handle that correctly. A div with <t>Don't leave me bro<t-hbr/>ken and alone!</t> correctly produces the token <w><t>broken</t></w> Are you sure there's not more going on like a leading/trailing space?

proycon commented 3 years ago

Ok, the fact that they are in separate <t-str> elements need not be an obstacle, but the fact that there is a newline between these two <t-str> elements (and their contents) will be a problem. Also in the new situation we're implementing for FoLiA v2.5, this causes a whitespace (and therefore ucto creates a token break).

A correct input for your desired ucto output (in the FoLiA v2.5 situation) would be:

      <t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.2">
        <t-style><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}" subset="font_style"/>wertlos halten. Es entspricht einerseits nicht den Erwartungen der<t-hbr/></t-style></t-str><t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_005.text.div1.p2.t-str.3"><t-style><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{3C19F4A8-2234-4EE8-9373-EBFA03C5A2A4}" subset="font_style"/>jenigen, welche in betreff der Lage der Landarbeiter nur solche</t-style>
      </t-str>

It might be a bit tricky to get the proper XML serialisation; I find it hard to predict how it will turn out. In such cases we may want to go for a normalised serialisation that simply keeps the entire element on one line (no indentation and newlines), which makes everything a lot easier.
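
A sketch of what such a one-line serialisation looks like, using Python's xml.etree.ElementTree as a stand-in (libfolia and libxml2 have their own serialisers, which are not shown here). The helper make_str and the namespace-free, feature-free elements are assumptions for this illustration.

```python
import xml.etree.ElementTree as ET


def make_str(text, hbr=False):
    """Build a bare <t-str><t-style>text[<t-hbr/>]</t-style></t-str> (hypothetical helper)."""
    t_str = ET.Element("t-str")
    style = ET.SubElement(t_str, "t-style")
    style.text = text
    if hbr:
        ET.SubElement(style, "t-hbr")
    return t_str


t = ET.Element("t")
t.append(make_str("Erwartungen der", hbr=True))
t.append(make_str("jenigen, welche"))

# tostring() adds no indentation or newlines of its own, so the whole
# <t> element ends up on a single line.
flat = ET.tostring(t, encoding="unicode")
text = "".join(t.itertext())

print(flat)
print(text)
```

Because no whitespace is ever inserted between the elements, the concatenated text reads "derjenigen" without a break.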

kosloot commented 3 years ago

Well, as pointed out 22 days ago, I was able to produce such a serialization, but it is indeed trickery. It might be sufficient though. Otherwise I still think we could fall back on the soft-hyphen or the space="no" solution. Wild idea: a space="no" attribute could maybe be implicit for <t-hbr>, as a hint for "our" text normalization?

kosloot commented 3 years ago

So after a lot of work by @proycon and me on the FoLiA definitions and some additions, I finally have a version of libfolia and foliautils where FoLiA-abby seems to do 'the right thing'. It is not fully released yet, as some real testing and code cleanup are still necessary. @pirolen, are you able to test these using the latest master versions from Git?

I hope we have finally reached a satisfying result. Please let me know if some issues are still unresolved (or new ones have popped up).

pirolen commented 3 years ago

Awesome, thanks to both of you very much for this grand effort! I am happy to test it later today.

pirolen commented 3 years ago

Is it enough if I update a dev lamachine instance with --only languagemachines-basic and --only languagemachines-python? The full update somehow fails.

kosloot commented 3 years ago

@proycon should be able to help you. I am no LaMachine expert at all

proycon commented 3 years ago

> Is it enough if I update a dev lamachine instance with --only languagemachines-basic and --only languagemachines-python? The full update somehow fails.

Yes, that should be enough

pirolen commented 3 years ago

If I convert the FoLiA-abby output with folia2html, there is whitespace between a word and its (adjacent) sub-/superscripted item. Also, there is a whitespace of t-hbr in the html, even if I use span.hbr { display: none; } in the css. (The hbr-s are always at the end of t-str-s, in the last t-style element.)

pirolen commented 3 years ago

The extra whitespace seems to emerge also with the other typographic style markups, e.g. italic.

There is a double space before 'lokalen' (see screenshot below):

<t-str xml:id="FA-b1_3_1_mwtext_vorbemerk_pp61_67_001.text.div1.p45.t-str.4">
            <t-style><t-lang class="GermanStandard"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{EB511248-0012-4197-B6EF-1EC5E8377906}" subset="font_style"/>terverhältnissen der einzelnen Gegenden auf ihre<t-hspace class="space"/></t-style>
            <t-style><t-lang class="GermanStandard"/><feat class="italic" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{92EA860F-CE3A-4E49-A25F-66FA6C7CDD5B}" subset="font_style"/>lokalen</t-style>
            <t-style><t-lang class="GermanStandard"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{EB511248-0012-4197-B6EF-1EC5E8377906}" subset="font_style"/><t-hspace class="space"/>Ursachen<br/></t-style>
          </t-str>

in html:

<span class="str">
            <span class="style_none style_font_family_TimesNewRoman style_font_size_15 style_font_style_EB511248-0012-4197-B6EF-1EC5E8377906">terverhältnissen der einzelnen Gegenden auf ihre<span class="hspace"> </span></span>
            <span class="style_none style_font_typeface_italic style_font_family_TimesNewRoman style_font_size_15 style_font_style_92EA860F-CE3A-4E49-A25F-66FA6C7CDD5B">lokalen</span>
            <span class="style_none style_font_family_TimesNewRoman style_font_size_15 style_font_style_EB511248-0012-4197-B6EF-1EC5E8377906"><span class="hspace"> </span>Ursachen<br /></span>
          </span>

Screenshot 2021-04-12 at 19 15 19

kosloot commented 3 years ago

OK, I forgot to un-comment some code that removes spurious spaces before a style element. Updated in Git. Maybe it solves a lot (all?) of the problems.

pirolen commented 3 years ago

Yes, the extra spaces disappeared, great!

But I spotted a missing space, e.g. here after the italics of 'Vgl. auch':

Screenshot 2021-04-12 at 20 06 34

In:

<line baseline="593" l="343" t="562" r="732" b="597"><formatting lang="Latin" ff="Times New Roman" fs="10." italic="1" style="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}">
<charParams l="343" t="566" r="364" b="590">V</charParams>
<charParams l="359" t="573" r="378" b="596">g</charParams>
<charParams l="379" t="567" r="387" b="589" suspicious="1">l</charParams>
<charParams l="388" t="585" r="393" b="589" suspicious="1">.</charParams>
<charParams l="394" t="572" r="410" b="589"> </charParams></formatting><formatting lang="GermanStandard" ff="Times New Roman" fs="10." italic="1" style="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}">
<charParams l="411" t="572" r="425" b="589">a</charParams>
<charParams l="427" t="572" r="443" b="588">u</charParams>
<charParams l="445" t="572" r="457" b="589">c</charParams>
<charParams l="458" t="566" r="474" b="588">h</charParams></formatting><formatting lang="GermanStandard" ff="Times New Roman" fs="10." style="{5226A48F-7EB3-4F5B-9980-6EE40A5D4D10}">
<charParams l="475" t="565" r="490" b="588"> </charParams></formatting><formatting lang="Latin" ff="Times New Roman" fs="10." style="{5226A48F-7EB3-4F5B-9980-6EE40A5D4D10}">
<charParams l="491" t="565" r="504" b="588">J</charParams>
<charParams l="510" t="573" r="524" b="588">u</charParams>
<charParams l="531" t="571" r="541" b="588">s</charParams>
<charParams l="542" t="571" r="559" b="587"> </charParams>
<charParams l="560" t="571" r="572" b="587">c</charParams>
<charParams l="578" t="571" r="589" b="587">a</charParams>
<charParams l="596" t="571" r="610" b="587">n</charParams>
<charParams l="614" t="582" r="618" b="587">.</charParams>
<charParams l="619" t="570" r="636" b="587"> </charParams>
<charParams l="637" t="570" r="650" b="587">e</charParams>
<charParams l="655" t="565" r="662" b="587">t</charParams>
<charParams l="663" t="565" r="679" b="587"> </charParams>
<charParams l="680" t="570" r="692" b="587">c</charParams>
<charParams l="698" t="564" r="704" b="586">i</charParams>
<charParams l="710" t="569" r="725" b="586">v</charParams>
<charParams l="727" t="581" r="732" b="585">.</charParams></formatting></line></par>

Out:

      <p xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p10">
        <t class="OCR"><t-str xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p10.t-str.1"><t-style><t-lang class="Latin"/><feat class="italic" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}" subset="font_style"/>Vgl.<t-hspace class="space"/></t-style><t-style><t-lang class="GermanStandard"/><feat class="italic" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}" subset="font_style"/>auch</t-style><t-style><t-lang class="Latin"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{5226A48F-7EB3-4F5B-9980-6EE40A5D4D10}" subset="font_style"/>Jus can. et civ.<br/></t-style></t-str></t>
        <feat class="{1F10B8CF-03E0-42E4-B62B-C44FE1E4274B}" subset="par_style"/>
pirolen commented 3 years ago

But maybe one has to account for t-hspace in the css for the folia2html converter?

kosloot commented 3 years ago

Well, a <t-hspace> is correctly inserted. So it is up to folia2html to handle it.

pirolen commented 3 years ago

There seems to be a t-hbr issue in relation to ucto + FLAT.

I ran ucto on a file and tried uploading it to FLAT. Got an error:

Uploaded file is no valid FoLiA Document: FoLiA exception in handling of @ line 86 (in parent @ parent line 85) : [InconsistentText] Text for , is inconsistent: EXPECTED (deep text after normalization) *****>
Exposicio quorundam -orum et vocabulorum 123 38 f.
****> BUT FOUND (strict text after normalization) ****>
Exposicio quorundam -orum et vocabu lorum 123 38 f.
******* DEVIATION POINT: et vocabu<*HERE*> lorum 123
(also checked against older rules prior to FoLiA v2.4.1)
Traceback (most recent call last):
  File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 3352, in parsexml
    e = doc.parsexml(subnode, Class)
  File "/home/flatuser/flateditor/env/lib/python3.7/site-packages/folia/main.py", line 8608, in parsexml
    return Class.parsexml(node,self)
  File ....
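
To show what the DEVIATION POINT in this message refers to, here is a small sketch that compares two whitespace-normalised strings and marks the first position where they differ. It is only an illustration of the idea, not the code of the FoLiA Python library; the function name deviation_point and the context parameter are made up.

```python
def deviation_point(expected, found, context=10):
    """Return (index, snippet) for the first difference, or None if equal."""
    exp = " ".join(expected.split())    # collapse runs of whitespace
    fnd = " ".join(found.split())
    if exp == fnd:
        return None
    i = 0
    while i < min(len(exp), len(fnd)) and exp[i] == fnd[i]:
        i += 1
    # Show some of the found text on either side of the first difference.
    snippet = fnd[max(0, i - context):i] + "<*HERE*>" + fnd[i:i + context]
    return i, snippet


deep = "Exposicio quorundam -orum et vocabulorum 123 38 f."
strict = "Exposicio quorundam -orum et vocabu lorum 123 38 f."
print(deviation_point(deep, strict))
```

The two texts from the error differ exactly where the stray space follows "vocabu", i.e. at the line break that the <t-hbr/> was supposed to bridge.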

Just in case:

foliavalidator --deep /home/ubuntu/piro/projects/mwg-digital-doku/dataextraction-infrastructure/processes-lamachine/abbyy2folia/ma_bibk/newconv/FA-mittelalt_bibkat_sample_002_ucto.folia.xml

Out:

Loaded set https://raw.githubusercontent.com/proycon/folia/master/setdefinitions/text.foliaset.ttl (19 triples)
Loaded set https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl (92 triples)
Loaded set https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-fra.foliaset.ttl (80 triples)
VALIDATION ERROR on full parse by library (stage 2/3), in /home/ubuntu/piro/projects/mwg-digital-doku/dataextraction-infrastructure/processes-lamachine/abbyy2folia/ma_bibk/newconv/FA-mittelalt_bibkat_sample_002_ucto.folia.xml
ParseError: FoLiA exception in handling of <t-style> @ line 90 (in parent <t-str> @ parent line 89) : [DeepValidationError] Set definition FoLiA-abby-set for t-lang not loaded (document FA-mittelalt_bibkat_sample_002, /home/ubuntu/piro/projects/mwg-digital-doku/dataextraction-infrastructure/processes-lamachine/abbyy2folia/ma_bibk/newconv/FA-mittelalt_bibkat_sample_002_ucto.folia.xml)

proycon commented 3 years ago

> But maybe one has to account for t-hspace in the css for the folia2html converter?

That has been implemented in the latest version of foliatools, t-hspace renders as a simple space (and you can override it in your custom css if you want something more specific).

As to the foliavalidator issue, I see you're doing deep validation, does shallow validation pass? And are you running the FLAT from the latest development LaMachine (because we didn't release the latest changes to FLAT and foliadocserve yet)?

pirolen commented 3 years ago

> > But maybe one has to account for t-hspace in the css for the folia2html converter?
>
> That has been implemented in the latest version of foliatools, t-hspace renders as a simple space (and you can override it in your custom css if you want something more specific).

Seems to me that it does not render as a space. In turn, there are a lot of empty lines in the html. I attach both the input and output files, OK?

> I see you're doing deep validation, does shallow validation pass?

(Just because I thought the deep validation is the safest, but probably there is much more to it...)

If I simply run foliavalidator the output is:

> VALIDATION ERROR on full parse by library (stage 2/3), in /home/ubuntu/piro/projects/mwg-digital-doku/dataextraction-infrastructure/processes-lamachine/abbyy2folia/ma_bibk/newconv/FA-mittelalt_bibkat_sample_002_ucto.folia.xml
> ParseError: FoLiA exception in handling of <div> @ line 86 (in parent <text> @ parent line 85) : [InconsistentText] Text for <Paragraph at 140584772727808 id=FA-mittelalt_bibkat_sample_002.text.div1.p9 set=FoLiA-abby-set class=None>, is inconsistent: EXPECTED (deep text after normalization) *****>
> Exposicio quorundam -orum et vocabulorum 123 38 f.
> ****> BUT FOUND (strict text after normalization) ****>
> Exposicio quorundam -orum et vocabu lorum 123 38 f.
> ******* DEVIATION POINT:  et vocabu<*HERE*> lorum 123
> (also checked against older rules prior to FoLiA v2.4.1)

> And are you running the FLAT from the latest development LaMachine (because we didn't release the latest changes to FLAT and foliadocserve yet)?

Ah, that's right, apologies, forgot. I am not using LaMachine for FLAT. Will test the ucto input to it later then, on the new release.

FA-mittelalt_bibkat_sample_002.png.folia.xml.txt FA-mittelalt_bibkat_sample_002.html.txt

kosloot commented 3 years ago

Observation: when I run foliavalidator on the XML file it is happy, but with -d I get:

VALIDATION ERROR on full parse by library (stage 2/3), in /home/sloot/Downloads/FA-mittelalt_bibkat_sample_002.png.folia.xml.txt
ParseError: FoLiA exception in handling of <t-style> @ line 71 (in parent <t-str> @ parent line 71) : [DeepValidationError] Set definition FoLiA-abby-set for t-lang not loaded (document FA-mittelalt_bibkat_sample_002, /home/sloot/Downloads/FA-mittelalt_bibkat_sample_002.png.folia.xml.txt)

Ucto runs smoothly on this file.

folia2txt runs too, but gives some more empty lines compared to FoLiA-2text (otherwise there are no differences). In particular, there is no complaint about vocabulorum.

Running foliavalidator -d on the Ucto output gives:

VALIDATION ERROR on full parse by library (stage 2/3), in t.xml
ParseError: FoLiA exception in handling of <s> @ line 29 (in parent <p> @ parent line 28) : [DeepValidationError] Not a valid class: SYMBOL (in set https://raw.githubusercontent.com/LanguageMachines/uctodata/master/setdefinitions/tokconfig-deu.foliaset.ttl for w with ID untitleddoc.p.1.s.1.w.1)

The missing SYMBOL seems to be an oversight in the set definition.

kosloot commented 3 years ago

I added SYMBOL (and EMOTICON and PICTOGRAM) to the set definitions. Deep validation is happy now.

pirolen commented 3 years ago

All seems fine so far in my current test scenarios.

pirolen commented 3 years ago

Hi, revisiting this again: I don't see the space from FoLiA-abby after 'auch'.

Screenshot to illustrate the original string legibly: Screenshot 2021-04-27 at 18 02 59

But spotted space lacking, e.g. here after the italics of 'Vgl. auch[space missing]':

In:

<line baseline="593" l="343" t="562" r="732" b="597"><formatting lang="Latin" ff="Times New Roman" fs="10." italic="1" style="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}">
<charParams l="343" t="566" r="364" b="590">V</charParams>
<charParams l="359" t="573" r="378" b="596">g</charParams>
<charParams l="379" t="567" r="387" b="589" suspicious="1">l</charParams>
<charParams l="388" t="585" r="393" b="589" suspicious="1">.</charParams>
<charParams l="394" t="572" r="410" b="589"> </charParams></formatting><formatting lang="GermanStandard" ff="Times New Roman" fs="10." italic="1" style="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}">
<charParams l="411" t="572" r="425" b="589">a</charParams>
<charParams l="427" t="572" r="443" b="588">u</charParams>
<charParams l="445" t="572" r="457" b="589">c</charParams>
<charParams l="458" t="566" r="474" b="588">h</charParams></formatting><formatting lang="GermanStandard" ff="Times New Roman" fs="10." style="{5226A48F-7EB3-4F5B-9980-6EE40A5D4D10}">
<charParams l="475" t="565" r="490" b="588"> </charParams></formatting><formatting lang="Latin" ff="Times New Roman" fs="10." style="{5226A48F-7EB3-4F5B-9980-6EE40A5D4D10}">
<charParams l="491" t="565" r="504" b="588">J</charParams>
<charParams l="510" t="573" r="524" b="588">u</charParams>
<charParams l="531" t="571" r="541" b="588">s</charParams>
<charParams l="542" t="571" r="559" b="587"> </charParams>
<charParams l="560" t="571" r="572" b="587">c</charParams>
<charParams l="578" t="571" r="589" b="587">a</charParams>
<charParams l="596" t="571" r="610" b="587">n</charParams>
<charParams l="614" t="582" r="618" b="587">.</charParams>
<charParams l="619" t="570" r="636" b="587"> </charParams>
<charParams l="637" t="570" r="650" b="587">e</charParams>
<charParams l="655" t="565" r="662" b="587">t</charParams>
<charParams l="663" t="565" r="679" b="587"> </charParams>
<charParams l="680" t="570" r="692" b="587">c</charParams>
<charParams l="698" t="564" r="704" b="586">i</charParams>
<charParams l="710" t="569" r="725" b="586">v</charParams>
<charParams l="727" t="581" r="732" b="585">.</charParams></formatting></line></par>

Out:

      <p xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p10">
        <t class="OCR"><t-str xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p10.t-str.1"><t-style><t-lang class="Latin"/><feat class="italic" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}" subset="font_style"/>Vgl.<t-hspace class="space"/></t-style><t-style><t-lang class="GermanStandard"/><feat class="italic" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{5C5F397F-82D0-4034-ABAE-1551350CF5AC}" subset="font_style"/>auch</t-style><t-style><t-lang class="Latin"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{5226A48F-7EB3-4F5B-9980-6EE40A5D4D10}" subset="font_style"/>Jus can. et civ.<br/></t-style></t-str></t>
        <feat class="{1F10B8CF-03E0-42E4-B62B-C44FE1E4274B}" subset="par_style"/>

pirolen commented 3 years ago

Sorry, also revisiting this:

Ok, the fact that they are in separate <t-str> elements need not be an obstacle, but the fact that there is a newline between these two <t-str> elements (and their contents) will be a problem. Also, in the new situation we're implementing for FoLiA v2.5, this causes a whitespace (and therefore ucto creates a token break).

A correct input for your desired ucto output (in the FoLiA v2.5 situation) would be:

So this is the Abbyy input:

<charParams l="749" t="490" r="772" b="507"> </charParams>
<charParams l="773" t="490" r="785" b="507">e</charParams>
<charParams l="788" t="485" r="795" b="507">t</charParams>
<charParams l="796" t="485" r="811" b="507"> </charParams>
<charParams l="812" t="489" r="826" b="507">v</charParams>
<charParams l="829" t="489" r="843" b="505">o</charParams>
<charParams l="846" t="489" r="858" b="505">c</charParams>
<charParams l="861" t="489" r="872" b="505">a</charParams>
<charParams l="875" t="483" r="889" b="505">b</charParams>
<charParams l="893" t="489" r="907" b="506">u</charParams>
<charParams l="910" t="496" r="917" b="499">¬</charParams></formatting></line>
<line baseline="554" l="371" t="524" r="566" b="553"><formatting lang="Latin" ff="Times New Roman" fs="10." style="{6BF03378-BD9B-4B86-9D1F-B40E548CC919}">
<charParams l="372" t="527" r="378" b="549">l</charParams>
<charParams l="381" t="533" r="395" b="549">o</charParams>
<charParams l="398" t="532" r="407" b="548">r</charParams>
<charParams l="410" t="533" r="423" b="549">u</charParams>
<charParams l="427" t="533" r="448" b="549">m</charParams>
<charParams l="449" t="526" r="468" b="549"> </charParams>

and the FoLiA-abby output:

        <t class="OCR">
<t-str xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p9.t-str.1"><t-style><t-lang class="Latin"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{6BF03378-BD9B-4B86-9D1F-B40E548CC919}" subset="font_style"/>Exposicio quorundam -orum et vocabu<t-hbr/></t-style></t-str>
<t-str xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p9.t-str.2"><t-style><t-lang class="Latin"/><feat class="Times New Roman" subset="font_family"/><feat class="10." subset="font_size"/><feat class="{6BF03378-BD9B-4B86-9D1F-B40E548CC919}" subset="font_style"/>lorum 123 38 f.</t-style></t-str></t>

And thus the t-hbr is still a boundary between "vocabu" and "lorum". So ucto (and folia2html) make separate items from it, i.e.

          <w xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p9.s.1.w.6" class="WORD" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">vocabu</t>
          </w>
          <w xml:id="FA-mittelalt_bibkat_sample_002.text.div1.p9.s.1.w.7" class="WORD" set="tokconfig-deu" textclass="OCR">
            <t class="OCR">lorum</t>
          </w>

Am I overlooking something? :-(

kosloot commented 3 years ago

Hi, revisiting this again: I don't see the space from FoLiA-abby after 'auch'.

This seems to be a bug/oversight. I will look into it.

kosloot commented 3 years ago

And thus the t-hbr is still a boundary between "vocabu" and "lorum". So ucto (and folia2html) make separate items from it, i.e.

This is a more fundamental problem. The FoLiA more or less reflects the nature of the input: vocabu¬ and lorum are on separate lines, and so they show up in separate <t-str> elements too, which I place on separate lines. But according to the new text-normalization rules defined by @proycon, a newline will then show up.

There are several escapes possible.

  1. Maybe it is possible to stuff both <t-str> into one, losing the information that they were on separate lines
  2. Maybe it is possible to output the <t-str>'s adjacent, like </t-str><t-str>. This might omit the newline in text extraction.

Which road would be best? I would opt for the second one, I guess.

As a side-note, the '¬' is output using the --keephyphens option, so maybe that is usable?
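
To make the difference between the two layouts concrete, here is a minimal sketch in Python. It is not how libfolia or ucto extract text; it only concatenates the raw text nodes with the standard library's `xml.etree.ElementTree`, on simplified, namespace-free snippets (the constants `SEPARATE_LINES` and `ADJACENT` and the helper `naive_text` are made up for this illustration). It shows that the newline between the two <t-str> elements is itself a text node, which is what ends up between "vocabu" and "lorum".

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the FoLiA-abby output above
# (no namespace, no xml:id, no <feat> elements).
SEPARATE_LINES = (
    "<t><t-str><t-style>vocabu<t-hbr/></t-style></t-str>\n"
    "<t-str><t-style>lorum 123 38 f.</t-style></t-str></t>"
)
ADJACENT = (
    "<t><t-str><t-style>vocabu<t-hbr/></t-style></t-str>"
    "<t-str><t-style>lorum 123 38 f.</t-style></t-str></t>"
)


def naive_text(xml: str) -> str:
    """Concatenate all text nodes in document order.

    This deliberately applies no FoLiA text normalization; it only
    exposes which whitespace is physically present in the markup.
    """
    return "".join(ET.fromstring(xml).itertext())


if __name__ == "__main__":
    print(repr(naive_text(SEPARATE_LINES)))  # newline survives between the halves
    print(repr(naive_text(ADJACENT)))        # the halves are directly adjacent
```

With the <t-str> elements written as `</t-str><t-str>` there is simply no text node between them, so nothing is left for a text extractor to turn into a space or token break.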

pirolen commented 3 years ago

The --addbreaks option preserves the newline information in the original data, so luckily that is taken care of already.

The space issue for multiple t-style elements was also solved by putting them on one line (i.e., option 2 -- plus introduced the t-hspace, if needed). Would this also work for t-str?

kosloot commented 3 years ago

Hi, revisiting this again: I don't see the space from FoLiA-abby after 'auch'.

This seems to be a bug/oversight. I will look into it.

I think this is fixed now. The problem was that I didn't expect a space to be in a different font style than the characters before it.

kosloot commented 3 years ago

The space issue for multiple t-style elements was also solved by putting them on one line (i.e., option 2 -- plus introduced the t-hspace, if needed). Would this also work for t-str?

I implemented a solution. For now ONLY when the '¬' is present. It is also possible to do this for the 'normal' hyphen '-'. What would be wise?

pirolen commented 3 years ago

Awesome, thanks for both of the changes! If the normal hyphen is the last character on the line, I would say it is also safe to do so. Will be happy to test this.
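
As a rough illustration of the rule discussed here (join a line to the next one only when the hyphenation mark is the last character on the line), a minimal Python sketch on plain text lines follows. It is an assumption-laden stand-in, not the FoLiA-abby implementation: the function name `join_hyphenated` and its `marks` parameter are hypothetical, and it works on strings rather than on <t-str> elements.

```python
from typing import Iterable, List, Tuple


def join_hyphenated(lines: Iterable[str],
                    marks: Tuple[str, ...] = ("¬", "-")) -> List[str]:
    """Join each line to the following one when it ends in a hyphenation mark.

    The mark itself is dropped when lines are joined. A mark on the very
    last line is kept, because there is nothing to join it to.
    """
    stripped = [line.strip() for line in lines]
    joined: List[str] = []
    current = ""
    for i, text in enumerate(stripped):
        current += text
        is_last = i == len(stripped) - 1
        if current.endswith(marks) and not is_last:
            current = current[:-1]  # drop the mark and continue into the next line
        else:
            joined.append(current)
            current = ""
    return joined


if __name__ == "__main__":
    ocr_lines = ["Exposicio quorundam -orum et vocabu¬", "lorum 123 38 f."]
    print(join_hyphenated(ocr_lines))
    # Restricting the rule to '¬' leaves a plain end-of-line hyphen alone.
    print(join_hyphenated(["Erschei-", "nung"], marks=("¬",)))
```

Note that the hyphen inside "-orum" is untouched because only the last character of a line is inspected. A plain '-' at the end of a line can also be a genuine hyphen in a compound, which is why the sketch keeps the set of marks configurable.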

kosloot commented 3 years ago

I gave it a shot. Please test.

pirolen commented 3 years ago

The output looks very good, thank you! Will keep using/testing.

kosloot commented 3 years ago

Closing this issue, as it is a long, messy thread of issues, most of them solved.