LanguageMachines / foliautils

Command-line utilities for working with the Format for Linguistic Annotation (FoLiA), powered by libfolia (C++), written by Ko van der Sloot (CLST, Radboud University)
https://proycon.github.io/folia
GNU General Public License v3.0

FoLiA-abby: include language info from Abbyy XML #57

Closed pirolen closed 3 years ago

pirolen commented 3 years ago

In the <formatting> element of Abbyy XML, there is an attribute for language information, e.g. lang="GermanStandard". It would be nice to have this as a feature subset in the resulting FoLiA.

This matters beyond the language info itself: whenever the recognized language changes, a new t-style starts in the FoLiA, and without the language info the reason for this 'fragmentation' is not visible.

E.g.

<t class="OCR">
  <t-str xml:id="FA-mittelalt_bibkat_sample_001.text.div1.p3.t-str.1">
    <t-style><feat class="Times New Roman" subset="font_family"/><feat class="11." subset="font_size"/><feat class="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}" subset="font_style"/>Summa </t-style>
    <t-style><feat class="Times New Roman" subset="font_family"/><feat class="11." subset="font_size"/><feat class="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}" subset="font_style"/>de </t-style>
    <t-style><feat class="Times New Roman" subset="font_family"/><feat class="11." subset="font_size"/><feat class="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}" subset="font_style"/>VIII </t-style>
    <t-style><feat class="Times New Roman" subset="font_family"/><feat class="11." subset="font_size"/><feat class="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}" subset="font_style"/>partibus </t-style>
    <t-style><feat class="Times New Roman" subset="font_family"/><feat class="11." subset="font_size"/><feat class="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}" subset="font_style"/>463 ef.<br/></t-style>
  </t-str>
</t>

in Abbyy:

<par align="Justified" lineSpacing="1408" style="{92E8FD86-1D32-4617-9916-3FDFBAF682BE}">
<line baseline="161" l="89" t="115" r="534" b="156"><formatting lang="GermanStandard" ff="Times New Roman" fs="11." style="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}">
<charParams l="89" t="119" r="103" b="142">S</charParams>
<charParams l="110" t="126" r="124" b="142">u</charParams>
<charParams l="131" t="126" r="153" b="142" suspicious="1">m</charParams>
<charParams l="161" t="124" r="182" b="144" suspicious="1">m</charParams>
<charParams l="189" t="126" r="201" b="142">a</charParams>
<charParams l="202" t="120" r="218" b="143"> </charParams></formatting><formatting lang="FrenchStandard" ff="Times New Roman" fs="11." style="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}">
<charParams l="219" t="120" r="234" b="143">d</charParams>
<charParams l="236" t="127" r="249" b="143">e</charParams>
<charParams l="250" t="120" r="259" b="143"> </charParams></formatting><formatting lang="GermanStandard" ff="Times New Roman" fs="11." style="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}">
<charParams l="260" t="120" r="280" b="143">V</charParams>
<charParams l="282" t="120" r="289" b="143">I</charParams>
<charParams l="291" t="120" r="298" b="143">I</charParams>
<charParams l="302" t="120" r="308" b="143">I</charParams>
<charParams l="309" t="120" r="325" b="150"> </charParams></formatting><formatting lang="Latin" ff="Times New Roman" fs="11." style="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}">
<charParams l="326" t="127" r="340" b="151">p</charParams>
<charParams l="343" t="127" r="355" b="144">a</charParams>
<charParams l="358" t="128" r="367" b="144">r</charParams>
<charParams l="369" t="123" r="375" b="144">t</charParams>
<charParams l="378" t="122" r="384" b="144">i</charParams>
<charParams l="386" t="121" r="401" b="144">b</charParams>
<charParams l="404" t="128" r="418" b="144">u</charParams>
<charParams l="421" t="128" r="432" b="144">s</charParams>
<charParams l="433" t="122" r="448" b="145"> </charParams></formatting><formatting lang="GermanStandard" ff="Times New Roman" fs="11." style="{8D1B0C68-E626-4BF1-A3FD-82E185D23C0B}">
<charParams l="449" t="123" r="463" b="146">4</charParams>
<charParams l="465" t="123" r="479" b="146">6</charParams>
<charParams l="480" t="123" r="494" b="146">3</charParams>
<charParams l="495" t="122" r="501" b="147" suspicious="1"> </charParams>
<charParams l="502" t="133" r="510" b="146" suspicious="1">e</charParams>
<charParams l="521" t="129" r="528" b="148">f</charParams>
<charParams l="531" t="145" r="534" b="149" suspicious="1">.</charParams></formatting></line></par>

Thank you very much.
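
To make the language switches in the Abbyy sample easier to see, here is a minimal Python sketch that lists one (lang, text) pair per <formatting> element. It is only an illustration and not how FoLiA-abby works internally; the function name language_runs and the shortened SAMPLE string are made up for this example, and the sketch assumes un-namespaced XML as pasted above (real exports may declare a namespace, which is not handled here).

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample modelled on the Abbyy fragment in this issue
# (coordinates omitted, only two <formatting> runs kept).
SAMPLE = (
    '<line baseline="161">'
    '<formatting lang="GermanStandard" ff="Times New Roman" fs="11.">'
    '<charParams>S</charParams><charParams>u</charParams><charParams> </charParams>'
    '</formatting>'
    '<formatting lang="FrenchStandard" ff="Times New Roman" fs="11.">'
    '<charParams>d</charParams><charParams>e</charParams>'
    '</formatting>'
    '</line>'
)


def language_runs(line_xml):
    """Return a list of (lang, text) pairs, one per <formatting> element."""
    line = ET.fromstring(line_xml)
    runs = []
    for fmt in line.iter("formatting"):
        # Concatenate the single characters stored in the <charParams> children.
        text = "".join(cp.text or "" for cp in fmt.iter("charParams"))
        runs.append((fmt.get("lang"), text))
    return runs


for lang, text in language_runs(SAMPLE):
    print(lang, repr(text))
```

Each change of the lang value between neighbouring pairs corresponds to one of the t-style breaks in the FoLiA output.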

kosloot commented 3 years ago

In principle this can be achieved rather easily. FoLiA even has a special language annotation to use here. But unfortunately, it is not allowed to append those to text-markup nodes like t-style. @proycon that would be a nice extension.

I CAN add them as a feature to the t-style nodes like this:

<t class="OCR">
  <t-str xml:id="FA-piroska.text.div1.p6.t-str.1">
    <t-style><feat class="GermanStandard" subset="language"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>Grundherrschaften der Kaiserzeit</t-style>
    <t-style><feat class="GermanStandard" subset="language"/><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>* 1</t-style>
  </t-str>
</t>

(In this made-up example there isn't a change in language, and probably this is the default, so some optimizing may be possible, but I hope you get the gist.)
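
The feature-based variant can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the FoLiA-abby code (which is C++): the helper make_t_style and the SUBSETS table are hypothetical, the three font subsets are taken from the output shown in this thread, and "language" is the subset name proposed in the example above.

```python
import xml.etree.ElementTree as ET

# Abbyy <formatting> attribute -> feature subset (assumed mapping, see lead-in).
SUBSETS = [
    ("lang", "language"),
    ("ff", "font_family"),
    ("fs", "font_size"),
    ("style", "font_style"),
]


def make_t_style(attrs, text):
    """Build a <t-style> element with one <feat> per known attribute."""
    t_style = ET.Element("t-style")
    last_feat = None
    for abbyy_name, subset in SUBSETS:
        value = attrs.get(abbyy_name)
        if value is None:
            continue  # attribute absent in the Abbyy input: emit no feature
        last_feat = ET.SubElement(t_style, "feat", {"class": value, "subset": subset})
    # The text follows the last <feat>; with no features it is plain element text.
    if last_feat is None:
        t_style.text = text
    else:
        last_feat.tail = text
    return t_style


elem = make_t_style(
    {"lang": "GermanStandard", "ff": "Times New Roman", "fs": "15."},
    "Grundherrschaften der Kaiserzeit",
)
print(ET.tostring(elem, encoding="unicode"))
```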

proycon commented 3 years ago

I'm a bit reluctant to add too many specific text markup variants for all kinds of annotation, because that's not what text markup is for. Still, something like <t-lang> might be an exception worth considering.

The verbose but semantically accurate way that already exists is to create <str> elements that refer to a specific portion of the text (marked by <t-str>) and then attach whatever further annotation you desire, <lang> in this case, to the <str> element.

The alternative Ko suggested is also possible already, but that would be a convention left open to interpretation rather than an actual semantic encoding of the language. Considering language a styling issue is a bit off in my opinion (a simple t-str would be less semantically awkward).

kosloot commented 3 years ago

It is very baroque already, unfortunately. Adding an extra layer of <str> is not very appealing to me, but maybe we must do it. Allowing a LangAnnotation is still my preference. For the example above that would be:

<t class="OCR">
  <t-str xml:id="FA-piroska.text.div1.p6.t-str.1">
    <t-style><lang>GermanStandard</lang><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>Grundherrschaften der Kaiserzeit</t-style>
    <t-style><lang>GermanStandard</lang><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>* 1</t-style>
  </t-str>
</t>

short and clean imho

pirolen commented 3 years ago

Context: I came across documents with fonts covering several alphabets (or encoding some very basic phonetics), and I had to set several languages for Abbyy to get the diacritics right. The language ID then changes very frequently. I agree, this is not styling info.

I wonder which solution would make post-processing the FoLiA easier :-) but maybe this is not really related. In the use case I am testing, the FoLiA-abby output gets tokenized and marked up in FLAT wherever the style info is wrong; the corrections are then post-processed to change the style info, for which they need to refer back to the untokenized strings, etc.

proycon commented 3 years ago

"Allowing a LangAnnotation is still my preference."

No, that would mess things up way too much. lang is an inline annotation and can't be in text content (just like PoS tags etc. can't be in text content). These are all separated for a reason :) There shouldn't be XML text in a <t> that is not actual text; it'll complicate matters.

A dedicated <t-lang> element is still on the table as a clean and easy solution (but only in untokenised contexts where you have language-context switches!).

kosloot commented 3 years ago

At what level would such an element be inserted? As a child of the <t-style>, or of the <str>? (In the latter case multiple elements should be allowed, or the <t-str> has to be split in two in the above example.)

proycon commented 3 years ago

There's no strict ordering regarding text markup elements, meaning that all of them can be either children or parents. So you can make it either a child or parent of either t-style or t-str.
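
The two nestings can be compared with a small Python sketch. It assumes that <t-lang> carries a class attribute like the other text-markup elements, which this thread does not spell out, and the language class "deu" and both function names are purely hypothetical. Either way the text content of the resulting element is identical.

```python
import xml.etree.ElementTree as ET


def lang_inside_style(lang, font, text):
    """<t-style> as parent, <t-lang> as child holding the text."""
    t_style = ET.Element("t-style")
    ET.SubElement(t_style, "feat", {"class": font, "subset": "font_family"})
    t_lang = ET.SubElement(t_style, "t-lang", {"class": lang})  # class attribute is an assumption
    t_lang.text = text
    return t_style


def style_inside_lang(lang, font, text):
    """<t-lang> as parent, <t-style> as child holding the text."""
    t_lang = ET.Element("t-lang", {"class": lang})  # class attribute is an assumption
    t_style = ET.SubElement(t_lang, "t-style")
    feat = ET.SubElement(t_style, "feat", {"class": font, "subset": "font_family"})
    feat.tail = text  # the text follows the <feat> inside <t-style>
    return t_lang


a = lang_inside_style("deu", "Times New Roman", "Summa ")
b = style_inside_lang("deu", "Times New Roman", "Summa ")
print(ET.tostring(a, encoding="unicode"))
print(ET.tostring(b, encoding="unicode"))
```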

pirolen commented 3 years ago

Just signaling that the language info from Abbyy is often wrong, but I assume the provenance info lets one track down the source. (Annotators or FoLiA tools might want to override or correct it.)

kosloot commented 3 years ago

I'm getting a bit lost now. Maybe it's best to take the simplest solution and add the language info as a feature. That is easy and no new FoLiA hackery is required.

pirolen commented 3 years ago

(If so, the feature could simply be "ocr_lang".)

pirolen commented 3 years ago

Maybe the user could opt for/against adding the OCR language info? Opting out could mean that if the only "style"/formatting difference between consecutive text parts is the change in language, they end up in the same t-style element...
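
A minimal Python sketch of that opt-out idea: consecutive runs are merged whenever their formatting is identical, and the language only counts as a difference if the user asked to keep it. The function merge_runs, the keep_language switch and the dict layout are hypothetical and not an existing FoLiA-abby option; the RUNS data mirrors the "Summa de VIII partibus" line from the first comment.

```python
from itertools import groupby

# One dict per Abbyy <formatting> run (coordinates and style id omitted).
RUNS = [
    {"lang": "GermanStandard", "ff": "Times New Roman", "fs": "11.", "text": "Summa "},
    {"lang": "FrenchStandard", "ff": "Times New Roman", "fs": "11.", "text": "de "},
    {"lang": "GermanStandard", "ff": "Times New Roman", "fs": "11.", "text": "VIII "},
    {"lang": "Latin", "ff": "Times New Roman", "fs": "11.", "text": "partibus "},
    {"lang": "GermanStandard", "ff": "Times New Roman", "fs": "11.", "text": "463 ef."},
]


def merge_runs(runs, keep_language=True):
    """Merge consecutive runs whose formatting (and optionally language) match."""

    def key(run):
        style = (run.get("ff"), run.get("fs"), run.get("style"))
        return (run.get("lang"),) + style if keep_language else style

    merged = []
    for _, group in groupby(runs, key=key):
        group = list(group)
        combined = dict(group[0])
        combined["text"] = "".join(run["text"] for run in group)
        if not keep_language:
            combined.pop("lang", None)  # language info is dropped when opting out
        merged.append(combined)
    return merged


print(len(merge_runs(RUNS, keep_language=True)))
print(merge_runs(RUNS, keep_language=False)[0]["text"])
```

With the language kept, the five runs stay separate; with the language ignored, they collapse into a single run because font family and size never change on this line.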

kosloot commented 3 years ago

I created a folia2.5 branch that uses the new t-lang feature (it requires libfolia's folia2.5 branch, and Ucto needs recompiling too). The next step in that branch will be to start using the new 'tag' attribute (see "Tagging mechanism to aid processors").

proycon commented 3 years ago

(I guess we can merge the libfolia folia2.5 branch into master already, it seems mature enough and only needs finishing touches?)