Closed by pirolen 3 years ago
In principle this can be achieved rather easily. FoLiA even has a special language annotation to use here. But unfortunately, it is not allowed to append those to text-markup nodes like t-style. @proycon that would be a nice extension.
I CAN add them as a feature to the t-style nodes like this:
<t class="OCR">
<t-str xml:id="FA-piroska.text.div1.p6.t-str.1">
<t-style><feat class="GermanStandard" subset="language"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>Grundherrschaften der Kaiserzeit</t-style>
<t-style><feat class="GermanStandard" subset="language"/><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>* 1</t-style>
</t-str>
</t>
(In this made-up example there isn't a change in language, and probably this is the default, so some optimizing may be possible, but I hope you get the gist.)
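For post-processing, a feature-based encoding like the one above can be read back with a few lines of Python. This is only a minimal sketch using the standard library, not part of any FoLiA tool: the function name `language_runs` is made up for illustration, the fragment is shortened to two features per run, and it carries no namespace declaration (in a complete FoLiA document the tag names would be namespace-qualified and the lookups would need adjusting).

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example above (assumption: no FoLiA namespace declared).
FRAGMENT = """
<t class="OCR">
<t-str xml:id="FA-piroska.text.div1.p6.t-str.1">
<t-style><feat class="GermanStandard" subset="language"/><feat class="Times New Roman" subset="font_family"/>Grundherrschaften der Kaiserzeit</t-style>
<t-style><feat class="GermanStandard" subset="language"/><feat class="superscript" subset="font_typeface"/>* 1</t-style>
</t-str>
</t>
"""


def language_runs(xml_text):
    """Return (language, text) pairs, one per <t-style> element.

    Hypothetical helper: the language is taken from the <feat> whose
    subset is "language"; it is None when no such feature is present.
    """
    root = ET.fromstring(xml_text)
    runs = []
    for style in root.iter("t-style"):
        lang = None
        for feat in style.findall("feat"):
            if feat.get("subset") == "language":
                lang = feat.get("class")
        # The text follows the <feat> children, so itertext() is used
        # to collect it regardless of where it sits inside the element.
        runs.append((lang, "".join(style.itertext())))
    return runs


if __name__ == "__main__":
    for lang, text in language_runs(FRAGMENT):
        print(lang, repr(text))
```

The same loop would work for a `subset` with a different name, such as the "ocr_lang" suggested later in this thread, by changing the string that is compared.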
I'm a bit reluctant to add too many specific text markup variants for all kinds of annotation, because that's not what text markup is for. Still, something like <t-lang> might be an exception that's worth considering.
The verbose and semantically accurate way that already exists is to create <str> elements that refer to a specific portion of the text (marked by <t-str>) and then attach whatever further annotation you desire, <lang> in this case, to the <str> element.
The alternative Ko suggested is also possible already, but that would be a convention left open to interpretation rather than actually semantically encoding the language. Considering language a styling issue is a bit off in my opinion (a simple t-str would be less semantically awkward).
It is very baroque already, unfortunately.
Adding an extra layer of <str> is not very appealing to me, but maybe we must do it. Allowing a LangAnnotation is still my preference.
For the example above that would be:
<t class="OCR">
<t-str xml:id="FA-piroska.text.div1.p6.t-str.1">
<t-style><lang>GermanStandard</lang><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>Grundherrschaften der Kaiserzeit</t-style>
<t-style><lang>GermanStandard</lang><feat class="superscript" subset="font_typeface"/><feat class="Times New Roman" subset="font_family"/><feat class="15." subset="font_size"/><feat class="{96A59073-7007-4008-A803-6C3663B19E8A}" subset="font_style"/>* 1</t-style>
</t-str>
</t>
short and clean imho
Context: I came across documents with fonts that are of several alphabets (or encode some very basic phonetics), and I had to set several languages for Abbyy to get the diacritics right. Then the language ID change gets very frequent. I agree, this is not styling info.
I wonder which solution would make post-processing the FoLiA easier :-) but maybe this is not really related. In the usecase that I am testing, the FoLiA-abby output gets tokenized, marked up in FLAT if there is wrong style info, the corrections post-processed to change the style info, for which they need to refer back to the nontokenized strings, etc.
"Allowing a LangAnnotation is still my preference."
No, that would mess things up way too much. lang is inline annotation and can't be in text content (just like pos tags etc. can't be in text content). These are all separated for a reason :) There shouldn't be XML text in a
A dedicated <t-lang> element is still on the table as a clean and easy solution (but only in untokenised contexts where you have language-context switches!).
At what level will such an element be inserted? As a child of the <t-style>? Or of the <str>? (In which case multiple elements should be allowed, or the <t-str> has to be split in two in the above example.)
There's no strict ordering regarding text markup elements, meaning that all of them can be either children or parents. So you can make it either a child or parent of either t-style or t-str.
Just signaling that the language info from Abbyy is often wrong, but I assume the provenance info makes it possible to track down the source. (Annotators/FoLiA tools might want to override/correct it.)
I'm getting a bit lost now. Maybe it's best to take the simplest solution and add the language info as a feature. That is easy and no new FoLiA hackery is required.
(If so, the feature could simply be "ocr_lang".)
Maybe the user could opt for/against adding the OCR language info? Opting out could mean that if the only "style"/formatting difference between consecutive text parts would be the change in language, they would be in the same t-style element...
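To make the opt-out idea concrete, here is a rough sketch of the merging logic on a simplified data model rather than on real FoLiA or Abbyy XML. Everything in it is an assumption for illustration: each run is a (features, text) pair, the function name `merge_ignoring_language` is made up, and the "Latin" value is a hypothetical second language, not taken from an actual Abbyy file.

```python
def merge_ignoring_language(runs, ignore="language"):
    """Merge consecutive runs whose features differ only in `ignore`.

    `runs` is a list of (features, text) pairs, where `features` maps a
    feature subset (e.g. "font_family") to its class. The ignored subset
    is dropped from the output, since it no longer holds for the whole run.
    """
    merged = []
    for features, text in runs:
        key = {k: v for k, v in features.items() if k != ignore}
        if merged and merged[-1][0] == key:
            # Same formatting as the previous run: extend its text.
            merged[-1] = (key, merged[-1][1] + text)
        else:
            merged.append((key, text))
    return merged


if __name__ == "__main__":
    runs = [
        ({"language": "GermanStandard", "font_family": "Times New Roman"},
         "Grundherrschaften "),
        # Hypothetical language switch with otherwise identical formatting.
        ({"language": "Latin", "font_family": "Times New Roman"},
         "der Kaiserzeit"),
        ({"language": "GermanStandard", "font_family": "Times New Roman",
          "font_typeface": "superscript"}, "* 1"),
    ]
    for features, text in merge_ignoring_language(runs):
        print(features, repr(text))
```

With the option switched off, the first two runs collapse into one t-style-like run, while the superscript run stays separate because its formatting really differs.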
I created a folia2.5 branch that uses the new t-lang feature (this requires libfolia's folia2.5 branch and recompiling Ucto too). The next step in that branch will be to start using the new 'tag' attribute (see "Tagging mechanism to aid processors").
(I guess we can merge the libfolia folia2.5 branch into master already, it seems mature enough and only needs finishing touches?)
In the <formatting> element of Abbyy XML, there is an attribute for language information, e.g. lang="GermanStandard". It would be nice to have this as a feature subset in the resulting FoLiA. Not only because of the language info, but also because a change in recognized language starts a new t-style in the FoLiA, and the reason for the 'fragmentation' is not visible.
E.g.
in Abbyy:
Thank you very much.