cneud / alto-tools

Python tools for performing various operations on ALTO XML files
Apache License 2.0

Support for hyphenated words in ALTO? #9

Closed TuulaP closed 3 years ago

TuulaP commented 4 years ago

Neat tools, nice to have them available!

I just noticed an issue with hyphenated words, which are very common in Finnish and also in Swedish. I don't know whether this is intended, but they currently appear in the output as separate parts rather than as the combined word.

    <String ID="P1_ST01183" HPOS="2372" VPOS="3307" WIDTH="183" HEIGHT="39" CONTENT="Täydellisem" SUBS_TYPE="HypPart1" SUBS_CONTENT="Täydellisempää" WC="0.99" CC="44545880035"/>
    <HYP HPOS="2555" VPOS="3307" WIDTH="16" CONTENT="-"/>
</TextLine>
<TextLine ID="P1_TL00207" HPOS="1774" VPOS="3346" WIDTH="797" HEIGHT="43">
    <String ID="P1_ST01184" HPOS="1774" VPOS="3348" WIDTH="60" HEIGHT="39" CONTENT="pää" SUBS_TYPE="HypPart2" SUBS_CONTENT="Täydellisempää" WC="0.99" CC="100"/>
    <SP ID="P1_SP00976" HPOS="1834" VPOS="3387" WIDTH="24"/>

At the moment, the alto_text function, for example, takes the CONTENT attribute, so hyphenated words come out as separate tokens.

For example, I ran the original version and a version where the parts are combined, and the difference is visible: 'Täydellisempää' vs. 'Täydellisem' 'pää'. Which is better depends on whether the output should follow the text line boundaries or be more readable, and there may be votes for either solution.

207,208c207,208
< roilja paremmin kestää hallaaki. Täydellisempää
< sala-ojitusta talonpoikain pelloilla ei saata
---
> roilja paremmin kestää hallaaki. Täydellisem
> pää sala-ojitusta talonpoikain pelloilla ei saata
211,216c211,216

cneud commented 4 years ago

Hi @TuulaP, thanks for your feedback, and glad you found this (still not fully implemented) script useful!

Regarding the treatment of hyphenation, I agree with you that for downstream processing (and in my experience that's the most common case of converting ALTO-XML to plain text) it would be preferable to resolve hyphenation by default, so that the words appear in the output text whole and without the hyphens. Later on this could also be controlled by a parameter.
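
To make the behaviour discussed above concrete, here is a minimal sketch of how joining could work on top of the SUBS_TYPE and SUBS_CONTENT attributes shown in the ALTO fragment earlier in this thread. It is not the alto-tools implementation: the function name `text_lines`, the `join_hyphens` parameter, and the `SAMPLE` document (including its namespace and the extra `sala-ojitusta` token) are assumptions made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily reduced ALTO document modelled on the fragment in this issue.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Täydellisem" SUBS_TYPE="HypPart1" SUBS_CONTENT="Täydellisempää"/>
      <HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="pää" SUBS_TYPE="HypPart2" SUBS_CONTENT="Täydellisempää"/>
      <SP/>
      <String CONTENT="sala-ojitusta"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Drop the '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def text_lines(xml_string, join_hyphens=True):
    """Return one string per TextLine.

    With join_hyphens=True (assumed default), a HypPart1 token is replaced by
    its SUBS_CONTENT and the matching HypPart2 token is skipped. With
    join_hyphens=False, the raw CONTENT of every String is used.
    """
    root = ET.fromstring(xml_string)
    lines = []
    for line in root.iter():
        if local_name(line.tag) != "TextLine":
            continue
        words = []
        for el in line:
            if local_name(el.tag) != "String":
                continue  # SP and HYP elements carry no word content here
            subs_type = el.get("SUBS_TYPE")
            if join_hyphens and subs_type == "HypPart1":
                # Emit the full word once, on the line where it starts.
                words.append(el.get("SUBS_CONTENT", el.get("CONTENT", "")))
            elif join_hyphens and subs_type == "HypPart2":
                continue  # already emitted together with HypPart1
            else:
                words.append(el.get("CONTENT", ""))
        lines.append(" ".join(words))
    return lines


if __name__ == "__main__":
    print(text_lines(SAMPLE))
    print(text_lines(SAMPLE, join_hyphens=False))
```

With `join_hyphens=True` the full word moves to the line where it starts, so line lengths no longer match the page layout; with `join_hyphens=False` the original line boundaries are kept. A line consisting only of a HypPart2 token would come out empty in this sketch, which a real implementation would probably want to handle.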

I am not sure, though, when I can find the time to adapt the script accordingly, so any suggestions or PRs are very welcome!

cneud commented 4 years ago

note-to-self: adapt treatment of hyphens from https://github.com/KBNLresearch/europeananp-ner/blob/master/alto_tools/alto_to_text.py#L119

cneud commented 3 years ago

Hi @TuulaP, I saw you found a solution here. I tried and tested this with some of my files and hope it's ok if I use this here as well. Thanks a lot!