TuulaP closed this issue 3 years ago
Hi @TuulaP thanks for your feedback and glad you found this (still not fully implemented) script useful!
Regarding hyphenation, I agree with you that for downstream processing (and ime that's the most common reason for converting ALTO-XML to plain text) it would be preferable to resolve hyphenation by default, so that words appear whole in the output text, without the hyphens. Later on this could also be made configurable, e.g. via a parameter.
I am not sure, though, when I can find the time to adapt the script accordingly, so any suggestions/PRs are very welcome!
note-to-self: adapt treatment of hyphens from https://github.com/KBNLresearch/europeananp-ner/blob/master/alto_tools/alto_to_text.py#L119
Neat tools, nice to have them available!
I just noticed an issue with hyphenated words, of which we have a lot in Finnish and also in Swedish. I don't know if this is intended, but they currently appear in the output in parts rather than combined.
At the moment, e.g. the alto_text function reads the CONTENT attribute, so hyphenated words come out as separate tokens.
E.g. when I compare the original version with a version where the parts are combined, the difference is visible: the combined version gives 'Täydellisempää', while the original gives 'Täydellisem' 'pää'. Which is better depends on whether the output should follow the text line 'boundaries' or be more readable, and there are arguments for either solution.
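To make the two behaviours concrete, here is a minimal sketch of one way the parts could be combined. It is not the script's actual code: the function name `alto_words`, the `dehyphenate` parameter, and the sample XML are made up for illustration. It assumes the OCR engine has filled in the optional ALTO attributes SUBS_TYPE (HypPart1/HypPart2) and SUBS_CONTENT on the String elements, which is not guaranteed for every ALTO file.

```python
import xml.etree.ElementTree as ET

# Made-up sample: one word hyphenated across two text lines.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock>
    <TextLine>
      <String CONTENT="Vielä"/><SP/>
      <String CONTENT="Täydellisem" SUBS_TYPE="HypPart1" SUBS_CONTENT="Täydellisempää"/>
      <HYP CONTENT="-"/>
    </TextLine>
    <TextLine>
      <String CONTENT="pää" SUBS_TYPE="HypPart2" SUBS_CONTENT="Täydellisempää"/><SP/>
      <String CONTENT="teos"/>
    </TextLine>
  </TextBlock></PrintSpace></Page></Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so any ALTO schema version matches."""
    return tag.rsplit("}", 1)[-1]


def alto_words(xml_text, dehyphenate=True):
    """Return the words of an ALTO document in reading order.

    Hypothetical helper: with dehyphenate=True, a word split across lines
    is emitted once, using SUBS_CONTENT of its first part; with
    dehyphenate=False, the raw CONTENT values are returned unchanged.
    """
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue  # skips SP, HYP and all layout elements
        part = elem.get("SUBS_TYPE")
        full = elem.get("SUBS_CONTENT")
        if dehyphenate and full and part == "HypPart1":
            words.append(full)  # emit the whole word at the first part
        elif dehyphenate and full and part == "HypPart2":
            continue  # second part is already covered by the first
        else:
            words.append(elem.get("CONTENT", ""))
    return words


if __name__ == "__main__":
    print(alto_words(SAMPLE_ALTO, dehyphenate=True))
    print(alto_words(SAMPLE_ALTO, dehyphenate=False))
```

With `dehyphenate=False` the output follows the text line boundaries; with `dehyphenate=True` it favours readability. One limitation of this sketch: a HypPart2 whose first part is missing (e.g. on the previous page) would be dropped, so a real implementation would need to handle that case.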