Closed: iiegn closed this issue 1 year ago
just realized: instead of pass, it's necessary to set text = ""
...
To get hyphenated content as in the original file
    if ('HypPart1' in line.attrib.get('SUBS_TYPE')):
        text = line.attrib.get('CONTENT') + ' '
    if ('HypPart2' in line.attrib.get('SUBS_TYPE')):
        text = line.attrib.get('CONTENT') + ' '
should avoid duplicates.
Thank you @imlabormitlea-code! I will test this also next week and if all checks out, make a PR and credit you for the fix (i.e. unless you want to make a PR for this).
Basically, there seem to be two ways to deal with the hyphenation info from the ALTO XML.
Looking at a live example from the ALTO LOC website:
    <TextLine ...>
      <String STYLEREFS="ID7" HEIGHT="72.0" WIDTH="276.0" HPOS="6150.0" VPOS="5388.0" CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge" WC="1.0"/>
      <HYP WIDTH="1.0" HPOS="6427.0" VPOS="5388.0" CONTENT="-"/>
    </TextLine>
    <TextLine ...>
      <String STYLEREFS="ID7" HEIGHT="90.0" WIDTH="213.0" HPOS="4053.0" VPOS="5556.0" CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge" WC="1.0"/>
    </TextLine>
1. Where ALTO contains a String with SUBS_TYPE="HypPart1", use the SUBS_CONTENT to substitute the hyphenated word in the text output. SUBS_CONTENT must be used only once; the second part of the hyphenated content should be ignored, e.g. as suggested by https://github.com/cneud/alto-tools/issues/16#issuecomment-1180407894.
Code:
    if ('HypPart1' in line.attrib.get('SUBS_TYPE')):
        text = line.attrib.get('SUBS_CONTENT') + ' '
    if ('HypPart2' in line.attrib.get('SUBS_TYPE')):
        text = ''
Output:
averAge
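To make the first approach concrete, here is a minimal, self-contained sketch. It is not the alto-tools implementation: the function name lines_with_substitution, the namespace-free sample and the extra words "the" and "rose" are assumptions made up for illustration (real ALTO files carry an XML namespace).

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample modelled on the two TextLines above.
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="the"/>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
    <String CONTENT="rose"/>
  </TextLine>
</TextBlock>
"""


def lines_with_substitution(xml_text):
    """Return one string per TextLine, using SUBS_CONTENT exactly once."""
    out = []
    for text_line in ET.fromstring(xml_text).iter("TextLine"):
        words = []
        for string in text_line.iter("String"):
            # Default to "" so a String without SUBS_TYPE does not raise.
            subs_type = string.attrib.get("SUBS_TYPE", "")
            if subs_type == "HypPart1":
                # Emit the full, de-hyphenated word here ...
                words.append(string.attrib.get("SUBS_CONTENT", ""))
            elif subs_type == "HypPart2":
                # ... and skip the second half to avoid a duplicate.
                continue
            else:
                words.append(string.attrib.get("CONTENT", ""))
        out.append(" ".join(words))
    return out


print(lines_with_substitution(SAMPLE))
```

Note that the whole word moves to the first line, so the line contents no longer match the line images in the scan.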
2. For output closer to the original, as e.g. suggested by https://github.com/cneud/alto-tools/issues/16#issuecomment-1339628976, use the CONTENT of each String, with the HYP rendered as - followed by a text line break.
Code:
    if ('HypPart1' in line.attrib.get('SUBS_TYPE')):
        text = line.attrib.get('CONTENT') + '-'
    if ('HypPart2' in line.attrib.get('SUBS_TYPE')):
        text = line.attrib.get('CONTENT') + ' '
Output:
aver-
age
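The second approach can be sketched in the same way. Again, this is only an illustration under assumptions: lines_with_hyphen is a hypothetical name, the sample is namespace-free, and the words "the" and "rose" are invented padding. Like the snippet above, it appends a literal "-" to a HypPart1 String instead of reading the HYP element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample modelled on the two TextLines above.
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="the"/>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
    <String CONTENT="rose"/>
  </TextLine>
</TextBlock>
"""


def lines_with_hyphen(xml_text):
    """Return one string per TextLine, keeping the source hyphenation."""
    out = []
    for text_line in ET.fromstring(xml_text).iter("TextLine"):
        words = []
        for string in text_line.iter("String"):
            word = string.attrib.get("CONTENT", "")
            # Default to "" so a String without SUBS_TYPE does not raise.
            if string.attrib.get("SUBS_TYPE", "") == "HypPart1":
                word += "-"  # first half keeps its end-of-line hyphen
            words.append(word)
        out.append(" ".join(words))
    return out


print("\n".join(lines_with_hyphen(SAMPLE)))
```

Here every output line still corresponds to one TextLine, which is what textline-image alignment relies on.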
Note the correct capitalisation compared to using the substitution content (this is from the ALTO source).
I am not sure I have a preference for how this should be handled by default. Thoughts?
i guess, there is no right answer to this. however, once converted to text it's not obvious what is what: (from a small sample and more intuition than fact observation) i've seen cases where the end-of-line hyphenation was not recognized by the OCR engine but the text representation still contained a hyphen at the end of the line. these lines would be indistinguishable from https://github.com/cneud/alto-tools/issues/16#issuecomment-1439069439 option 2). otoh, when the OCR engine does recognize an end-of-line hyphenation, it seems to be often correct. for me, as i'm interested in the text for further NLP processing, having some words back to 'normal' (that is, known to NLP tools) form is a plus; and knowing that they are often correct is also a plus.
so, i have a preference for option 1)... but as long as there is a run-time option to select one or the other, i don't really care. this raises the question (for me, then): should this become a run-time option?
Many thanks for sharing your thoughts on this @iiegn!
I am leaning more towards option 2. My thinking is that alto-tools should itself aim not to alter/normalize the OCR text in any way - and while the HypPart is indeed a part of the OCR output (I assume it was added in ALTO for the reason you describe, ease of downstream NLP processing with normalized tokens), I now see it as more problematic that substituting hyphens may also alter line breaks - this can cause trouble for applications that use some form of textline-image alignment, like evaluation or training. And I suppose it could be a greater source of confusion and require more documentation than keeping the hyphenation from the source document. The fact that the capitalization (averAge vs aver-age) is different in the example ALTO I used is also still confusing me. I don't expect an OCR engine would do that, it must be some post-processing...
TL;DR - a runtime option would satisfy more downstream use cases at the cost of increased documentation.
I guess I will implement option 2 first as the default. But I can also see the value in option 1 and would like to implement it, just not very soon (too much other stuff going on). PRs are obviously welcome :) Also, example test files with hyphenation edge cases would be helpful if they can be shared freely.
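For anyone picking this up, a run-time switch between the two behaviours could look roughly like the sketch below. Everything in it is an assumption for discussion: the --hyphenation flag, its keep/substitute values and the render_string helper are hypothetical and not part of alto-tools.

```python
import argparse


def render_string(attrib, mode):
    """Return the text for one ALTO String, given its attribute dict."""
    # Default to "" so a String without SUBS_TYPE does not raise.
    subs_type = attrib.get("SUBS_TYPE", "")
    if mode == "substitute":
        if subs_type == "HypPart1":
            return attrib.get("SUBS_CONTENT", "")  # full word, emitted once
        if subs_type == "HypPart2":
            return ""  # already covered by HypPart1
    elif subs_type == "HypPart1":
        return attrib.get("CONTENT", "") + "-"  # keep source hyphenation
    return attrib.get("CONTENT", "")


def build_parser():
    parser = argparse.ArgumentParser(description="hyphenation mode sketch")
    # Hypothetical flag; "keep" mirrors option 2 as the proposed default.
    parser.add_argument("--hyphenation", choices=["keep", "substitute"],
                        default="keep")
    return parser


part1 = {"CONTENT": "aver", "SUBS_TYPE": "HypPart1", "SUBS_CONTENT": "averAge"}
part2 = {"CONTENT": "age", "SUBS_TYPE": "HypPart2", "SUBS_CONTENT": "averAge"}

args = build_parser().parse_args(["--hyphenation", "substitute"])
print(repr(render_string(part1, args.hyphenation)),
      repr(render_string(part2, args.hyphenation)))
```

Keeping the decision in one small function means the default can stay at "keep" while the substitution path remains opt-in.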
while using alto_text(), i see duplicated content at hyphenations: the word at the end of one line and at the beginning of the next are identical (un-hyphenated) words. it seems SUBS_CONTENT gets used from both HypPart1 and HypPart2.
looking into this, i'd assume the indentation of the block: https://github.com/cneud/alto-tools/blob/e942f86523414d2c0c5cfdfd5f280850acbdfecd/alto_tools.py#L66-L67 is one too many?
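The duplication can be reproduced in isolation with a small sketch. It assumes, without claiming that this is what alto_tools.py does, that text is built by preferring SUBS_CONTENT whenever a String carries it; naive_text and the namespace-free sample are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free sample with one hyphenated word.
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
  </TextLine>
</TextBlock>
"""


def naive_text(xml_text):
    """Join all Strings, preferring SUBS_CONTENT for both hyphen parts."""
    words = []
    for string in ET.fromstring(xml_text).iter("String"):
        # No distinction between HypPart1 and HypPart2: the full word is
        # taken from each half, which is what produces the duplicate.
        words.append(string.attrib.get("SUBS_CONTENT")
                     or string.attrib.get("CONTENT", ""))
    return " ".join(words)


print(naive_text(SAMPLE))
```

Both halves carry the same SUBS_CONTENT, so the full word appears twice unless the HypPart2 half is skipped.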