cneud / alto-tools

Python tools for performing various operations on ALTO XML files
Apache License 2.0

Hyphenated content gets duplicated #16

Closed iiegn closed 1 year ago

iiegn commented 2 years ago

while using alto_text(), i see duplicated content at hyphenations: the word at the end of one line and at the beginning of the next are identical (un-hyphenated) words. it seems, SUBS_CONTENT gets used from both HypPart1 and HypPart2.

looking into this, i'd assume the indentation of the block: https://github.com/cneud/alto-tools/blob/e942f86523414d2c0c5cfdfd5f280850acbdfecd/alto_tools.py#L66-L67 is one too many?

iiegn commented 2 years ago

just realized, instead of pass it's necessary to set text = ""...
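
A minimal sketch of why `pass` is not enough, assuming the extraction loop reuses a single `text` variable and writes it out after every string. The `strings` list and the `extract` function are hypothetical stand-ins for illustration, not the project's actual code:

```python
# Hypothetical stand-in for the String attributes of two consecutive TextLines.
strings = [
    {"CONTENT": "aver", "SUBS_TYPE": "HypPart1", "SUBS_CONTENT": "averAge"},
    {"CONTENT": "age", "SUBS_TYPE": "HypPart2", "SUBS_CONTENT": "averAge"},
]


def extract(strings, reset_on_part2):
    """Collect the text emitted per string, reusing one `text` variable."""
    out = []
    text = ""
    for attrib in strings:
        subs_type = attrib.get("SUBS_TYPE", "")
        if not subs_type:
            text = attrib["CONTENT"] + " "
        elif "HypPart1" in subs_type:
            text = attrib["SUBS_CONTENT"] + " "
        elif "HypPart2" in subs_type:
            if reset_on_part2:
                text = ""  # explicitly clear, so nothing is emitted
            # otherwise this behaves like `pass`: `text` keeps its old value
        out.append(text)
    return "".join(out)


print(repr(extract(strings, reset_on_part2=False)))  # 'averAge averAge '
print(repr(extract(strings, reset_on_part2=True)))   # 'averAge '
```

With `pass`, the value assigned for HypPart1 is still in `text` when the HypPart2 string is written, which would produce exactly the duplication described above.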

imlabormitlea-code commented 1 year ago

To get hyphenated content as in the original file

      if ('HypPart1' in line.attrib.get('SUBS_TYPE')):
          text = line.attrib.get('CONTENT') + ' '
      if ('HypPart2' in line.attrib.get('SUBS_TYPE')):
          text = line.attrib.get('CONTENT') + ' '

should avoid duplicates.

cneud commented 1 year ago

Thank you @imlabormitlea-code! I will test this also next week and if all checks out, make a PR and credit you for the fix (i.e. unless you want to make a PR for this).

cneud commented 1 year ago

Basically, there seem to be two ways to deal with the hyphenation info from the ALTO xml.

Looking at a live example from the ALTO LOC website:

<TextLine ...>
  <String STYLEREFS="ID7" HEIGHT="72.0" WIDTH="276.0" HPOS="6150.0" VPOS="5388.0" CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge" WC="1.0"/>
  <HYP WIDTH="1.0" HPOS="6427.0" VPOS="5388.0" CONTENT="-"/>
</TextLine>
<TextLine ...>
  <String STYLEREFS="ID7" HEIGHT="90.0" WIDTH="213.0" HPOS="4053.0" VPOS="5556.0" CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge" WC="1.0"/>
</TextLine>

1) substitute hyphens

Where ALTO contains a string with SUBS_TYPE="HypPart1", use the SUBS_CONTENT to substitute the hyphenated word in the text output.

SUBS_CONTENT must be used only once; the second part of the hyphenated word should be ignored, e.g. as suggested in https://github.com/cneud/alto-tools/issues/16#issuecomment-1180407894.

Code:

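# Default to '' so a missing SUBS_TYPE does not raise a TypeError
subs_type = line.attrib.get('SUBS_TYPE', '')
if 'HypPart1' in subs_type:
    # Emit the full word once, at the first part
    text = line.attrib.get('SUBS_CONTENT') + ' '
elif 'HypPart2' in subs_type:
    # Emit nothing for the second part
    text = ''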

Output:

averAge
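
For a fuller picture, here is a self-contained sketch of option 1 using `xml.etree.ElementTree`. It is only an illustration under stated assumptions: the `SAMPLE` fragment is a trimmed version of the example above with the ALTO namespace and most attributes omitted, the neighbouring words "an" and "value" are made up to show the effect on line breaks, and `text_substituted` is a hypothetical helper, not a function in alto-tools.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free fragment; "an" and "value" are invented neighbours.
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="an"/>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
    <String CONTENT="value"/>
  </TextLine>
</TextBlock>
"""


def text_substituted(root):
    """Option 1: emit SUBS_CONTENT once (at HypPart1) and skip HypPart2."""
    lines = []
    for text_line in root.iter("TextLine"):
        words = []
        for string in text_line.findall("String"):
            subs_type = string.get("SUBS_TYPE", "")
            if subs_type == "HypPart1":
                # Fall back to CONTENT if SUBS_CONTENT is missing.
                words.append(string.get("SUBS_CONTENT", string.get("CONTENT", "")))
            elif subs_type == "HypPart2":
                continue  # already emitted together with the first part
            else:
                words.append(string.get("CONTENT", ""))
        lines.append(" ".join(words))
    return "\n".join(lines)


print(text_substituted(ET.fromstring(SAMPLE)))
```

Note that the whole word ends up on the first line, so the text lines no longer match the line contents of the source.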

2) don't substitute hyphens

Output closer to the original, as suggested e.g. in https://github.com/cneud/alto-tools/issues/16#issuecomment-1339628976: use the CONTENT of each string, render the HYP as - and keep the text line break.

Code:

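# Default to '' so a missing SUBS_TYPE does not raise a TypeError
subs_type = line.attrib.get('SUBS_TYPE', '')
if 'HypPart1' in subs_type:
    # First part: keep the OCR text and append the hyphen
    text = line.attrib.get('CONTENT') + '-'
elif 'HypPart2' in subs_type:
    # Second part: keep the OCR text as any other word
    text = line.attrib.get('CONTENT') + ' '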

Output:

aver-
age
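
A self-contained sketch of option 2, again only as an illustration. Unlike the snippet above, it takes the hyphen glyph from the `HYP` element instead of hardcoding `-`, which is one possible design choice rather than what alto-tools does. The `SAMPLE` fragment is trimmed and namespace-free, the words "an" and "value" are invented, and `text_verbatim` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed, namespace-free fragment; "an" and "value" are invented neighbours.
SAMPLE = """
<TextBlock>
  <TextLine>
    <String CONTENT="an"/>
    <String CONTENT="aver" SUBS_TYPE="HypPart1" SUBS_CONTENT="averAge"/>
    <HYP CONTENT="-"/>
  </TextLine>
  <TextLine>
    <String CONTENT="age" SUBS_TYPE="HypPart2" SUBS_CONTENT="averAge"/>
    <String CONTENT="value"/>
  </TextLine>
</TextBlock>
"""


def text_verbatim(root):
    """Option 2: keep CONTENT per string and attach HYP to the previous word."""
    lines = []
    for text_line in root.iter("TextLine"):
        words = []
        for child in text_line:  # children in document order
            if child.tag == "String":
                words.append(child.get("CONTENT", ""))
            elif child.tag == "HYP" and words:
                # Glue the hyphen to the preceding word, without a space.
                words[-1] += child.get("CONTENT", "-")
        lines.append(" ".join(words))
    return "\n".join(lines)


print(text_verbatim(ET.fromstring(SAMPLE)))
```

Here every output line corresponds to exactly one TextLine of the source, and SUBS_CONTENT is never consulted.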

Note the correct capitalisation (aver- / age) compared to the substitution content (averAge); the odd capitalisation of SUBS_CONTENT is taken as-is from the ALTO source.

I am not sure I have a preference for how this should be handled by default. Thoughts?

iiegn commented 1 year ago

i guess, there is no right answer to this. however, once converted to text it's not obvious what is what: (from a small sample and more intuition than fact observation) i've seen cases where the end-of-line hyphenation was not recognized by the OCR engine but the text representation still contained a hyphen at the end of the line. these lines would be indistinguishable from https://github.com/cneud/alto-tools/issues/16#issuecomment-1439069439 option 2). otoh, when the OCR engine does recognize an end-of-line hyphenation, it seems to be often correct. for me, as i'm interested in the text for further NLP processing, having some words back to 'normal' (that is, known to NLP tools) form is a plus; and knowing that they are often correct is also a plus.

so, i have a preference for option 1)... but as long as there is a run-time option to select one or the other, i don't really care. this begs the question (then for me), should this become a run-time option?

cneud commented 1 year ago

Many thanks for sharing your thoughts on this @iiegn!

I am leaning more towards option 2. My thinking is that alto-tools itself should aim not to alter or normalize the OCR text in any way. While the HypPart is indeed part of the OCR output (I assume it was added to ALTO for the reason you describe: ease of downstream NLP processing with normalized tokens), I now see it as more problematic that substituting hyphens may also alter line breaks. This can cause trouble for applications that rely on some form of textline-image alignment, like evaluation or training. I also suppose it could be a greater source of confusion and require more documentation than keeping the hyphenation from the source document. The fact that the capitalization (averAge vs aver-age) differs in the example ALTO I used is also still confusing me. I don't expect an OCR engine would do that; it must be some post-processing...

TL;DR - a runtime option would satisfy more downstream use cases at the cost of increased documentation.
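
As a rough idea of what such a runtime option could look like, here is a sketch using `argparse`. The `--hyphenation` flag, its `keep`/`substitute` values and the `render_string` helper are all hypothetical names chosen for illustration; nothing like this exists in alto-tools at this point.

```python
import argparse


def build_parser():
    """Build a parser with a hypothetical --hyphenation switch."""
    parser = argparse.ArgumentParser(description="ALTO text extraction (sketch)")
    parser.add_argument(
        "--hyphenation",
        choices=["keep", "substitute"],
        default="keep",  # option 2 as the default
        help="keep: CONTENT plus hyphen; substitute: SUBS_CONTENT used once",
    )
    return parser


def render_string(attrib, mode):
    """Return the text to emit for one String element's attributes."""
    subs_type = attrib.get("SUBS_TYPE", "")
    content = attrib.get("CONTENT", "")
    if mode == "substitute":
        if subs_type == "HypPart1":
            return attrib.get("SUBS_CONTENT", content) + " "
        if subs_type == "HypPart2":
            return ""  # full word was already emitted at HypPart1
        return content + " "
    # mode == "keep": stay close to the OCR text
    if subs_type == "HypPart1":
        return content + "-"
    return content + " "


part1 = {"CONTENT": "aver", "SUBS_TYPE": "HypPart1", "SUBS_CONTENT": "averAge"}
part2 = {"CONTENT": "age", "SUBS_TYPE": "HypPart2", "SUBS_CONTENT": "averAge"}

# Pass the arguments explicitly so the sketch runs without a command line.
args = build_parser().parse_args(["--hyphenation", "substitute"])
print(repr(render_string(part1, args.hyphenation) + render_string(part2, args.hyphenation)))
```

Line breaks between TextLines are left out here on purpose; they would be handled by the surrounding loop regardless of the chosen mode.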

I guess I will implement option 2 first as the default. But I can also see the value in option 1 and would like to implement it, just not very soon (too much other stuff going on). PRs are obviously welcome :) Also, example test files with hyphenation edge cases would be helpful if they can be shared freely.