OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

simple types in PAGE model are broken #451

Closed bertsky closed 4 years ago

bertsky commented 4 years ago

This is a regression from #437:

>>> regions[0].get_type()
'page-number'
>>> isinstance(regions[0].get_type(), str)
True
>>> isinstance(TextTypeSimpleType.PAGENUMBER, str)
False
>>> TextTypeSimpleType.PAGENUMBER == 'page-number'
False
>>> regions[0].set_type(TextTypeSimpleType.PAGENUMBER)
>>> regions[0].get_type()
<TextTypeSimpleType.PAGENUMBER: 'page-number'>
>>> str(TextTypeSimpleType.PAGENUMBER)
'TextTypeSimpleType.PAGENUMBER'

The latter is also what is used for XML serialization. This in turn produces invalid PAGE output when using ocrd-tesserocr-segment-region, which afterwards, even without validation, triggers tons of error messages of the following form:

Warning: Value "TextTypeSimpleType.HEADING" near line 56 does not match xsd enumeration restriction on TextTypeSimpleType

This is super-urgent. I recommend doing a revert release first and then starting a proper investigation.
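
For illustration, the behaviour in the session above can be reproduced with the standard library alone. This is a minimal sketch, not the generated PAGE model code: `PlainTextType`, `StrTextType` and `xml_value` are hypothetical names, and the `str` mixin and the unwrapping helper are only possible mitigations, not necessarily what OCR-D/core ends up doing.

```python
from enum import Enum


class PlainTextType(Enum):
    """Hypothetical stand-in for a generated simple type: a plain Enum."""
    PAGENUMBER = 'page-number'
    HEADING = 'heading'


class StrTextType(str, Enum):
    """Hypothetical variant that mixes in str, so members compare equal to strings."""
    PAGENUMBER = 'page-number'
    HEADING = 'heading'


def xml_value(value):
    """Return the text to serialize: unwrap Enum members, pass plain strings through."""
    return value.value if isinstance(value, Enum) else value


# A plain Enum member is not a str and does not compare equal to its value.
print(isinstance(PlainTextType.PAGENUMBER, str))   # False
print(PlainTextType.PAGENUMBER == 'page-number')   # False
# str() yields the qualified member name, which is what leaks into the XML.
print(str(PlainTextType.PAGENUMBER))               # PlainTextType.PAGENUMBER

# With the str mixin, comparison and isinstance behave like the old string API.
print(StrTextType.PAGENUMBER == 'page-number')     # True
print(isinstance(StrTextType.PAGENUMBER, str))     # True

# Serializing via .value is safe for both variants and for plain strings.
print(xml_value(PlainTextType.HEADING))            # heading
print(xml_value('heading'))                        # heading
```

Note that `str()` on a `str`-mixin member can still return the qualified name depending on the Python version, so a serializer should rely on `.value` rather than on `str()` either way.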

bertsky commented 4 years ago

@kba can you please revert 3a0a3a8351124020bea127e9ff15e3ba63541f8f from #437 and make a new release, so we at least have a working master?

kba commented 4 years ago

https://github.com/OCR-D/core/commit/3a0a3a8351124020bea127e9ff15e3ba63541f8f reverted as a quickfix in v2.4.3. That reopens the issue with @conf but you're mitigating that already so this is the least worst solution. Will need to revisit to properly fix and not introduce more regressions.

bertsky commented 4 years ago

> 3a0a3a8 reverted as a quickfix in v2.4.3. That reopens the issue with @conf but you're mitigating that already so this is the least worst solution. Will need to revisit to properly fix and not introduce more regressions.

19afb8d08947a5ad39f6e30c33dd59b2bada7cea is not a true revert of 3a0a3a8351124020bea127e9ff15e3ba63541f8f, and it does not fix this unfortunately!

The reason seems to be that you ran generateDS again with the bad new version (instead of the good old one): the diff shows no difference other than the date (version and generated code stay the same), whereas the version and code are exactly the change we want/need to revert.