OCR-D / core

Collection of OCR-related python tools and wrappers from @OCR-D
https://ocr-d.de/core/
Apache License 2.0

parse fails to validate result of to_xml #269

Closed bertsky closed 4 years ago

bertsky commented 5 years ago

I get a regression with 1.0.0b11: The call to page_from_file fails at ocrd_models.ocrd_page_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (It complains in validate_ConfSimpleType that the value is a str instead of a number.)

This is what I did:

ocrd-asv-ann-evaluate -m $mets -I OCR-D-GT-SEG-LINE,OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP,OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP

where all the OCR file grps are from a previous recognize processor in a long chain that runs through ok. See here for what the processor does.

This is what happens:

16:05:16.373 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0001
16:05:16.375 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.378 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.381 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.383 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.385 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0001
16:05:16.387 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0002
16:05:16.389 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.391 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.393 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.396 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.399 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0002
16:05:16.401 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0003
16:05:16.402 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.405 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.407 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.410 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.412 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0003
16:05:16.415 DEBUG processor.EvaluateLines - adding input file group OCR-D-GT-SEG-LINE to page phys_0004
16:05:16.417 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.419 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-OCRO-frakturjze-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.422 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-Fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.424 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.427 DEBUG processor.EvaluateLines - adding input file group OCR-D-OCR-TESS-frk+deu-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP to page phys_0004
16:05:16.430 INFO processor.EvaluateLines - processing page phys_0001
16:05:16.431 INFO processor.EvaluateLines - INPUT FILE for OCR-D-GT-SEG-LINE: OCR-D-GT-SEG-LINE_0001
16:05:16.465 INFO processor.EvaluateLines - INPUT FILE for OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP: OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001
Traceback (most recent call last):
  File "/home/xbert/unsortiert/arbeit/heyer/tools/ocrd_tesserocr/env3/bin/ocrd-asv-ann-evaluate", line 11, in <module>
    load_entry_point('ocrd-cor-asv-ann', 'console_scripts', 'ocrd-asv-ann-evaluate')()
  File "click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/cli.py", line 16, in ocrd_cor_asv_ann_evaluate
    return ocrd_cli_wrap_processor(EvaluateLines, *args, **kwargs)
  File "ocrd/decorators.py", line 38, in ocrd_cli_wrap_processor
    run_processor(processorClass, ocrd_tool, mets, workspace=workspace, **kwargs)
  File "ocrd/processor/base.py", line 65, in run_processor
    processor.process()
  File "/home/xbert/unsortiert/arbeit/heyer/ocr-d/cor-asv-ann/.gitworktree-master/ocrd_cor_asv_ann/wrapper/evaluate.py", line 71, in process
    pcgts = page_from_file(self.workspace.download_file(input_file))
  File "ocrd_modelfactory/__init__.py", line 71, in page_from_file
    return parse(input_file.local_filename, silence=True)
  File "ocrd_models/ocrd_page_generateds.py", line 11222, in parse
    rootObj.build(rootNode)
  File "ocrd_models/ocrd_page_generateds.py", line 1069, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 1084, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 2406, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 2544, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 11073, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 11155, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3057, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3122, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3446, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3499, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 3776, in build
    self.buildChildren(child, node, nodeName_)
  File "ocrd_models/ocrd_page_generateds.py", line 3837, in buildChildren
    obj_.build(child_)
  File "ocrd_models/ocrd_page_generateds.py", line 4013, in build
    self.buildAttributes(node, node.attrib, already_processed)
  File "ocrd_models/ocrd_page_generateds.py", line 4030, in buildAttributes
    self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType
  File "ocrd_models/ocrd_page_generateds.py", line 3934, in validate_ConfSimpleType
    if value < 0:
TypeError: '<' not supported between instances of 'str' and 'int'

The offending PAGE-XML is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine under http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.
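
To rule out the file itself, the conf values can be checked independently of the generated parser. This is only a minimal sketch: the helper name invalid_conf_values and the sample document are made up for illustration, and the upper bound of 1 for ConfSimpleType is an assumption (the traceback only shows the check against 0).

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Made-up, heavily shortened sample; not a complete PAGE document.
SAMPLE = (
    '<pc:PcGts xmlns:pc="%s"><pc:Page>'
    '<pc:TextEquiv conf="0.93"/>'
    '<pc:TextEquiv conf="1.5"/>'
    '<pc:TextEquiv conf="high"/>'
    '<pc:TextEquiv/>'
    '</pc:Page></pc:PcGts>' % PAGE_NS
)


def invalid_conf_values(xml_text):
    """Return all TextEquiv/@conf values that are not floats in [0, 1]."""
    root = ET.fromstring(xml_text)
    bad = []
    for text_equiv in root.iter("{%s}TextEquiv" % PAGE_NS):
        raw = text_equiv.get("conf")
        if raw is None:
            continue  # @conf is optional, nothing to check
        try:
            value = float(raw)
        except ValueError:
            bad.append(raw)  # not a number at all
            continue
        if not 0.0 <= value <= 1.0:  # assumed range of ConfSimpleType
            bad.append(raw)
    return bad


print(invalid_conf_values(SAMPLE))
```

An empty list for the real file would support the observation that the XML is fine and the problem lies in how the parser handles the attribute.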

bertsky commented 5 years ago

This is a real showstopper. It effectively breaks all further processing of OCR results. And ocrd_tesserocr master is now dependent on b11...

bertsky commented 5 years ago

NB: JPageViewer 1.3 does render the file correctly after replacing 2019 with 2018 and removing Page/@orientation.

@wrznr Have you experienced anything similar yet?

bertsky commented 5 years ago

BTW, it does help to manually remove all TextEquiv/@conf.
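
For anyone who wants to script that removal, here is a rough sketch using only the Python standard library. The function name strip_conf and the sample document are hypothetical, and the confidence values are simply thrown away, so this is a stopgap rather than a fix.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Keep a readable prefix when serializing (otherwise ElementTree invents ns0).
ET.register_namespace("pc", PAGE_NS)

# Made-up, heavily shortened sample; not a complete PAGE document.
SAMPLE = (
    '<pc:PcGts xmlns:pc="%s"><pc:Page>'
    '<pc:TextEquiv conf="0.93"><pc:Unicode>a</pc:Unicode></pc:TextEquiv>'
    '<pc:TextEquiv conf="0.5"><pc:Unicode>b</pc:Unicode></pc:TextEquiv>'
    '<pc:TextEquiv><pc:Unicode>c</pc:Unicode></pc:TextEquiv>'
    '</pc:Page></pc:PcGts>' % PAGE_NS
)


def strip_conf(xml_text):
    """Remove @conf from every TextEquiv; return (new XML, number removed)."""
    root = ET.fromstring(xml_text)
    removed = 0
    for text_equiv in root.iter("{%s}TextEquiv" % PAGE_NS):
        if text_equiv.attrib.pop("conf", None) is not None:
            removed += 1
    return ET.tostring(root, encoding="unicode"), removed


cleaned, count = strip_conf(SAMPLE)
print(count, "conf attributes removed")
```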

kba commented 5 years ago

Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed...

mikegerber commented 5 years ago

I have the same problem, using ocrd-tesserocr. Workaround:

xmlstarlet ed --inplace \
  -N 'page=http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15' \
  -d '//page:TextEquiv/@conf' OCR-D-OCR-TESS/*

kba commented 5 years ago

The pertinent diff in the generated code:

-            try:
-                self.conf = float(value)
-            except ValueError as exp:
-                raise ValueError('Bad float/double attribute (conf): %s' % exp)
+            self.conf = value
+            self.validate_ConfSimpleType(self.conf)    # validate type ConfSimpleType

There is no more casting to float in the current code. Hence all of

set_conf("1")
set_conf(int(1))
set_conf(1.0)

are accepted and stored as str, int and float as-is, but only the third one is valid. I am investigating in which generateDS version between 2.30.11 and 2.33.1 this changed and whether it can be re-enabled.
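
The effect of the diff can be reproduced in isolation. The class below is a hypothetical stand-in, not the generated TextEquivType; it only mirrors the two variants from the diff and the value < 0 check visible in the traceback.

```python
class TextEquivSketch:
    """Hypothetical stand-in for the generated class, for illustration only."""

    def __init__(self):
        self.conf = None

    def validate_conf(self, value):
        # Mirrors the comparison seen in the traceback (if value < 0).
        if value < 0:
            raise ValueError("conf must not be negative")

    def build_conf_with_cast(self, raw):
        # Old behaviour: cast the attribute string to float before storing it.
        try:
            self.conf = float(raw)
        except ValueError as exp:
            raise ValueError('Bad float/double attribute (conf): %s' % exp)
        self.validate_conf(self.conf)

    def build_conf_without_cast(self, raw):
        # New behaviour: store the raw attribute string, then validate it.
        self.conf = raw
        self.validate_conf(self.conf)


old = TextEquivSketch()
old.build_conf_with_cast("0.93")
print(type(old.conf).__name__, old.conf)

new = TextEquivSketch()
try:
    new.build_conf_without_cast("0.93")
except TypeError as exc:
    print("TypeError:", exc)
```

XML attribute values always arrive as strings, so without the cast the comparison in the validator is between a str and an int, which Python 3 rejects.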

kba commented 5 years ago

Problem first appeared in the 2.31.1 release. I could not find a setting to make this configurable, so for now I'll revert generateDS to 2.30.11 and publish another beta 12 that is the same except for how the PAGE API is generated.

kba commented 4 years ago

I see lots of fixes for conversion between xsd: types and python primitives in generateDS 2.35.9. I won't update the generated code now because regressions from this are the last thing we need at the moment but we will revisit and fix this as soon as the final workshop is over.

kba commented 4 years ago

I've regenerated the PAGE API in #437 with generateDS 2.35.13 and the type issues are fixed. I've tried to recreate your initial problem with test-269.zip and could not. @bertsky Can you try #437, and/or do you have any pointers on what I should test for to avoid future regressions?

kba commented 4 years ago

@bertsky can this be closed?

bertsky commented 4 years ago

I am afraid the current version now (due to the missing NS prefix) mixes elements with prefix (unchanged from input) and without (new elements), which our validator checks fine but PageViewer rejects. Open a new issue?

bertsky commented 4 years ago

> which our validator checks fine

But in fact these are invalid, because no prefix is only allowed when you have an xmlns=DEFAULT-NS-URL in the header.

> but PageViewer rejects

PageViewer is okay with core-generated PAGE-XML when I add a default xmlns.
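
The difference is easy to see with a namespace-aware parser. A small sketch follows; the helper namespaces_used and both sample strings are made up for illustration. It lists which namespace each element ends up in.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# Made-up samples: prefixed elements mixed with an unprefixed one.
MIXED = (
    '<pc:PcGts xmlns:pc="%s"><pc:Page><TextRegion/></pc:Page></pc:PcGts>'
    % PAGE_NS
)
# Same content, but with a default xmlns added to the root element.
WITH_DEFAULT = (
    '<pc:PcGts xmlns:pc="%s" xmlns="%s"><pc:Page><TextRegion/></pc:Page></pc:PcGts>'
    % (PAGE_NS, PAGE_NS)
)


def namespaces_used(xml_text):
    """Return the set of element namespace URIs ('' means no namespace)."""
    root = ET.fromstring(xml_text)
    found = set()
    for element in root.iter():
        tag = element.tag
        if tag.startswith("{"):
            # ElementTree writes qualified names as {uri}localname.
            found.add(tag[1:].split("}", 1)[0])
        else:
            found.add("")
    return found


print(namespaces_used(MIXED))
print(namespaces_used(WITH_DEFAULT))
```

Without a default xmlns the unprefixed TextRegion is in no namespace at all, so it is not a PAGE element; with the default xmlns every element resolves to the PAGE namespace.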

bertsky commented 4 years ago

Also, I cannot revert to 2.5.1 because there have not been git tags (only GH releases) since 2.5.0 ...

bertsky commented 4 years ago

@kba Since #443 is already merged, this is urgent.

kba commented 4 years ago

> @kba Since #443 is already merged, this is urgent.

OK, I'm looking into it. Namespace prefixes be damned.

> Also, I cannot revert to 2.5.1 because there have not been git tags (only GH releases) since 2.5.0 ...

That is strange. Are you sure you did git pull --tags? Our releases are always based on a tag.

bertsky commented 4 years ago

> That is strange. Are you sure you did git pull --tags? Our releases are always based on a tag.

Oh sorry – you're right of course. I did not. (I was under the impression that they are fetched automatically, and I have to disable that via --no-tags. Turns out these are different 'kinds' of tag. Stupid git interfaces – I used to be so happy with mercurial...)

bertsky commented 4 years ago

Solved by #474 (but hopefully also upstream in generateDS some day).