Closed bertsky closed 4 years ago
This is a real showstopper. It effectively breaks all further processing of OCR results. And ocrd_tesserocr master is now dependent on b11...
NB: JPageViewer 1.3 does render the file correctly after replacing 2019 with 2018 and removing Page/@orientation.
@wrznr Have you experienced anything similar yet?
BTW, it does help to manually remove all TextEquiv/@conf.
Sorry about that, will try to fix ASAP. I updated generateDS before regenerating the page API, maybe something changed about how the @conf attribute is parsed...
I have the same problem, using ocrd-tesserocr. Workaround:
xmlstarlet ed --inplace \
-N 'page=http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15' \
-d '//page:TextEquiv/@conf' OCR-D-OCR-TESS/*
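For anyone without xmlstarlet at hand, the same workaround can be sketched in Python with only the standard library. This is a minimal sketch, not part of ocrd: the function name strip_conf and the sample document are made up for illustration, and it assumes the files use the 2019-07-15 PAGE namespace as in the command above.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def strip_conf(xml_text):
    """Return the document with every TextEquiv/@conf attribute removed.

    Hypothetical helper, not an ocrd function.
    """
    # Serialize PAGE as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", PAGE_NS)
    root = ET.fromstring(xml_text)
    for text_equiv in root.iter("{%s}TextEquiv" % PAGE_NS):
        # pop() with a default does nothing if the attribute is absent.
        text_equiv.attrib.pop("conf", None)
    return ET.tostring(root, encoding="unicode")


# Made-up sample input, only to exercise the helper.
sample = (
    '<PcGts xmlns="%s"><Page><TextRegion id="r1">'
    '<TextEquiv conf="0.93"><Unicode>foo</Unicode></TextEquiv>'
    '</TextRegion></Page></PcGts>' % PAGE_NS
)
cleaned = strip_conf(sample)
```

Like the xmlstarlet call, this throws the confidence values away, so it is only a stopgap until the parser accepts them again.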
The pertinent diff in the generated code:
- try:
- self.conf = float(value)
- except ValueError as exp:
- raise ValueError('Bad float/double attribute (conf): %s' % exp)
+ self.conf = value
+ self.validate_ConfSimpleType(self.conf) # validate type ConfSimpleType
There is no more casting to float in the current code. Hence all of
set_conf("1")
set_conf(int(1))
set_conf(1.0)
are accepted and stored as str, int and float as-is, but only the third one is valid. Investigating at which version between 2.30.11 and 2.33.1 this changed and whether it can be re-enabled.
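To make the difference concrete, here is a rough sketch of the two behaviours from the diff. The function names are hypothetical stand-ins, not the generated code itself, and is_float only assumes that the validator checks for a float instance; the real validate_ConfSimpleType may do more.

```python
def conf_with_cast(value):
    """Older behaviour per the diff: coerce to float, fail loudly otherwise."""
    try:
        return float(value)
    except ValueError as exp:
        raise ValueError('Bad float/double attribute (conf): %s' % exp)


def conf_as_is(value):
    """Newer behaviour per the diff: keep the value unchanged."""
    return value


def is_float(value):
    """Assumed stand-in for the type check done by the validator."""
    return isinstance(value, float)


inputs = ("1", int(1), 1.0)

# Type names of what ends up stored without the cast.
stored_types = [type(conf_as_is(raw)).__name__ for raw in inputs]

# Which of the stored values would pass a float-only check.
valid_as_is = [is_float(conf_as_is(raw)) for raw in inputs]

# With the cast, all three inputs end up as float.
valid_with_cast = [is_float(conf_with_cast(raw)) for raw in inputs]
```

Without the cast, only the last input passes the check. Since attribute values read from XML always arrive as strings, that is presumably why parsing a previously written file trips the validator.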
Problem first appeared in the 2.31.1 release. I could not find a setting to make this configurable, so for now I'll revert generateDS to 2.30.11 and publish another beta 12 that is the same except for how the PAGE API is generated.
I see lots of fixes for conversion between xsd: types and python primitives in generateDS 2.35.9. I won't update the generated code now because regressions from this are the last thing we need at the moment but we will revisit and fix this as soon as the final workshop is over.
I've regenerated the PAGE API in #437 with generateDS 2.35.13 and the type issues are fixed. I've tried to recreate your initial problem with test-269.zip and could not. @bertsky Can you try #437, and do you have any pointers on what I should test for to avoid future regressions?
@bertsky can this be closed?
I am afraid the current version now (due to the missing NS prefix) mixes elements with prefix (unchanged from input) and without (new elements), which our validator checks fine but PageViewer rejects. Open a new issue?
which our validator checks fine
But in fact these are invalid, because no prefix is only allowed when you have an xmlns=DEFAULT-NS-URL in the header.
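A small standard-library sketch of why the mixed form is a problem. The three documents are made-up minimal examples and the pc prefix is arbitrary. The point is that a namespace-aware parser puts an unprefixed element in no namespace at all unless a default xmlns is declared.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"

# All elements carry the prefix.
prefixed = '<pc:PcGts xmlns:pc="%s"><pc:Page/></pc:PcGts>' % PAGE_NS
# No prefixes, but a default namespace is declared on the root.
default = '<PcGts xmlns="%s"><Page/></PcGts>' % PAGE_NS
# Prefixed root, unprefixed child, no default namespace (the mixed case).
mixed = '<pc:PcGts xmlns:pc="%s"><Page/></pc:PcGts>' % PAGE_NS


def child_tags(xml_text):
    """Return the namespace-qualified tags of the root's direct children."""
    return [child.tag for child in ET.fromstring(xml_text)]


prefixed_tags = child_tags(prefixed)  # Page is in the PAGE namespace
default_tags = child_tags(default)    # same qualified name as above
mixed_tags = child_tags(mixed)        # Page is in no namespace
```

The first two resolve to the same qualified name, while in the mixed document Page is a different element as far as a schema validator is concerned.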
but PageViewer rejects
PageViewer is okay with core-generated PAGE-XML when I add a default xmlns.
Also, I cannot revert to 2.5.1 because there have not been git tags (only GH releases) since 2.5.0 ...
@kba Since #443 is already merged, this is urgent.
OK, I'm looking into it. Namespace prefixes be damned.
Also, I cannot revert to 2.5.1 because there have not been git tags (only GH releases) since 2.5.0 ...
That is strange. Are you sure you did git pull --tags? Our releases are always based on a tag.
Oh sorry – you're right of course. I did not. (I was under the impression that they are fetched automatically, and I have to disable that via --no-tags. Turns out these are different 'kinds' of tag. Stupid git interfaces – I used to be so happy with mercurial...)
Solved by #474 (but hopefully also upstream in generateDS some day).
I get a regression with 1.0.0b11: The call to page_from_file fails at ocrd_models_generateds.parse on a file previously generated by ocrd_models.ocrd_page.to_xml. (It complains in validate_ConfSimpleType that the value is a str instead of a number.)
This is what I did:
where all the OCR file grps are from a previous recognize processor in a long chain that runs through ok. See here for what the processor does.
This is what happens:
The offending PAGE-XML is OCR-D-OCR-OCRO-fraktur-BINPAGE-wolf-DESKEW-tesserocr-CLIP-RESEG-DEWARP_0001.xml.gz. It validates fine under http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15.