OCR4all / OCR4all

Provides OCR (Optical Character Recognition) services through web applications
MIT License

generated XML without text? #130

Open jbarth-ubhd opened 2 years ago

jbarth-ubhd commented 2 years ago

With this model, both with and without »Generate word level Page XML output« checked:

ocr4all/models/custom# md5sum historical_french-2020-10-14/0/*
ecd83bdb5ddcc8965b8400cfb200a063  historical_french-2020-10-14/0/0.ckpt.h5
db6795bda982c04df343aa8c96379b9b  historical_french-2020-10-14/0/0.ckpt.json
bbc30499bb6a63652e960a6d392e0e5e  historical_french-2020-10-14/0/1.ckpt.h5
e7285ce37d8e1b0b42b4a5c63c8c3659  historical_french-2020-10-14/0/1.ckpt.json
996a92e0dab35f141b5ba303e3613b04  historical_french-2020-10-14/0/2.ckpt.h5
cf6efafd4f411140045cd8d4f36bdd80  historical_french-2020-10-14/0/2.ckpt.json
d76601ace7cb43c3cb8ed46a778de965  historical_french-2020-10-14/0/3.ckpt.h5
abdf36278fa976d4663e34952b692c01  historical_french-2020-10-14/0/3.ckpt.json
579419b364bff2f70975f35dac2ca95b  historical_french-2020-10-14/0/4.ckpt.h5
b787d4ab5ef9eb7e0f400a6fda78ddc5  historical_french-2020-10-14/0/4.ckpt.json

the generated Page XML contains no text:

<!-- ocr4all/data/montfaucon1719bd2_1/processing# cat 0495.xml|sed 's/ points="[^"]*"//g;s/</\n</g'|head -20 -->
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
<Metadata>
<Creator>User123
</Creator>
<Created>2021-12-02T08:46:54
</Created>
<LastChange>2021-12-02T08:46:54
</LastChange>
</Metadata>
<Page imageFilename="0495.png" imageHeight="6073" imageWidth="3728">
<TextRegion id="r0" type="paragraph" orientation="-0.625">
<Coords/>
<TextLine id="r0_l001">
<Coords/>
</TextLine>
<TextLine id="r0_l002">
<Coords/>
</TextLine>
<TextLine id="r0_l003">
...
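
To check a whole processing directory rather than reading the XML by eye, something like the following could count the lines that have no recognized text. This is only a minimal sketch: `count_text_lines` is a hypothetical helper, not part of OCR4all or Calamari, and it assumes that recognized text would appear as a `TextEquiv`/`Unicode` child of each `TextLine`, as the PAGE schema defines it.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_text_lines(xml_data):
    """Return (total, empty): all TextLine elements, and those without text.

    Hypothetical helper for illustration. Tags are matched by local name,
    so the PAGE schema version in the namespace does not matter.
    """
    root = ET.fromstring(xml_data)
    total = 0
    empty = 0
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        total += 1
        # Only look at the line's own TextEquiv children, not word-level ones.
        has_text = any(
            local_name(child.tag) == "TextEquiv"
            and any(
                local_name(uni.tag) == "Unicode" and (uni.text or "").strip()
                for uni in child
            )
            for child in elem
        )
        if not has_text:
            empty += 1
    return total, empty
```

Reading a page file in binary mode and passing the bytes in, e.g. `count_text_lines(open("0495.xml", "rb").read())`, returns the two counts for that page. If the second number equals the first, no line on the page received any text.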
maxnth commented 2 years ago

Excuse the late reply, I totally overlooked this issue.

Was this model ensemble historical_french-2020-10-14 trained by you from scratch? I've encountered a similar issue when I didn't use enough training data for Calamari.

If that isn't the case, does this only occur when using OCR4all, or also with standalone Calamari?

jbarth-ubhd commented 2 years ago

The model is copied from here: https://github.com/Calamari-OCR/calamari_models/tree/16630e34ed77e7d6fa735c2505c82c081dbeb42a/historical_french

This is the previous "historical_french", not the latest one (which would be v5 and does not work with OCR4all):