UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License

page to hocr: cr_carea vs ocr_carea #183

Closed jbarth-ubhd closed 4 months ago

jbarth-ubhd commented 4 months ago

When converting OCR-D *.PAGE.xml to .hocr, I get two different <div> classes:

jb@pers109:/digitalisate2/PoCoTo/duerer1527--von-run-6/ocr> fgrep -h cr_carea *|sed 's/title=.*//'|sort|uniq -c
    564          <div class="cr_carea" 
    965          <div class="ocr_carea" 

jbarth-ubhd commented 4 months ago

Oops, perhaps I have an old version:

-rwxr-xr-x 1 root root 3007 Jan 18 2022 /usr/local/bin/ocr-transform

jbarth-ubhd commented 4 months ago

make all and make install complain about a missing JPageConverter 1.5.06; this helped:

root@pers16:/home/jb/ocr-fileformat/vendor# cp -a JPageConverter\ 1.5 "JPageConverter 1.5.06"

jbarth-ubhd commented 4 months ago

I ran git pull && make && make install (working around the missing JPageConverter 1.5.06, see above), but the problem persists:

jb@pers109:/home/jb/ocr-fileformat# ocr-transform --version
ocr-transform v0.6.0-11-gee488dd

jbarth-ubhd commented 4 months ago

@stweil, perhaps something is going wrong with JPageConverter (see above):

commit 63de5ae7ae0f91365d16e77e1f3bd468eb819054
    Use fixed JPageConverter 1.5.06 from UB-Mannheim

stweil commented 4 months ago

I cannot reproduce the issue:

ocr-transform page hocr vendor/page-to-alto/tests/data/OCR-D-OCR-TESS_00001.xml  | fgrep -h cr_carea | sed 's/title=.*//' | sort | uniq -c
     31          <div class="ocr_carea" 

stweil commented 4 months ago

Try git status. Are all submodules up-to-date?

jbarth-ubhd commented 4 months ago

root@pers109:/home/jb/ocr-fileformat# git status
On branch master
Your branch is up to date with 'origin/master'.

nothing to commit, working tree clean
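
A clean git status may not by itself show that a submodule was never initialized, so the submodule question can also be checked directly. The following is only a sketch: stale_submodules is a hypothetical helper name, not part of ocr-fileformat, and it relies on the status prefixes that git submodule status prints in its first column.

```shell
# Hypothetical helper: reads `git submodule status` output on stdin and
# prints the paths of submodules that are not cleanly checked out.
# Prefix '-' = not initialized, '+' = checked-out commit differs from the
# one recorded in the superproject, 'U' = merge conflicts.
stale_submodules() {
  awk '/^[-+U]/ { print $2 }'
}

# Usage, from the repository root:
#   git submodule status --recursive | stale_submodules
#   git submodule update --init --recursive   # bring listed submodules in sync
```

An empty result suggests that every submodule matches the commit recorded by the superproject.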

jbarth-ubhd commented 4 months ago

Example: https://digi.ub.uni-heidelberg.de/diglitData/v/duerer1527_-_aa.PAGE.xml

root@pers109:/dwork/ocr/duerer1527/run-6# /usr/local/bin/ocr-transform page hocr duerer1527_-_aa.PAGE.xml  |grep '"cr_'
         <div class="cr_carea" title="bbox 144 141 554 189">

jbarth-ubhd commented 4 months ago

I did rm -rf ... ; git clone ... ; make all ; make install. The problem is still there.

jbarth-ubhd commented 4 months ago

Additionally, I did a git checkout v0.6.0, but then make all complains: ... AttributeError: 'NoneType' object has no attribute 'get'

stweil commented 4 months ago

It's a feature. Image regions and graphic regions get cr_carea while text regions get ocr_carea (see code).
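
Since both classes are intentional, it can help to count them separately rather than with a single fgrep on the shared suffix. Here is a minimal sketch based on the pipeline from the first comment. count_carea_classes is a hypothetical helper name, not part of ocr-fileformat, and it assumes the class attribute is written as class="..._carea", as in the output shown above.

```shell
# Hypothetical helper: reads hOCR on stdin and prints one line per
# *_carea class in the form "<class> <count>".
count_carea_classes() {
  grep -o 'class="[a-z]*_carea"' \
    | sed 's/class="\(.*\)"/\1/' \
    | sort | uniq -c \
    | awk '{ print $2, $1 }'
}

# Usage with the command from this thread:
#   ocr-transform page hocr duerer1527_-_aa.PAGE.xml | count_carea_classes
```

With the mapping described above, the cr_carea count would correspond to image and graphic regions and the ocr_carea count to text regions.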

jbarth-ubhd commented 4 months ago

Thanks! Looked like a typo.