UB-Mannheim / ocr-fileformat

Validate and transform various OCR file formats (hOCR, ALTO, PAGE, FineReader)
https://digi.bib.uni-mannheim.de/ocr-fileformat/
MIT License
181 stars, 22 forks

Conversion to ALTO-2.0 is invalid #26

Closed by FoxKyong 8 years ago

FoxKyong commented 8 years ago

I converted a file from hOCR to ALTO 2.0. When I then ran ocr-validate on the result, I got:

mXSDFilename: /usr/local/share/ocr-fileformat/xsd/alto-2-0.xsd
mXMLFilename: /data/000144300/060.alto
/data/000144300/060.alto fails to validate because: 

cvc-pattern-valid: Value '' is not facet-valid with respect to pattern '([a-zA-Z]{1,8})(-[a-zA-Z0-9]{1,8})*' for type 'language'.
At: 1:934

I also tried converting it to other versions of ALTO, and those all failed too. That was only for testing, though, because I need version 2.0.
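
The validator's message can be checked in isolation: the pattern for the `language` type requires at least one letter, so an empty attribute value can never match. Below is a minimal sketch in Python, using the pattern copied from the error above. The helper name `is_valid_language` is hypothetical and not part of ocr-fileformat, and the sketch assumes Python's `re` syntax behaves like the XSD pattern for this expression (`fullmatch` is used because XSD patterns must match the whole value).

```python
import re

# Pattern copied verbatim from the cvc-pattern-valid error message.
LANGUAGE_PATTERN = re.compile(r"([a-zA-Z]{1,8})(-[a-zA-Z0-9]{1,8})*")


def is_valid_language(value: str) -> bool:
    """Hypothetical helper: True if value satisfies the 'language' pattern."""
    # XSD pattern facets apply to the entire value, hence fullmatch.
    return LANGUAGE_PATTERN.fullmatch(value) is not None


# The empty string is what the converted file contained, per the error.
for candidate in ["", "eng", "de-DE"]:
    print(repr(candidate), is_valid_language(candidate))
```

The empty string fails because the first group needs between one and eight letters, while values such as `eng` or `de-DE` pass.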

stweil commented 8 years ago

Which hOCR did you use for that test? Could you please add it here to allow reproducing the problem?

FoxKyong commented 8 years ago

I have attached the file (060.hocr.zip), but the same problem occurs with every hOCR file I tried to convert. The hOCR was created with Tesseract v3.04.01.

kba commented 8 years ago

Thanks for trying this, @FoxKyong, and for asking for ALTO support in tesseract.

The problem is in https://github.com/kba/hOCR-to-ALTO/. I'll look into it.

kba commented 8 years ago

The problem was with the language mapping. It should be fixed in https://github.com/kba/hOCR-to-ALTO/issues/1. Can you try

(cd vendor/hOCR-to-ALTO; git pull)

and try the transformation/validation again?

FoxKyong commented 8 years ago

I have tried it and it works. Thanks.