UB-Mannheim / AustrianNewspapers

NewsEye / READ OCR training dataset from Austrian Newspapers (1864–1911)

Question: Format of pull request #1

Open wollmers opened 4 years ago

wollmers commented 4 years ago

As I understand it, updates and corrections are applied to both the whole-page XML and the line text files.

I will check whether a quick hack to update the XML pays off. I use my own JSON format based on hOCR and add information such as the font. PAGE XML and the other popular formats are on my TODO list.

stweil commented 4 years ago

Fixing text in the PAGE XML is more difficult because a simple search+replace can also destroy the XML code. Therefore, so far only some of the fixes have been made in both the PAGE XML and the line texts.

The focus so far has been on the line texts, which are used for training, so those are more up to date.

Ideally we would have a script to get the line text updates back into the PAGE XML. Each line text appears twice in the PAGE XML because the file contains both region texts and line texts.
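To illustrate the risk, here is a minimal Python sketch. The XML fragment, the text, and both corrections are made up for illustration, and the PAGE namespace is omitted. A plain string replace can silently rename an element or produce a file that no longer parses, while an update through a parser touches only the text node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified PAGE-like fragment (no namespace, made-up text).
XML = (
    '<TextRegion id="r1"><TextLine id="l1"><TextEquiv>'
    "<Unicode>Region Wien, Müller u. Sohn</Unicode>"
    "</TextEquiv></TextLine></TextRegion>"
)


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Naive fix 1: the pattern also matches the element name "TextRegion",
# so the markup is renamed along with the text.
renamed = XML.replace("Region", "Legion")

# Naive fix 2: inserting a raw "&" leaves the file not well-formed.
broken = XML.replace("u.", "&")


def fix_text(xml_text, old, new):
    """Apply the replacement only inside <Unicode> text nodes."""
    root = ET.fromstring(xml_text)
    for node in root.iter("Unicode"):
        if node.text:
            node.text = node.text.replace(old, new)
    # The serializer escapes "&" as "&amp;" on output.
    return ET.tostring(root, encoding="unicode")


safe = fix_text(XML, "u.", "&")
print(ET.fromstring(renamed).tag, is_well_formed(broken), is_well_formed(safe))
```

The parser-based variant leaves tag names and attributes alone, at the cost of re-serializing the whole file.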

wollmers commented 4 years ago

Maybe a quick hack is possible using an XML parser and XPath. The elements in the XML have IDs, so the <TextLine> element can be located via filename and ID, then the enclosing <TextRegion> element via the ancestor axis, then its <TextEquiv>. If the region's <TextEquiv> contains the line texts in the same order (as an array of lines), it can be changed. It is not a 10-liner, more like 200+ lines of Perl, including recording the changes automatically, checking them manually, and patching the line text plus the XML.

When I have time, I will implement it in Perl and publish it on GitHub. Porting to other languages (Python, Java) should be easy. I could also provide font information, but this needs a more detailed structure at word level (theoretically at letter level). Still, replacing the current Arial and Times entries for Fraktur with the majority font of a line would be an improvement.
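The lookup described above could be sketched roughly as follows in Python with the standard library. This is not the Perl script, only an illustration under stated assumptions: the namespace URI is the 2013-07-15 PAGE schema and may differ from the one in the actual files, the region text is assumed to be the line texts joined by newlines, and the helper names `unicode_node` and `update_line` as well as the sample page are hypothetical. ElementTree has no ancestor axis, so a parent map stands in for it.

```python
import xml.etree.ElementTree as ET

# Assumption: PAGE 2013-07-15 namespace; adjust to the xmlns of the real files.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}


def unicode_node(elem):
    """Return the direct TextEquiv/Unicode child of a region or line, or None."""
    return elem.find("pc:TextEquiv/pc:Unicode", NS)


def update_line(root, line_id, new_text):
    """Patch one TextLine and its region text; return True if patched."""
    # Stand-in for the XPath ancestor axis: map every child to its parent.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    # line_id is assumed to contain no quote characters.
    line = root.find(f".//pc:TextLine[@id='{line_id}']", NS)
    if line is None:
        return False
    region = parent_of[line]
    line_node, region_node = unicode_node(line), unicode_node(region)
    if line_node is None or region_node is None:
        return False  # e.g. a region without TextEquiv
    nodes = [unicode_node(l) for l in region.findall("pc:TextLine", NS)]
    if any(n is None for n in nodes):
        return False
    # Only patch when the region text is exactly the lines in the same order.
    if (region_node.text or "").split("\n") != [n.text or "" for n in nodes]:
        return False  # leave for a manual check
    line_node.text = new_text
    region_node.text = "\n".join(n.text or "" for n in nodes)
    return True


# Made-up two-line page for demonstration.
PAGE = f"""<PcGts xmlns="{NS['pc']}"><Page>
<TextRegion id="r1">
  <TextLine id="l1"><TextEquiv><Unicode>erste Zeile</Unicode></TextEquiv></TextLine>
  <TextLine id="l2"><TextEquiv><Unicode>zweite Zeiie</Unicode></TextEquiv></TextLine>
  <TextEquiv><Unicode>erste Zeile
zweite Zeiie</Unicode></TextEquiv>
</TextRegion></Page></PcGts>"""

root = ET.fromstring(PAGE)
changed = update_line(root, "l2", "zweite Zeile")
print(changed, repr(unicode_node(root.find(".//pc:TextRegion", NS)).text))
```

Refusing to patch when the region text does not match the line order keeps the ambiguous cases for manual review instead of guessing.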

wollmers commented 4 years ago

Checking the differences between the PAGE XML and the line txt files is now complete. The update itself is only a small but critical step. Prototype quality, not flexible for other formats: check_xml.pl

Here is a run:

~/github/ocr-hw/ocr-gt-AustrianNewspapers-scripts/scripts$ time perl check_xml.pl 
ERROR $TextEquiv not defined: ONB_ibn_19110701_008.tif TextRegion_id=region_1547026084602_43
Pages total: 161
Pages different: 160
Lines total: 57642
Lines different: 28895

real    0m31.588s
user    0m19.423s
sys     0m4.824s

I am a little surprised by the ratio: 28895 different lines / 57642 total lines ≈ 50%. It would be interesting to compare against the original ONB files (before transcription). I guess they have a CER of >20%. Unfortunately, ONB has a completely inconvenient and unusable interface.

Also, one TextRegion in one page (ONB_ibn_19110701_008.tif) is empty:
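For reference, the counting behind such a summary might look like the Python sketch below. It is not how check_xml.pl works internally: the function name `count_differences`, the input layout, and the sample data are assumptions, and lines are compared as exact strings without any whitespace or Unicode normalization. The last lines redo the ratio with the figures from the run above.

```python
def count_differences(pages):
    """pages: {page_name: [(xml_line, txt_line), ...]} -> summary counts."""
    pages_different = lines_total = lines_different = 0
    for pairs in pages.values():
        # Exact string comparison; any normalization would change the counts.
        differing = sum(1 for xml_line, txt_line in pairs if xml_line != txt_line)
        lines_total += len(pairs)
        lines_different += differing
        if differing:
            pages_different += 1
    return {
        "pages_total": len(pages),
        "pages_different": pages_different,
        "lines_total": lines_total,
        "lines_different": lines_different,
    }


# Made-up sample: two pages, one of them with a corrected line.
sample = {
    "page_a.tif": [("erste Zeile", "erste Zeile"), ("zweite Zeiie", "zweite Zeile")],
    "page_b.tif": [("dritte Zeile", "dritte Zeile")],
}
summary = count_differences(sample)
print(summary)

# Ratio from the reported run: 28895 differing lines out of 57642.
ratio = 28895 / 57642
print(f"{ratio:.1%}")
```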

      <TextRegion id="region_1547026084602_43" custom="readingOrder {index:115;}">
         <Coords points="3008,4591 3047,4591 3047,4649 3008,4649"/>
      </TextRegion>

wollmers commented 4 years ago

Running against the original files from Zenodo:

~/github/ocr-hw/ocr-gt-AustrianNewspapers-scripts/scripts$ time perl check_xml.pl 
ERROR $TextEquiv not defined: ONB_ibn_19110701_008.tif TextRegion_id=region_1547026084602_43
Pages total: 161
Pages different: 160
Lines total: 57642
Lines different: 31002

real    0m32.591s
user    0m19.887s
sys     0m4.862s

That is about 2,100 more differing lines (31002 vs. 28895).

wollmers commented 4 years ago

The script can now update the XML files. Unfortunately, it reformats the XML, which bloats the diffs on the first run.

The script is available here: https://github.com/wollmers/ocr-gt-AustrianNewspapers-scripts/blob/master/scripts/update_xml.pl