OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces
MIT License

File names #26

Closed finkf closed 5 years ago

finkf commented 5 years ago

There are three files, {0006,0007,0008}.xml, that all belong to the same filegroup gt. If I run ocrd-tesserocr-recognize on the filegroup gt with the output filegroup tess, recognize looks up the files of that filegroup in the mets.xml file. If for some reason (files were not added to the workspace in numerical order?) the files are not returned in numerical order, for example 0007, 0008, 0006, recognize generates the files tess-0001.xml (0007.xml), tess-0002.xml (0008.xml) and tess-0003.xml (0006.xml).

This destroys the mapping between gt and ocr pages.

A simple solution would be to use:

self.workspace.add_file(
  ID=ID,
  file_grp=self.output_file_grp,
  basename=self.output_file_grp + '-' + os.path.basename(input_file.url),
  mimetype=MIMETYPE_PAGE,
  content=to_xml(pcgts),
)

to add the new files to the workspace.
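
To make the difference concrete, here is a minimal standalone sketch that does not touch the OCR-D API. The helper names sequential_names and basename_names are hypothetical, and the sequential scheme is only a model of the behaviour described above, not the actual ocrd_tesserocr code.

```python
import os


def sequential_names(output_file_grp, input_urls):
    # Model of the current behaviour: number outputs by the order in which
    # the input files happen to be returned (assumed, not the real code).
    return ["%s-%04d.xml" % (output_file_grp, n)
            for n, _ in enumerate(input_urls, start=1)]


def basename_names(output_file_grp, input_urls):
    # Proposed behaviour: reuse the input basename, so the output name does
    # not depend on the order of the input files.
    return [output_file_grp + "-" + os.path.basename(url)
            for url in input_urls]


# Files returned out of numerical order, as in the report above.
urls = ["gt/0007.xml", "gt/0008.xml", "gt/0006.xml"]

print(sequential_names("tess", urls))
print(basename_names("tess", urls))
```

With the basename-based scheme, gt/0007.xml always maps to tess-0007.xml, whatever position it is returned in.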

kba commented 5 years ago

Sounds reasonable. I would change the ID though, so the mapping between filename and ID is retained.
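
One possible way to keep the ID and the filename in sync is to derive both from the input basename. This is only a sketch: the helper output_id_and_basename is hypothetical, and using the basename without its extension as the ID is an assumed convention, not something the project prescribes.

```python
import os


def output_id_and_basename(output_file_grp, input_url):
    # Build the output basename from the input basename (as proposed above).
    basename = output_file_grp + "-" + os.path.basename(input_url)
    # Assumed convention: the ID is the basename without its extension.
    file_id = os.path.splitext(basename)[0]
    return file_id, basename


file_id, basename = output_id_and_basename("tess", "gt/0007.xml")
print(file_id, basename)
```

The two values could then be passed as ID and basename to self.workspace.add_file in the snippet above.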

finkf commented 5 years ago

That's OK as well.