more intuitive ID for output file, #26

OCR-D / ocrd_tesserocr

Run tesseract with the tesserocr bindings with @OCR-D's interfaces

MIT License


more intuitive ID for output file, #26 #27

Closed kba closed 5 years ago

kba commented 5 years ago

Generate the output ID and filename from the input file ID reduced to its numbers.
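The patch itself is not shown in this thread, so the following is only a minimal sketch of the idea under stated assumptions: the helper name `digits_only_id`, the underscore separator, and the file group `OCR-D-IMG-BIN2` are hypothetical and not taken from ocrd_tesserocr.

```python
import re


def digits_only_id(output_file_grp, input_file_id):
    """Hypothetical helper: build an output ID from the digits of the input ID."""
    # Collect every run of digits in the input file ID, in order.
    digits = "".join(re.findall(r"\d+", input_file_id))
    if not digits:
        # Without any digits there is nothing to derive an ID from.
        raise ValueError("no digits in input file ID: %r" % input_file_id)
    # Assumption: output file group and digits are joined with an underscore.
    return "%s_%s" % (output_file_grp, digits)


print(digits_only_id("OCR-D-OCR-TESS", "OCR-D-IMG-BIN-TESS_1234"))
# Digits inside a file group name are picked up as well, so two
# different input IDs can collapse to the same output ID:
print(digits_only_id("OCR-D-OCR-TESS", "OCR-D-IMG-BIN2_0001"))
print(digits_only_id("OCR-D-OCR-TESS", "OCR-D-IMG-BIN_20001"))
```

The last two calls return the same ID, which is the kind of ambiguity a digits-only scheme has to rule out by convention.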

@finkf

kba commented 5 years ago

"Could this lead to problems?"

Absolutely, yes.

The alternative is to not change the ID at all and accept that it gets slightly long, e.g. OCR-D-OCR-TESS_OCR-D-IMG-BIN-TESS_1234.

finkf commented 5 years ago

or maybe even better:

ID = concat_padded(self.output_file_grp, os.path.basename(input_file.url)[:-4])

Why generate IDs if output_file_grp + basename of the file without its extension is already unique?
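A self-contained sketch of this suggestion, under some assumptions: `basename_id` is a hypothetical stand-in that joins its parts with an underscore rather than calling `concat_padded`, the example paths are made up, and `os.path.splitext` replaces the `[:-4]` slice so that extensions of other lengths (such as `.tiff`) are stripped cleanly.

```python
import os


def basename_id(output_file_grp, input_url):
    """Hypothetical helper: output file group + input basename without extension."""
    # Drop the directory part, then split off whatever extension is present.
    stem, _ext = os.path.splitext(os.path.basename(input_url))
    # Assumption: the two parts are joined with an underscore.
    return "%s_%s" % (output_file_grp, stem)


print(basename_id("OCR-D-OCR-TESS", "OCR-D-IMG-BIN/page_1234.tif"))
print(basename_id("OCR-D-OCR-TESS", "OCR-D-IMG-BIN/page_1234.tiff"))
```

Both calls yield `OCR-D-OCR-TESS_page_1234`. With a fixed `[:-4]` slice, the `.tiff` case would instead leave a trailing dot in the ID. The result is only unique if the input basenames are unique, as the question above presupposes.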

bertsky commented 5 years ago

I am very much in favour of the solution by @finkf, but I would also like to keep the .xml extension, as in the old version (because most PAGE viewers rely on it). The patch does not apply anymore, so should I make a new PR?

bertsky commented 5 years ago

Closing as this is superseded (and hopefully resolved to satisfaction) by #48.