HDF5 dataset format: how to convert

Calamari-OCR / calamari

Line based ATR Engine based on OCRopy

Apache License 2.0

HDF5 dataset format: how to convert #317

Open bertsky opened 2 years ago

bertsky commented 2 years ago

I presume training on HDF5 will be more efficient than on any of the other formats. At least compared with the line GT file pairs, filesystem performance might be much better, too.

So my question is: how do I convert existing datasets into HDF5 format?

andbue commented 2 years ago

Hi Robert, at the moment there is no script that converts data from the command line. When running Cross-fold-train, the data is copied to hdf5 before the training starts; have a look here: https://github.com/Calamari-OCR/calamari/blob/3b1969bf8f2611080e99a3e361511548ac2ef4f0/calamari_ocr/ocr/training/cross_fold.py#L77-L90

For my own training, I've hacked together some lines of code at https://github.com/andbue/nashi/blob/master/ocr/nashi_ocr/nashi_client.py to save preprocessed data in a single hdf5 file, so I can re-run training and prediction without the need for preprocessing the images again. If I had the time, it would be sensible to integrate some of that into calamari, I guess.
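
To illustrate the general idea of packing line images and transcripts into one file, here is a minimal sketch using h5py and numpy directly. It is not the code from the links above: the function names and the dataset names (`images`, `images_dims`, `transcripts`, `codec`) are assumptions chosen for illustration, and the layout should be checked against what Calamari's own writer produces before it is used for training.

```python
from typing import Any, List, Sequence, Tuple


def build_codec(texts: Sequence[str]) -> List[str]:
    """Return the sorted list of distinct characters over all transcripts."""
    return sorted(set("".join(texts)))


def encode(text: str, codec: Sequence[str]) -> List[int]:
    """Map each character of a transcript to its index in the codec."""
    index = {ch: i for i, ch in enumerate(codec)}
    return [index[ch] for ch in text]


def write_lines_hdf5(path: str, samples: Sequence[Tuple[Any, str]]) -> None:
    """Store (2-D grayscale line image, transcript) pairs in one HDF5 file.

    Hypothetical layout for illustration only, not a confirmed Calamari format.
    """
    # Imported lazily so the helpers above work without h5py/numpy installed.
    import h5py
    import numpy as np

    images = [np.asarray(img, dtype=np.uint8) for img, _ in samples]
    texts = [text for _, text in samples]
    codec = build_codec(texts)

    vlen_u8 = h5py.vlen_dtype(np.dtype("uint8"))
    vlen_i32 = h5py.vlen_dtype(np.dtype("int32"))
    with h5py.File(path, "w") as f:
        # Line images differ in width, so each one is stored flattened ...
        ds_img = f.create_dataset("images", (len(images),), dtype=vlen_u8)
        # ... next to its (height, width), so it can be reshaped when read.
        dims = np.array([im.shape for im in images], dtype="int64")
        f.create_dataset("images_dims", data=dims)
        ds_txt = f.create_dataset("transcripts", (len(texts),), dtype=vlen_i32)
        # The codec is stored as Unicode code points.
        f.create_dataset("codec", data=np.array([ord(c) for c in codec], dtype="int32"))
        for i, (im, text) in enumerate(zip(images, texts)):
            ds_img[i] = im.reshape(-1)
            ds_txt[i] = np.array(encode(text, codec), dtype="int32")
```

Storing transcripts as integer indices into a per-file codec keeps the file free of variable-length strings; a reader would reverse the mapping with `chr()` on the stored code points.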

bertsky commented 2 years ago

Hi Andreas – thanks for your fast feedback!

I think I understood the writer part, but could you please fill me in on the reader side (for file pairs)? What's the minimal / best pattern to instantiate a data generator – scripts.dataset_viewer.DataWrapper perhaps?

andbue commented 2 years ago

That's where I would have started as well. Maybe start from a copy of dataset_viewer.py, set PipelineMode.EVALUATION, and write sample.inputs and sample.targets to the Hdf5DatasetWriter instead of showing them in pyplot. If I'm not totally mistaken, this should work with all kinds of datasets. Just in case you end up with something helpful for other users as well: feel free to put it in a PR!
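
For plain file pairs only, the reading side can also be sketched without Calamari's data pipeline. Note that this is a different, simpler route than the one described above: it skips Calamari's preprocessing entirely and only pairs each line image with its ground-truth text file. The helper name `iter_file_pairs` and its default suffixes are assumptions for illustration; the pairing rule (same file name up to the first dot) follows the usual `.gt.txt` convention.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_file_pairs(
    root: str, image_suffix: str = ".png", gt_suffix: str = ".gt.txt"
) -> Iterator[Tuple[Path, str]]:
    """Yield (image path, transcript) for every line image with a GT file."""
    for image_path in sorted(Path(root).rglob("*" + image_suffix)):
        # The GT file shares the name up to the first dot,
        # e.g. 0001.bin.png -> 0001.gt.txt
        stem = image_path.name.split(".", 1)[0]
        gt_path = image_path.with_name(stem + gt_suffix)
        if gt_path.is_file():
            # Images without a transcript are skipped silently.
            text = gt_path.read_text(encoding="utf-8").rstrip("\n")
            yield image_path, text
```

Each yielded image path could then be loaded with an image library of choice and handed, together with its transcript, to whichever HDF5 writer is used.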

bertsky commented 2 years ago

Understood, thanks! I'll give it a try.