Metadata for OCR models and/or OCR model training sets

wrznr commented 5 years ago

We need to define a set of metadata for OCR models including at least:

engine (inkl. version)
parameter setting for training
reference to OCR model training set
...

We need to define a set of metadata for OCR model training sets including at least:

information on the training materials
(output) character sets
license
...

wrznr commented 5 years ago

@VolkerHartmann Relevant for GT repository as well as model repository. @bertsky Relevent for post correction.

Additions to metadata entries and proposals for representation format(s) very much welcome.

VolkerHartmann commented 5 years ago

GT repository @bertsky: Which attributes will be important for the selection of GT records? I'm thinking of:

Font
Publication date
Print shop (?) (I have not seen this attribute yet but it could be helpful, couldn't it)
...

Model repository At the moment no collection (for training set) can be created and therefore not referenced. This feature is planned for future versions. Until then all pages/data have to be listed. How willl the parameter look like? (To be most generic a key-value implementation would be appropriate) information on the training materials: Part of the GT metadata. e.g. publishing date, language, fonts, ...

bertsky commented 5 years ago

@VolkerHartmann Sorry, I am not so sure what it is you are asking me for. This issue is about OCR model meta-data, and I already find the list of features for that mentioned by @wrznr in the original post sufficient for post-correction purposes. Are you actually addressing #85 here? And what does "selection of GT records" refer to (the selection of features for GT meta-data records perhaps)?

VolkerHartmann commented 5 years ago

If the list of features is sufficient, that's fine.

wrznr commented 5 years ago

@VolkerHartmann In which format can necessary metadata be sufficiently (i.e. in a formal, machine-readable way) defined?

VolkerHartmann commented 5 years ago

Most formats are easy to parse. I would prefer JSON or XML but key-value pairs are also ok if no hierarchy exists.

kba commented 5 years ago

https://github.com/kba/ocr-models/blob/master/schema/description.schema.yml

wrznr commented 5 years ago

@wrznr develops a proposal based on the above schema.

wrznr commented 5 years ago

@wrznr Push.

cneud commented 5 years ago

Just to let you know that I've been told today that PMML is the widely accepted standard to describe ML models. It is XML-based. Perhaps we can learn/borrow some things from there.

VolkerHartmann commented 5 years ago

https://github.com/kba/ocr-models/blob/master/schema/description.schema.yml format: hdf5, pyrnn pronn,...

HDF5 is a container format but not a format of the model, right? It could contain any models. Is pyrnn a widely known standard extension? I can't find any information about that. We could add PMML as a possible format.

Landing page for the model or homepage of the creator

In most cases the creator will be an algorithm.

I am missing information about the underlying font, language variants (optional) to select the appropriate model. In addition I would prefer the model as defined in PMML: (see MODEL-ELEMENT) e.g.: "NeuralNetwork" and information on which algorithms it can be used for (Ok, KRAKEN is compatible with ocropus) Are there other algorithms we could use later? I think the format defined in description.schema.yml links both.

If the model is described in PMML a consumer have to support all variants? In the future, there could be importers and exporters for different algorithms. When the time comes, we can always store the models as PMML. :-)

mittagessen commented 5 years ago

What is the status on this? I've hacked together a zenodo-based thingy that I uses the metadata schema of the old repository but that is clearly insufficient.

If we're still on the schema proposed by @kba I would suggest some additions and changes. For one adding a field pointing to a training data set (by URL or PID) is somewhat important and putting in at least a CER measurement might also be prudent.

With regards to using PMML, I'm not sure how/if it is beneficial to describe OCR models on a functional level as all engines come with their own format, effectively making the model files opaque blobs. A functional description also doesn't aid in any way in model selection/implementation matching.

wrznr commented 5 years ago

@kba @tboenig @wrznr have a meeting on this issue next week. We'll get back to you asap.

wrznr commented 5 years ago

https://github.com/kba/mollusc/blob/master/spec/training-schema.yml

mittagessen commented 5 years ago

The repository isn't public.

kba commented 5 years ago

@mittagessen See https://github.com/OCR-D/spec/pull/105

wrznr commented 5 years ago

@kba Can we involve @Doreenruirui here? She has a specification ready, right?

cneud commented 5 years ago

@wrznr @kba @Doreenruirui This is pretty progressed https://github.com/Doreenruirui/okralact/tree/master/docs, https://github.com/Doreenruirui/okralact/tree/master/engines/schemas, no?

Doreenruirui commented 5 years ago

Hi Clemens,

Yes, the schemas are designed according to the documentation of the parameters of each engine. They are mainly used to verify the parameters when a user upload a configuration file.

Best, Rui

Clemens Neudecker notifications@github.com 于2019年5月22日周三上午12:49写道：

@wrznr https://github.com/wrznr @kba https://github.com/kba @Doreenruirui https://github.com/Doreenruirui This is pretty close https://github.com/Doreenruirui/okralact/tree/master/docs, https://github.com/Doreenruirui/okralact/tree/master/engines/schemas, no?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/OCR-D/spec/issues/86?email_source=notifications&email_token=ACEQARBS6N5CAYONHECKUL3PWR36XA5CNFSM4F3GHRJKYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODV5M3XQ#issuecomment-494587358, or mute the thread https://github.com/notifications/unsubscribe-auth/ACEQARGIDZ42DZDJTY73GGTPWR36XANCNFSM4F3GHRJA .

kba commented 5 years ago

See https://github.com/Calamari-OCR/calamari/blob/master/calamari_ocr/ocr/datasets/dataset.py for the base class of datasets (image+transcription tuples) in calamari

mittagessen commented 4 years ago

I would like to restart the discussion on this as I've got a scalable-ish model repository working but the metadata schema used right now is insufficient powerful (both for print and manuscripts). The current state is here. It is already designed in a way to support multiple recognition engines through a free-text field in a searchable property. Each engine would define their own identifiers, ideally with different suffixes for functionally different model types, so multi- or cross-engine software would be able to effectively filter for supported models.

Currently, there are two requirements missing:

proper automatic model selection support
reproducibility

My suggestion is to incorporate an opaque blob that encapsulates hyperparameters in a way that OCR engines or a third party software like okralact can re-instantiate a model from scratch. This allows us

For automatic model selection there should be the ability to encode script (already in there), transcription levels, some kind of validation/test loss/error curve(s), and references directly to training data (if publicly available) or at least source material. To incorporate the methods the FAU team have developed we should also incorporate some kind of global script type embedding. It might be advisable to allow multiple of these, as the FAU system is currently fairly specific to the material OCR-D concerns itself with while other people might have more specific embeddings.

OCR-D / spec

Metadata for OCR models and/or OCR model training sets #86