OCR-D / zenhub

Repo for developing zenhub integration
Apache License 2.0

Create template repo #73

Open krvoigt opened 2 years ago

krvoigt commented 2 years ago

Current situation. There is no flexible and adequate repository for storing ground truth (GT) data. Collaborative creation, control and versioning in particular are very difficult to realize with the current solution.

How things should be. Storing ground truth data is not just a matter of storing files. Ground truth data consists of the digital originals (digital copies, images) and the transcription data created for them. The digital copies are larger in volume than the transcriptions. It is therefore to be expected that providers offer GT data without the digital copies and only reference them by URL or URN. For permanent storage, however, the digital copies should also be kept in a repository in the long term. The following requirements therefore have to be met:

Steps

Preliminary work. Several projects currently store GT data in individual GitHub repositories, and practice varies. Some datasets have no metadata. Some are described only verbally in a README. For others, articles describe how the data was created and handled within a project. Some GT data is stored in Zenodo. In addition, the project HTR-United (https://github.com/HTR-United/htr-united.github.io) provides solutions for GT data for handwriting recognition, and the project OCR & Ground Truth Resources (https://github.com/cneud/ocr-gt) lists GT repositories and datasets.

tboenig commented 2 years ago

The following tools are proposed and provided for the solution of a GT repository:

Data analysis, transfer of digitized data, creation of missing METS metadata records, creation of bagits.

Metadata schema, data form for capturing GT data collections or corpora.

Templates for a GitHub repository for storing GT data, with integrated data analysis, transfer of missing digital images, and creation of METS metadata records and bagits. The bagits are provided as versioned releases.
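
To make the "creation of bagits" step more concrete, here is a minimal sketch of packaging a GT directory as a BagIt bag, using only the Python standard library. It is not the tooling proposed in this issue. The function names (make_minimal_bag, demo) and the example file names are hypothetical. The sketch assumes only the base BagIt 1.0 layout (a bagit.txt declaration, a data/ payload directory and a SHA-256 payload manifest) and ignores any OCR-D-specific profile or additional tag files.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Copy every file under source_dir into a minimal BagIt 1.0 bag at bag_dir.

    Hypothetical helper for illustration; bag_dir must not exist yet and
    must not be located inside source_dir.
    """
    payload_dir = bag_dir / "data"
    payload_dir.mkdir(parents=True)

    manifest_lines = []
    # Sort the files so the manifest order is deterministic.
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        rel = src.relative_to(source_dir)
        dest = payload_dir / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_bytes(src.read_bytes())
        # One manifest line per payload file: "<checksum>  <path in bag>".
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{rel.as_posix()}")

    # Bag declaration required by the BagIt specification.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


def demo() -> list:
    """Bag two placeholder files and return the manifest lines."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "gt"
        (src / "page").mkdir(parents=True)
        # Placeholder contents, not real METS or PAGE documents.
        (src / "mets.xml").write_text("<mets/>", encoding="utf-8")
        (src / "page" / "0001.xml").write_text("<PcGts/>", encoding="utf-8")
        bag = make_minimal_bag(src, Path(tmp) / "bag")
        manifest = bag / "manifest-sha256.txt"
        return manifest.read_text(encoding="utf-8").splitlines()
```

A bag directory produced this way could then be archived and attached to a versioned release. How releases are cut and named is left open here, since the issue does not specify it.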

krvoigt commented 2 years ago

@tdoan2010 review