HTR-United / schema

Repository for schema related business
Mozilla Public License 2.0

Let's start thinking about how to document models #16

Open alix-tz opened 2 years ago

alix-tz commented 2 years ago

See: https://github.com/HTR-United/htr-united/issues/91

Off the top of my head, properties should include (* : required):

  • Title*
  • Description*
  • Software (Name, Link, Version)*
  • DOI Link*
  • Project
  • Authors
  • Used datasets
  • Manuscript / Print / Both (Simpler than what we have for dataset)*
  • Languages*
  • Scripts*
  • Known characters
  • License*
  • Encoding*
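
As a rough illustration of the list above, a model record could be checked for its required properties as follows. This is only a sketch: the key names (`title`, `software`, `doi`, `material`, ...) and the `missing_required` helper are hypothetical and not part of any existing HTR-United schema, and the values in `example` are placeholders.

```python
# Hypothetical key names derived from the property list above;
# none of them is defined by an existing HTR-United schema.
REQUIRED = {
    "title", "description", "software", "doi",
    "material", "languages", "scripts", "license", "encoding",
}
OPTIONAL = {"project", "authors", "used_datasets", "known_characters"}


def missing_required(record: dict) -> list[str]:
    """Return the required properties that are absent or empty, sorted by name."""
    return sorted(key for key in REQUIRED if not record.get(key))


# Placeholder record, for illustration only.
example = {
    "title": "Example model",
    "description": "Placeholder record used only for illustration.",
    "software": {"name": "Kraken", "link": "https://example.org", "version": "0.0"},
    "material": "manuscript",  # manuscript / print / both
    "languages": ["fra"],
    "scripts": ["Latn"],
}

print(missing_required(example))  # ['doi', 'encoding', 'license']
```

Keeping the required and optional sets separate would make it easy to relax individual requirements later, for example if some properties cannot be filled for every software.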

alix-tz commented 2 years ago

I think the software should be one of the first things to appear, because if I'm using Transkribus, I won't care that model X or Y is able to handle French if they are Kraken models.

Now that raises an important question: given that Transkribus already provides a page listing public transcription models (https://readcoop.eu/transkribus/public-models/), do we want to also cover Transkribus models?

Personally, I would lean in favor of it[^why], but it makes things a little more complicated: for example, License, Encoding and DOI[^doi] might be impossible to fill in for Transkribus models.

[^why]: Because 1) it might attract Transkribus users who didn't think of sharing their data/ground truth, 2) users might choose a software depending on the availability of models, 3) we can do better than the current metadata used by Transkribus.

[^doi]: No DOI in Transkribus but models do have a unique ID.

mittagessen commented 2 years ago

Sorry for only starting to participate now. Something that is rather important is a field that indicates the type of model, e.g. transcription, segmentation, reading order, ... in addition to the software so it is possible to filter according to what one is actually looking for without having to download individual models. That would probably require changing the semantics of the known characters field to something like possible outputs.
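
To make the filtering idea concrete, here is a minimal sketch of selecting models from metadata alone, without downloading them. The key names `software`, `model_type` and `possible_outputs`, the `find_models` helper, and all catalogue entries are assumptions made up for this example.

```python
# Hypothetical catalogue entries; `software`, `model_type` and
# `possible_outputs` are illustrative key names, not an agreed schema.
catalogue = [
    {"title": "A", "software": "Kraken", "model_type": "transcription",
     "possible_outputs": ["a", "b", "c"]},
    {"title": "B", "software": "Kraken", "model_type": "segmentation",
     "possible_outputs": ["text line", "main zone"]},
    {"title": "C", "software": "Transkribus", "model_type": "transcription",
     "possible_outputs": ["a", "b"]},
]


def find_models(entries, software, model_type):
    """Select entries by software and model type using metadata only."""
    return [
        e for e in entries
        if e["software"] == software and e["model_type"] == model_type
    ]


print([e["title"] for e in find_models(catalogue, "Kraken", "segmentation")])  # ['B']
```

With a generic `possible_outputs` field, a transcription model would list characters while a segmentation model would list region or line types, so the same property works for every model type.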

As @PonteIneptique correctly identified, models are somewhat ephemeral. In my opinion we should at least provide guidelines on how to deal with that. One (not particularly well thought out) way could be to treat the record/DOI as a 'prototype' model for the given dataset(s) and a particular software, and to publish replacement models, e.g. a tweaked architecture improving performance, as a version linked to that original model instead of creating a completely new record. This is primarily to reduce the noise level in any model repository but might have some other benefits as well, such as incentivizing early publication of models.
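
One possible reading of the 'prototype' idea, sketched under assumptions: a single record per dataset(s)/software pair, with replacement models appended as versions. The class names, field names and the DOI value below are hypothetical placeholders, not an existing repository API.

```python
from dataclasses import dataclass, field


@dataclass
class ModelVersion:
    label: str  # version label, e.g. "1.1"
    note: str   # what changed, e.g. a tweaked architecture


@dataclass
class ModelRecord:
    """One record (and DOI) per dataset(s)/software pair; names are hypothetical."""
    doi: str
    software: str
    datasets: list[str]
    versions: list[ModelVersion] = field(default_factory=list)

    def publish(self, label: str, note: str) -> None:
        # A replacement model is appended as a version of the same record
        # instead of opening a completely new record.
        self.versions.append(ModelVersion(label, note))

    def latest(self) -> ModelVersion:
        return self.versions[-1]


# Placeholder values, for illustration only.
record = ModelRecord(doi="10.0000/placeholder", software="Kraken", datasets=["gt-a"])
record.publish("1.0", "initial model")
record.publish("1.1", "tweaked architecture, better performance")
print(record.latest().label)  # 1.1
```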

alix-tz commented 2 years ago

Ah, your comment reminds me that we should probably include a "date of creation" property!

tboenig commented 1 year ago

Hello to All,

Unfortunately, I could not participate in the discussion earlier, and I would now like to continue it. If I understood everything correctly, there should be two schemas: one for ground truth (GT) and one for models.

Both schemas are strongly related to each other in terms of content but have special features.

The schema for GT is currently stable; the schema for models is under development.

My proposal for describing model metadata was always based on the GT. An example scenario: a GT is created and described with metadata; a model is then trained with this GT, and the model is recorded in that metadata record.

Of course, there are other scenarios:

  1. I use one very specific GT and create only one model, or
  2. I combine several GTs, merge them into one GT, and create a model from it.

In the first case, there should be a link between the model and the GT. In the second case, I would consider the merged GT to be a new GT, which

  1. gets its own independent metadata record plus a model metadata record,
  2. but whose standalone metadata record notes that it was created on the basis of GT ....

For now I have expressed all of this in natural language, since I assume that the formal specification can then be written more easily.
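
One possible way to formalize the two scenarios, as a sketch only: each GT record lists the GTs it is based on, and each model points to exactly one GT record. The key names `id`, `based_on` and `trained_on`, the record identifiers, and the `source_gts` helper are all hypothetical.

```python
# Hypothetical record layout; `id`, `based_on` and `trained_on` are
# illustrative key names, not part of an existing schema.
gt_a = {"id": "gt-a", "based_on": []}
gt_b = {"id": "gt-b", "based_on": []}

# Scenario 1: one specific GT, one model linked directly to it.
model_1 = {"id": "model-1", "trained_on": "gt-a"}

# Scenario 2: several GTs merged into a new GT with its own record,
# which notes the GTs it was created from; the model points to the merged GT.
gt_merged = {"id": "gt-merged", "based_on": ["gt-a", "gt-b"]}
model_2 = {"id": "model-2", "trained_on": "gt-merged"}

gt_records = {r["id"]: r for r in (gt_a, gt_b, gt_merged)}


def source_gts(model: dict) -> list[str]:
    """Resolve a model to the original GT ids it ultimately derives from."""
    def expand(gt_id: str) -> list[str]:
        parents = gt_records[gt_id]["based_on"]
        if not parents:
            return [gt_id]
        return [leaf for p in parents for leaf in expand(p)]
    return expand(model["trained_on"])


print(source_gts(model_1))  # ['gt-a']
print(source_gts(model_2))  # ['gt-a', 'gt-b']
```

Because the model only references one GT record, the provenance of a merged GT stays in the GT metadata and does not have to be repeated in every model record.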