alix-tz opened this issue 2 years ago (status: Open)
I think the software should be one of the first things to appear, because if I'm using Transkribus, I won't care that model X or Y can handle French if they are Kraken models.
Now that raises an important question: given that Transkribus already provides a page listing public transcription models (https://readcoop.eu/transkribus/public-models/), do we want to also cover Transkribus models?
Personally, I would lean in favor of it[^why], but it makes things a little more complicated: for example, License, Encoding and DOI[^doi] might be impossible to fill in for Transkribus models.
[^why]: Because 1) it might attract Transkribus users who didn't think of sharing their data/ground truth, 2) users might choose a software depending on the availability of models, 3) we can do better than the current metadata used by Transkribus.
[^doi]: No DOI in Transkribus but models do have a unique ID.
Sorry for only starting to participate now. Something that is rather important is a field that indicates the type of model, e.g. transcription, segmentation, reading order, ... in addition to the software, so it is possible to filter according to what one is actually looking for without having to download individual models. That would probably require changing the semantics of the `known characters` field to something like `possible outputs`.
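To make the filtering idea concrete, here is a minimal sketch. It assumes a hypothetical catalog record with `software`, `model_type` and `possible_outputs` fields; none of these names or values come from an agreed schema, they are only illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class ModelRecord:
    """Hypothetical catalog entry; field names are illustrative only."""
    name: str
    software: str      # e.g. "Kraken" or "Transkribus"
    model_type: str    # e.g. "transcription", "segmentation", "reading order"
    # Generalizes "known characters": characters for a transcription model,
    # zone or line types for a segmentation model.
    possible_outputs: list[str] = field(default_factory=list)


def filter_models(records, software=None, model_type=None):
    """Return the records matching the given software and/or model type."""
    return [
        r for r in records
        if (software is None or r.software == software)
        and (model_type is None or r.model_type == model_type)
    ]


# Made-up entries, only to exercise the filter.
catalog = [
    ModelRecord("model-a", "Kraken", "transcription", ["a", "b", "é"]),
    ModelRecord("model-b", "Kraken", "segmentation", ["text line", "main zone"]),
    ModelRecord("model-c", "Transkribus", "transcription", ["a", "b"]),
]

kraken_transcription = filter_models(
    catalog, software="Kraken", model_type="transcription"
)
print([r.name for r in kraken_transcription])
```

With both fields in the record, a user can narrow the list down from metadata alone, without downloading any model file.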
As @PonteIneptique correctly identified, models are somewhat ephemeral. In my opinion we should at least provide guidelines on how to deal with that. One (not particularly well thought out) way could be to treat the record/DOI as a 'prototype' model for a given dataset (or datasets) and a particular software, and to publish replacement models, e.g. a tweaked architecture that improves performance, as versions linked to that original model instead of creating a completely new record. This is primarily to reduce the noise level in any model repository, but it might have other benefits as well, such as incentivizing early publication of models.
Ah, your comment reminds me that we should probably include a "date of creation" property!
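One possible reading of the 'prototype' idea, as a rough sketch: a single record per dataset(s) and software combination, with replacement models attached as dated versions. The class names, fields and identifiers below are hypothetical and not part of any existing schema.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class ModelVersion:
    """One published model attached to a prototype record (hypothetical)."""
    version: str
    created: date
    note: str = ""


@dataclass
class PrototypeRecord:
    """Hypothetical record: one per dataset(s) and software combination."""
    record_id: str        # placeholder for a DOI or another unique ID
    software: str
    datasets: list[str]
    versions: list[ModelVersion] = field(default_factory=list)

    def publish(self, version, created, note=""):
        """Attach a replacement model instead of creating a new record."""
        self.versions.append(ModelVersion(version, created, note))

    def latest(self):
        """Return the most recently created version (needs at least one)."""
        return max(self.versions, key=lambda v: v.created)


# Made-up identifiers and dates, for illustration only.
record = PrototypeRecord("example-record-1", "Kraken", ["example-dataset"])
record.publish("1.0", date(2021, 1, 10), "initial model")
record.publish("1.1", date(2021, 6, 2), "tweaked architecture")
print(record.latest().version)
```

A repository listing would then show one entry per prototype record and resolve it to its latest version, which keeps superseded models reachable without cluttering the list.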
Hello all,
Unfortunately I could not participate in the discussion so far, but I would like to pick it up now. If I understood everything correctly, there should be
The two schemas are closely related in content, but each has its own specific features.
The schema for GT is currently stable; the schema for models is still under development.
My proposal for describing model metadata has always been based on the GT. An example scenario: GT is created and described with metadata. A model is then trained on this GT, and the model is recorded in the GT's metadata record.
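The example scenario could look roughly like the sketch below, where a model is recorded inside the metadata record of the GT it was trained on. All keys, names and the `register_model` helper are hypothetical and do not reflect the actual GT schema.

```python
# Hypothetical GT metadata record; keys are illustrative, not the real schema.
gt_record = {
    "id": "example-gt",
    "title": "Example ground truth",
    "models": [],
}


def register_model(gt, model_name, software):
    """Record a model in the metadata of the GT it was trained on."""
    entry = {
        "name": model_name,
        "software": software,
        "trained_on": gt["id"],  # explicit link back to the GT
    }
    gt["models"].append(entry)
    return entry


register_model(gt_record, "example-model", "Kraken")
print(gt_record["models"])
```

Storing the GT identifier in the model entry keeps the link usable even if the model description is later moved into a record of its own.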
Now, of course, there are other scenarios: I use
In the first case there should be a connection between model and GT. In the second case, I would think that it is actually new GT, which is
I have described all of this in natural language first, since I assume the formal specification will then be easier to write.
See: https://github.com/HTR-United/htr-united/issues/91
an example provided by @tboenig : https://tboenig.github.io/gt-metadata/document-your-gt.html (it ties the description of the model to the description of the dataset)
a proposal from @PonteIneptique :