Should we start thinking about a description system for HTR/OCR models?

alix-tz commented 1 year ago

If we do so, I think it would be interesting to bring @mittagessen into the conversation.

Should we do this in schema?

PonteIneptique commented 1 year ago

I am leaning a bit against it, to be honest, but I understand that people feel the need for it.

Yes, if we have this kind of feature, it should be done in schema :)

alix-tz commented 1 year ago

Can you quickly explain the reasoning against it? 👀

PonteIneptique commented 1 year ago

Models evolve and are not sustainable long term, unlike data.

A software change could easily break a model, or new versions of a model could simply come out every few months, making curation a nightmare. Sure, datasets might get updated after they are cataloged, but they remain "viable": generally, the format and the guidelines do not change. Our statistics might one day become inaccurate, but we would still point to datasets that can be used.

On the other hand, cataloging models also feels mostly like a Kraken task. By saying that, I do not mean this is the job of Ben or anyone else. What I mean is that you cannot share models for Transkribus (so what's the point of cataloging them?), and other software has its own quirks, versions, and so on. So it feels like we would mostly have a catalog of Kraken models, or we would have to prepare for vastly more complex situations and figure out what can be recorded and what cannot (if I develop a notebook with my own system and create a model, should I be able to record my model in the HTR-United catalog?).

However, I see the point of better advertising models that have been trained, as not many people try the kraken download command (if I remember the command correctly) and, AFAIK, eScriptorium does not provide description fields or public models in its UI. But this would again be very Kraken-oriented (maybe with a little bit of Calamari?).

tboenig commented 1 year ago

Models evolve and are not sustainable long term, unlike data.

That's right, but only metadata is recorded. Users of the GT might also be interested in models. By recording data about models, we give users the opportunity to use or extend them, or simply to take note of them. The models, and the software with which they can be used, will keep developing; in general, this is clear to everybody. Another aspect: training requires IT infrastructure and energy, and these should be used carefully by everyone. See my suggestion: https://tboenig.github.io/gt-metadata/document-your-gt.html

PonteIneptique commented 1 year ago

@tboenig I like what you did.

I still think that if we want to document models, we should separate them from the Dataset records (and have Model records). First, because some models aggregate several datasets (e.g. CREMMA Medieval, CREMMA Manu McFrench, GalliCorpora models). Second, because they are going to be out of date quickly.

BUT, I think it would be great to reuse our catalog of records to allow people to link their model to multiple datasets (lots of work in terms of UI, but generally doable).

Off the top of my head, properties should include (bold required):

Title
Description
DOI Link
Project
Authors
Used datasets
Manuscript / Print / Both (Simpler than what we have for datasets)
Software (Name, Link, Version)
Languages
Scripts
Known characters
License
Encoding
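
As a rough illustration of the property list above, a Model record kept separate from Dataset records could be sketched in Python as follows. This is only a sketch under assumptions: the class names, field names, and dataset identifiers are hypothetical, nothing here is part of the HTR-United schema, and which properties count as required is left as a parameter rather than decided.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional, Set


@dataclass
class Software:
    """Software the model runs with (Name, Link, Version)."""
    name: str
    link: Optional[str] = None
    version: Optional[str] = None


@dataclass
class ModelRecord:
    """Hypothetical model record; field names mirror the list above."""
    title: Optional[str] = None
    description: Optional[str] = None
    doi_link: Optional[str] = None
    project: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    # Identifiers of existing catalog dataset records (assumed format),
    # so that one model can link to several datasets.
    used_datasets: List[str] = field(default_factory=list)
    material: Optional[str] = None  # "manuscript", "print" or "both"
    software: Optional[Software] = None
    languages: List[str] = field(default_factory=list)
    scripts: List[str] = field(default_factory=list)
    known_characters: List[str] = field(default_factory=list)
    license: Optional[str] = None
    encoding: Optional[str] = None


def missing_fields(record: ModelRecord, required: Set[str]) -> List[str]:
    """Return the names in `required` that are empty on `record`."""
    known = {f.name for f in fields(record)}
    unknown = set(required) - known
    if unknown:
        raise ValueError(f"unknown field names: {sorted(unknown)}")
    return sorted(name for name in required if not getattr(record, name))


# Example with made-up values: a model trained on two catalog datasets.
record = ModelRecord(
    title="Example model",
    used_datasets=["dataset-a", "dataset-b"],
    material="manuscript",
    software=Software(name="Kraken"),
)
print(missing_fields(record, {"title", "license", "software"}))  # ['license']
```

Keeping `used_datasets` as a list of identifiers is one way to reuse the existing catalog records without copying dataset metadata into the model record.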

alix-tz commented 1 year ago

I suggest we move this discussion to https://github.com/HTR-United/schema!

HTR-United / htr-united

Should we start thinking about a description system for HTR/OCR models? #91