Minor editions for the paper

PonteIneptique commented 1 year ago

Paper itself

[x] Explain the parameters with which the model was trained
[x] One minor suggested correction: would "axe" on p. 7 be better as "vertical axis"? To me at least, "axis" seems better.
[x] Explain the reuse potential:
- [x] Reviewer A asks us to show how these data have already been used and how they can be used. Cite the DH2019 paper and look for other similar papers
- [x] I don't see what else to do, but my first approach does not really expand on "how or whether scholars who are not immediately connected to the project might use this model."

Data

[x] Improve the documentation and presentation of the dataset on Zenodo
- [x] Documentation is nonexistent: import the README.md from GitHub
- [x] Improve said documentation with a link to the Biblissima registry (that's a bonus, but sounds like a good idea)
- [x] Talk about the paper in the presentation of the dataset
[x] Provide .txt exports on Zenodo / GitHub

PonteIneptique commented 1 year ago

I think this addresses everything from the peer review. @alix-tz and @malamatenia, tell me what you think? I can take care of the parameters, the Zenodo README import and Biblissima, as well as the text export. For the rest, can I count on one of you?

PonteIneptique commented 1 year ago

What can be done with HTRed documents:

Speed-up transcription [Mostly "non-technical" people]
- https://ciham.msh-lse.fr/node/2035
- https://www.unige.ch/c7s/
Transcription for data mining with text studies as a goal
- https://academic.oup.com/dsh/article/36/Supplement_2/ii49/6421789
Transcription for data mining with writing practice studies as a goal
- https://shs.hal.science/halshs-03560918v1/bibtex
- https://shs.hal.science/halshs-01778620v1/bibtex

malamatenia commented 1 year ago

Here's a thought dump for the reuse potential:

Some of the same data are already used in https://deepai.org/publication/open-source-handwritten-text-recognition-on-medieval-manuscripts-using-mixed-models-and-document-specific-finetuning for mixed models trained on Gothic and Bastarda cursives.
Further/future use of data (Segmentation and Transcription GT):

Line-level segmentation and transcription of manuscripts (especially with mention of the script) can be used for analytical purposes, such as classifying scripts by examining their morphological features and confronting the results with palaeographical doctrine: such is the purpose of the CreMe project https://oriflamms.hypotheses.org/1885 (with useful bibliography). The CLaMM (Classification of Latin Medieval Manuscripts) corpus and the various Competitions on the Classification of Medieval Handwritings in Latin Script have already shown the potential of such uses: e.g. https://clamm.irht.cnrs.fr/icdar-2017/icdar2017-clamm/ and notably https://github.com/mikekestemont/DeepScript

Segmentation GT can also be used, for example, to fine-tune off-the-shelf deep learning (unsupervised) segmentation approaches for historical documents such as "docExtractor": https://arxiv.org/pdf/2012.08191.pdf (or other pre-trained segmentation models, cf. the Gallicorpora ones that use SegmOnto)

Use of metadata:

Non-document-specific HTRed manuscripts rich in metadata such as line density, chronology, provenance and genre allow the creation (alongside other publicly available data) of "subcorpora" to test palaeographic theories, such as links between scripts and specific genres. One case study: the ECMEN corpus of dated manuscript samples in Old French of different genres, https://github.com/oriflamms/ECMEN, curated by Stutzmann et al. at the IRHT, shows the potential of exploiting a "[...] corpus representative of medieval written production, taking into account the various factors that may influence it (chronology, geography, production context, textual typology and language)" (https://www.irht.cnrs.fr/fr/recherche/les-programmes-de-recherche/ecmen). An equivalent corpus in Latin can be used to test scribal/writing practices (e.g. use of abbreviations) against Stutzmann's statistical results for comparative purposes.

An extra "Medii Aevi" feature that could eventually be exploited is the SegmOnto annotation of marginal vs. main zones, rubrics and interlinear information (with augmentation of the dataset) to test theories of text/paratext script differentiation (cf. Bischoff) by isolating these lines via their tags.
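As a rough sketch of what "isolating these lines via their tags" could look like: the snippet below assumes ALTO 4 files in which each `TextBlock` points, through `TAGREFS`, to an `OtherTag` whose `LABEL` holds a SegmOnto zone name. The sample XML, its contents and the function name `lines_by_zone` are made up for illustration and are not taken from the dataset; the actual export layout should be checked before reusing this.

```python
import xml.etree.ElementTree as ET

# Hypothetical ALTO 4 fragment, for illustration only (not from the dataset).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone"/>
    <OtherTag ID="BT2" LABEL="MarginTextZone"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1">
      <TextLine ID="l1"><String CONTENT="In principio erat uerbum"/></TextLine>
    </TextBlock>
    <TextBlock ID="b2" TAGREFS="BT2">
      <TextLine ID="l2"><String CONTENT="nota bene"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""

NS = {"a": "http://www.loc.gov/standards/alto/ns-v4#"}


def lines_by_zone(alto_xml):
    """Group transcribed lines by the label of the zone that contains them."""
    root = ET.fromstring(alto_xml)
    # Map tag IDs (e.g. "BT1") to their labels (e.g. "MainZone").
    labels = {t.get("ID"): t.get("LABEL") for t in root.iterfind(".//a:Tags/*", NS)}
    grouped = {}
    for block in root.iterfind(".//a:TextBlock", NS):
        refs = (block.get("TAGREFS") or "").split()
        # Assumption: the first referenced tag is the zone type.
        zone = labels.get(refs[0], "Untagged") if refs else "Untagged"
        for line in block.iterfind("a:TextLine", NS):
            text = " ".join(s.get("CONTENT", "") for s in line.iterfind("a:String", NS))
            grouped.setdefault(zone, []).append(text)
    return grouped


if __name__ == "__main__":
    for zone, lines in lines_by_zone(SAMPLE_ALTO).items():
        print(zone, lines)
```

Grouping at the zone level keeps the sketch short; the same lookup on `TextLine` tags would be needed to separate rubrics or interlinear lines, if those are annotated at line level.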

Use of the model (other than the fact that pre-trained models consume less energy and data):

Eventually, fine-tuning of more robust models for bilingual texts in Latin and vernacular languages, even though I feel this is already a thing.

alix-tz commented 1 year ago

I'm late to the party, but I just reread the article after your corrections and it all looks good to me!

PonteIneptique commented 1 year ago

I'll submit it later today, then ;)

HTR-United / CREMMA-Medieval-LAT

Minor editions for the paper #11
