Closed PonteIneptique closed 1 year ago
I think this addresses everything from the peer review, @alix-tz and @malamatenia. Tell me what you think? I can take care of parameters, the Zenodo README import and Biblissima, as well as text export. For the rest, can I count on one of you?
What can be done with HTRed documents:
Here's a thought dump on the reuse potential:
Some common data are already used in https://deepai.org/publication/open-source-handwritten-text-recognition-on-medieval-manuscripts-using-mixed-models-and-document-specific-finetuning for mixed models trained on Gothic and Bastarda cursives.
Further/future use of data (Segmentation and Transcription GT):
Line-level segmentation and transcription of manuscripts (especially with the script mentioned) can be used for analytical purposes, such as classifying scripts by examining morphological features and confronting the results with palaeographical doctrine. This is the purpose of the CreMe project, https://oriflamms.hypotheses.org/1885 (with useful bibliography). The CLaMM (Classification of Latin Medieval Manuscripts) corpus and the various Competitions on the Classification of Medieval Handwritings in Latin Script have already shown the potential of such uses, e.g. https://clamm.irht.cnrs.fr/icdar-2017/icdar2017-clamm/ and notably https://github.com/mikekestemont/DeepScript
Segmentation GT can also be used, for example, to fine-tune off-the-shelf deep learning (unsupervised) segmentation approaches for historical documents such as "docExtractor": https://arxiv.org/pdf/2012.08191.pdf (or other pre-trained segmentation models, cf. the Gallicorpora ones that use SegmOnto).
Non-document-specific HTRed manuscripts rich in metadata such as line density, chronology, provenance and genre allow the creation (alongside other publicly available data) of "subcorpora" to test palaeographic theories, such as links between scripts and specific genres. One case study: the ECMEN corpus of dated manuscript samples in Old French of different genres, https://github.com/oriflamms/ECMEN, curated by Stutzmann et al. at the IRHT, shows the potential of exploiting a "[...] corpus representative of medieval written production, taking into account the different factors that can influence it (chronology, geography, context of production, textual typology and language)" (https://www.irht.cnrs.fr/fr/recherche/les-programmes-de-recherche/ecmen). An equivalent corpus in Latin could be used to test scribal/writing practices (e.g. use of abbreviations) against Stutzmann's statistical results for comparative purposes.
An extra "Medii Aevi" feature that could eventually be exploited is the SegmOnto annotation of marginal vs. main zones, rubrics and interlinear information (with augmentation of the dataset) to test theories of text/paratext script differentiation (cf. Bischoff) by isolating these lines via their tags.
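To make the "isolating these lines via their tags" idea concrete, here is a minimal sketch, not the project's actual tooling. It assumes the ground truth is exported as ALTO XML where zone and line types are declared as `OtherTag` entries and referenced through `TAGREFS`, and that the labels follow SegmOnto names such as `MainZone`, `MarginTextZone` and `InterlinearLine`. The sample document and the `lines_by_tag` helper are hypothetical and only there for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written ALTO fragment (not taken from the dataset).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Tags>
    <OtherTag ID="BT1" LABEL="MainZone"/>
    <OtherTag ID="BT2" LABEL="MarginTextZone"/>
    <OtherTag ID="LT1" LABEL="DefaultLine"/>
    <OtherTag ID="LT2" LABEL="InterlinearLine"/>
  </Tags>
  <Layout><Page><PrintSpace>
    <TextBlock ID="b1" TAGREFS="BT1">
      <TextLine ID="l1" TAGREFS="LT1"><String CONTENT="main text line"/></TextLine>
      <TextLine ID="l2" TAGREFS="LT2"><String CONTENT="gloss between lines"/></TextLine>
    </TextBlock>
    <TextBlock ID="b2" TAGREFS="BT2">
      <TextLine ID="l3" TAGREFS="LT1"><String CONTENT="marginal note"/></TextLine>
    </TextBlock>
  </PrintSpace></Page></Layout>
</alto>"""


def local(tag):
    """Strip the XML namespace so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def lines_by_tag(alto_xml, zone_label=None, line_label=None):
    """Return (line ID, text) pairs whose zone and/or line type match the labels.

    A label left as None is not filtered on.
    """
    root = ET.fromstring(alto_xml)
    # Map tag IDs (e.g. "BT2") to their human-readable labels.
    labels = {
        el.get("ID"): el.get("LABEL")
        for el in root.iter()
        if local(el.tag) == "OtherTag"
    }

    def has_label(el, wanted):
        if wanted is None:
            return True
        refs = (el.get("TAGREFS") or "").split()
        return any(labels.get(ref) == wanted for ref in refs)

    selected = []
    for block in root.iter():
        if local(block.tag) != "TextBlock" or not has_label(block, zone_label):
            continue
        for line in block:
            if local(line.tag) != "TextLine" or not has_label(line, line_label):
                continue
            text = " ".join(
                s.get("CONTENT", "") for s in line if local(s.tag) == "String"
            )
            selected.append((line.get("ID"), text))
    return selected


marginal = lines_by_tag(SAMPLE_ALTO, zone_label="MarginTextZone")
interlinear = lines_by_tag(SAMPLE_ALTO, line_label="InterlinearLine")
```

Here `marginal` would hold the lines of the marginal zone and `interlinear` the interlinear gloss, which could then be compared with main-text lines for script differentiation. If the real export uses different label names or another format (e.g. PAGE XML), the lookup would need adapting.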
Eventually, fine-tuning more robust models for bilingual texts in Latin and vernacular languages, even though I feel this is already a thing.
I'm late to the party, but I just reread the article after your corrections and it all looks good to me!
I'll submit it in a little while, then ;)
Paper itself
Data