HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Add "file" counts for a few datasets #87

Open alix-tz opened 1 year ago

alix-tz commented 1 year ago

@PonteIneptique do you have any objection to adding the following information to the catalog?

| Dataset name | (XML) file count |
| --- | --- |
| Handwritten Text Recognition Ground Truth Set: StABS Ratsbücher O10, Urfehdenbuch X | 201 |
| Charters and Records of Königsfelden Abbey and Bailiwick (1308-1662) | 283 |
| The POPP datasets | 235 |
| Eutyches | 129 |
| FoNDUE-GasparoSardiToponomasia-Dataset | 49 |
| FoNDUE Spanish chapbooks 19th c. Dataset | 198 |
| Éditer la correspondance de Constance de Salm (1767-1845) | 45 |
| Jeu de données OCR - Incunables sévillans 1494-1500 | 62 |
| Données vérité de terrain HTR+ Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (1898-1923) | 169 |

I went through each of these repositories to count the number of XML files corresponding to ground truth. Note that for "Handwritten Text Recognition Ground Truth Set: StABS Ratsbücher O10, Urfehdenbuch X", I only counted the PAGE files (all the ALTO files have a PAGE equivalent, which is not true the other way around). Same for "Données vérité de terrain HTR+ Annuaire des propriétaires et des propriétés de Paris et du département de la Seine (1898-1923)".
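For anyone who wants to reproduce this kind of count, here is a minimal sketch of one possible approach. It is not necessarily how the numbers above were obtained. It assumes the ground truth lives in `*.xml` files somewhere under a local clone, and that PAGE and ALTO files can be told apart by the namespace of their root element. The function names `xml_format` and `count_ground_truth` are hypothetical.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def xml_format(path):
    """Classify an XML file as 'page', 'alto' or 'other' from its root tag.

    Assumption: PAGE files use a namespace containing 'pagecontent' and
    ALTO files use a namespace or root element containing 'alto'.
    """
    try:
        with open(path, "rb") as handle:
            # Only the first "start" event is needed: it carries the root element.
            _, root = next(ET.iterparse(handle, events=("start",)))
    except (ET.ParseError, StopIteration):
        return "other"  # not well-formed XML, or an empty file
    tag = root.tag.lower()
    if "pagecontent" in tag:
        return "page"
    if "alto" in tag:
        return "alto"
    return "other"


def count_ground_truth(repo_dir):
    """Count the XML files under repo_dir, grouped by detected format."""
    return Counter(xml_format(p) for p in Path(repo_dir).rglob("*.xml"))
```

Calling `count_ground_truth("path/to/clone")` returns a `Counter` keyed by format. For a repository where every ALTO file has a PAGE equivalent, the `"page"` entry would be the figure to report, so that no page is counted twice.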

If we add these metrics, we would have the "file" metric available for every dataset currently listed in the catalog.

PonteIneptique commented 1 year ago

No issue. You can also, if you want, run humg. I usually do it for external datasets when I have time....

On Mon, 31 Oct 2022 at 11:58 PM, Alix Chagué wrote: [quoted copy of the comment above]