add new gt metadata yml files

HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents

https://htr-united.github.io

Creative Commons Zero v1.0 Universal


add new gt metadata yml files #143

Open tboenig opened 4 months ago

tboenig commented 2 months ago

Hello, can you please merge the PR? Thank you.

PonteIneptique commented 2 months ago

Dear @tboenig and @bertsky, I am very happy that the collaboration is going forward, but I must say I am a bit lost about the extent of the dataset, what seems like duplication of corpora (??). Could you provide a little more insight?

Also, be careful with the Goth script code: it is for the Gothic language (and not runes, as I said: https://en.wikipedia.org/wiki/Gothic_alphabet). I think you mean Latf.

bertsky commented 2 months ago

@PonteIneptique

but I must say I am a bit lost about the extent of the dataset, what seems like duplication of corpora (??)

What do you mean by duplication? The various entries are subcorpora of https://github.com/OCR-D/gt_structure_all, which is basically a subcorpus of deutschestextarchiv.de split into smaller chunks.

I proposed aggregating them into a single dataset here. But since the metadata.yml files are generated via CI on our side (for each repo independently), that might be difficult to achieve...

tboenig commented 2 months ago

Hello @PonteIneptique,

Thank you for the rigorous check of the data records. I have changed Goth to Latf.

To @bertsky:

https://github.com/OCR-D/gt_structure_all is a metarepo that links all datasets.

Maybe a future version of HTR-United should consider how such metarepos are represented in the catalog.

I suggest that the datasets first be published in the catalog. Improvements can always be made in a second or subsequent step.

Of course, the metadata/data must be correct. Thank you again for the check.

All the best, tboenig

PonteIneptique commented 2 months ago

Dear both, given that nothing differentiates each repository except its name (same authors, same language, same scripts, etc.), and given that their names are non-semantic, I would probably refuse such a "massive" push (19 files) for usability reasons. The meta-repo, however, is completely welcome.

bertsky commented 2 months ago

@PonteIneptique – understood, @tboenig is already working on a solution.

PonteIneptique commented 2 months ago

Thank you for your understanding :)