Open tboenig opened 4 months ago
Dear @tboenig and @bertsky, I am very happy that the collaboration is going forward, but I must say I am a bit lost about the extent of the dataset, what seems like duplication of corpora (??). Could you provide a little more insight?
Also, be careful with the Goth script code: it denotes the Gothic alphabet, used for the Gothic language (and not runes, as I said: https://en.wikipedia.org/wiki/Gothic_alphabet ). I think you mean Latf, the code for the Fraktur variant of Latin.
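To make the distinction concrete, here is a minimal sketch of a check that could catch this kind of mix-up before a catalog entry is submitted. It assumes a small hand-written excerpt of ISO 15924 codes rather than the full registry, and the helper names `describe_script` and `check_declared_scripts` are hypothetical, not part of any HTR-United tooling.

```python
# Hand-written excerpt of ISO 15924 four-letter script codes.
# This is NOT the full registry, only the codes relevant to this thread.
ISO_15924_EXCERPT = {
    "Goth": "Gothic (alphabet of the Gothic language)",
    "Latf": "Latin (Fraktur variant)",
    "Latn": "Latin",
    "Runr": "Runic",
}


def describe_script(code: str) -> str:
    """Return the script name for a code, or raise if it is not in the excerpt."""
    try:
        return ISO_15924_EXCERPT[code]
    except KeyError:
        raise ValueError(f"unknown or unlisted script code: {code!r}") from None


def check_declared_scripts(codes, expected_keyword):
    """Return the declared codes whose description does not mention expected_keyword."""
    keyword = expected_keyword.lower()
    return [c for c in codes if keyword not in describe_script(c).lower()]


if __name__ == "__main__":
    # A Fraktur print corpus declared as "Goth" is flagged; "Latf" passes.
    print(check_declared_scripts(["Goth"], "Fraktur"))
    print(check_declared_scripts(["Latf"], "Fraktur"))
```

A keyword match against the description is deliberately crude; it only illustrates that Goth and Latf name different scripts, and a real check would compare against the official code list.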
@PonteIneptique
but I must say I am a bit lost about the extent of the dataset, what seems like duplication of corpora (??)
What do you mean duplication? The various entries are subcorpora of https://github.com/OCR-D/gt_structure_all, which is basically a subcorpus of deutschestextarchiv.de split into smaller chunks.
I proposed aggregating them into a single dataset here. But since the metadata.yml files are generated via CI on our side (for each repo independently), that might be difficult to achieve...
Hello @PonteIneptique,
Thank you for the rigorous check of the data records. I have changed Goth to Latf.
to @bertsky
https://github.com/OCR-D/gt_structure_all is a metarepo that links all datasets.
For a future version of HTR-United, it may be worth considering how such metarepos are represented in the catalog.
I suggest that first of all the datasets are published in the catalog. In a second or subsequent step, you can always make improvements.
Of course, the metadata/data must be correct. Thank you again for the check.
All the best, tboenig
Dear both, given that nothing differentiates each repository except its name (same authors, same language, same scripts, etc.), and given that their names are non-semantic, I would probably refuse such a "massive" push (19 files) for usability reasons. The meta-repo, however, is completely welcome.
@PonteIneptique – understood, @tboenig is already working on a solution.
Thank you for your understanding :)
Hello, could you please merge the PR? Thank you.