HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal

Create bullinger-htr-dataset.yml #102

Closed: pstroe closed this 1 year ago

pstroe commented 1 year ago

integration of a new dataset

alix-tz commented 1 year ago

Hello! It all looks good to me as far as the description is concerned. However, I tried taking a look at your data (via Zenodo) and it seems that the .zip files are empty. Could you check that your files are correct?

pstroe commented 1 year ago

hi alix, the github repo is pretty big (50 GB), and i just enabled syncing with github on zenodo, so i don't think it will all be loaded onto zenodo. do you think a remark in the readme about the download (only possible via github lfs) would solve the issue?
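When Git LFS objects have not been fetched, a checkout (or a zip built from one) contains small text pointer files instead of the real data, which would explain the seemingly empty archives mentioned above. Below is a minimal sketch for spotting such stubs; the checkout directory name is a hypothetical example.

```python
from pathlib import Path

# Un-fetched Git LFS files are tiny text pointers that start with this
# header and are at most 1024 bytes by the LFS specification.
LFS_HEADER = b"version https://git-lfs.github.com/spec/v1"

def find_lfs_stubs(root):
    """Yield files that are LFS pointers rather than real content."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_size <= 1024:
            with path.open("rb") as f:
                if f.read(len(LFS_HEADER)) == LFS_HEADER:
                    yield path

# Hypothetical checkout directory; `git lfs pull` fetches the real files.
for stub in find_lfs_stubs("bullinger-htr-dataset"):
    print("un-fetched LFS pointer:", stub)
```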

alix-tz commented 1 year ago

Even 50 GB is big for GitHub, to be honest, but yes, you should definitely mention it somewhere; otherwise people might think the dataset is not usable. I was initially trying to download the zip files individually from GitHub, but it was just loading endlessly, which is why I switched to Zenodo.

If I may, I would also recommend offering a small zip file as a sort of preview, with maybe 5 images and the corresponding XML files, so that people can download them and try the dataset before downloading the whole 50 GB, in case it doesn't correspond to what they want.
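Such a preview archive is easy to script. A rough sketch, assuming a hypothetical layout where images and their XML transcriptions share a basename in `images/` and `xml/` directories:

```python
import zipfile
from pathlib import Path

# Hypothetical layout: images/<name>.jpg paired with xml/<name>.xml.
images = sorted(Path("images").glob("*.jpg"))[:5]  # 5-image preview

with zipfile.ZipFile("preview.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    for img in images:
        xml = Path("xml") / (img.stem + ".xml")
        if not xml.exists():
            continue  # skip images without a transcription
        zf.write(img, arcname=f"preview/{img.name}")
        zf.write(xml, arcname=f"preview/{xml.name}")
```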

PonteIneptique commented 1 year ago

Hi both, I would just like to add that it is possible to edit an automatic upload and add files manually: it might be something to look into if the automatic linking between Zenodo and GitHub did not work :)

alix-tz commented 1 year ago

Are the images available somewhere else, like on an IIIF server or another public server? If so, and as long as that image server is reliable, you could reduce the size of the repo by just documenting how to access the images and leaving only the transcriptions in the repository. Or you could split the dataset: keep the XML files in the repo and use another solution for hosting the images.

@PonteIneptique do you have two cents to give on this?
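For reference, serving the images over the IIIF Image API would let the repository ship only identifiers and XML, with images fetched on demand. A sketch of that access pattern, using a hypothetical server and identifier (the `/full/full/0/default.jpg` suffix is the standard request form of the Image API 2.x):

```python
from urllib.request import urlretrieve

# Hypothetical IIIF server and image identifier.
BASE = "https://iiif.example.org/iiif/2"
IDENTIFIER = "bullinger-page-0001"

# IIIF Image API 2.x request: full region, full size, no rotation.
url = f"{BASE}/{IDENTIFIER}/full/full/0/default.jpg"
urlretrieve(url, f"{IDENTIFIER}.jpg")
```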

pstroe commented 1 year ago

thank you both for your suggestions. i will upload a small sample of the training data tomorrow. the images are only available from our servers (they are preprocessed and, so to speak, ready to use). external storage would be possible, i guess, but it would again take time to set up.

pstroe commented 1 year ago

i have added a sample now. i'm aware that it is not common to store so much data in a git repo, but with lfs the download should also work without problems. i added a specific remark in the github repo on how to clone it.

pstroe commented 1 year ago

any updates? :-)

PonteIneptique commented 1 year ago

@alix-tz has this PR assigned to her, but the sheer mass of text has made it a tough one to review. It is still on her plate (but she's on vacation today, if I am not mistaken). We did not forget you ;)

alix-tz commented 1 year ago

Hello @pstroe!

Sorry for the late response. I got stuck trying to unzip the archives in the repository (I eventually managed to do it), and it made me forget to get back to you.

Your dataset raises a question which is not simple to answer quickly: we looked at the samples (thank you for not cherry-picking) and found at least 3 or 4 lines that were incomplete (words missing) or "overcomplete" (additional text). As I understand from the description, this makes sense, since the data was produced by automatically aligning the images with the transcriptions.

We think we would need to set up a special tag for this kind of dataset (like "silver data"?) before we can add it to the catalog. Is it ok for you if this PR remains pending while we find a way to address this kind of situation?
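One rough way to screen such automatically aligned ("silver") data for incomplete or overcomplete lines is to compare the length of each transcription with the width of its line region. A heuristic sketch, assuming PAGE XML in the 2013-07-15 namespace; the characters-per-pixel thresholds and the file name are purely illustrative:

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

def suspicious_lines(page_xml, lo=0.005, hi=0.1):
    """Flag TextLines whose characters-per-pixel ratio looks implausible."""
    root = ET.parse(page_xml).getroot()
    for line in root.iterfind(".//pc:TextLine", NS):
        coords = line.find("pc:Coords", NS)
        text = line.find("pc:TextEquiv/pc:Unicode", NS)
        if coords is None or text is None or not (text.text or "").strip():
            continue
        # PAGE Coords store points as "x1,y1 x2,y2 ..."; take the x-extent.
        xs = [int(p.split(",")[0]) for p in coords.get("points").split()]
        width = max(xs) - min(xs)
        if width <= 0:
            continue
        ratio = len(text.text) / width
        if not lo <= ratio <= hi:  # illustrative thresholds only
            yield line.get("id"), ratio

# Example run on one hypothetical page file.
for line_id, ratio in suspicious_lines("sample_page.xml"):
    print(f"check line {line_id}: {ratio:.3f} chars/px")
```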

PonteIneptique commented 1 year ago

Ok, we finally got to it, welcome to Bullinger :)