pstroe closed this PR 1 year ago
Hello! It all looks good to me as far as the description is concerned. However, when I tried taking a look at your data (via Zenodo), it seems that the .zip files are empty. Could you check that your files are correct?
hi alix, the github repo is pretty big (50GB), and i just enabled syncing with github on zenodo, so i don't think it will all be loaded onto zenodo. do you think a remark in the readme about the download (only possible via github lfs) would solve the issue?
Even 50GB is big for GitHub, to be honest, but yes, you should definitely mention it somewhere; otherwise people might think the dataset is not usable. I initially tried to download the zip files individually from GitHub, but it was just loading endlessly, which is why I switched to Zenodo.
If I may, I would also recommend offering a small zip file as a sort of preview, with maybe 5 images and the corresponding XML files, to allow people to download them and try out the dataset before downloading the whole 50GB, in case it doesn't correspond to what they want.
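To illustrate the preview idea, here is a minimal sketch that packs the first few image/XML pairs from a directory into a small zip. The file extensions, directory layout, and helper name are assumptions for illustration, not taken from the actual dataset:

```python
# Sketch: build a small "preview" archive with a handful of image/XML pairs
# so users can inspect the data format before downloading the full dataset.
# File names and extensions here are purely illustrative.
import zipfile
from pathlib import Path


def build_preview(src_dir: Path, out_zip: Path, n_pairs: int = 5) -> int:
    """Zip the first n image/XML pairs found in src_dir; return the entry count."""
    images = sorted(src_dir.glob("*.png"))[:n_pairs]
    with zipfile.ZipFile(out_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for img in images:
            xml = img.with_suffix(".xml")
            zf.write(img, img.name)          # the line image
            if xml.exists():
                zf.write(xml, xml.name)      # its transcription, if present
    with zipfile.ZipFile(out_zip) as zf:
        return len(zf.namelist())
```

A preview like this costs a few hundred kilobytes and lets people decide whether the full download is worth it.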
Hi both, I would just like to add that it is possible to edit the automatic upload and add files manually: it might be something to look into if the automatic linking between Zenodo and GitHub did not work :)
Are the images available somewhere else, like on an IIIF server or another public server? If so, and as long as that image server is reliable, you could reduce the size of the repo by just documenting how to access the images and leaving only the transcriptions in the repository. Or you could split the repository: keep the XML files in it and use another solution for hosting the images.
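As an illustration of the "document how to access the images" option, a small sketch that writes a CSV mapping identifiers to IIIF Image API URLs. The server base URL and identifiers are entirely hypothetical; only the URL pattern itself follows the IIIF Image API:

```python
# Sketch: instead of storing images in the repo, ship a lookup file of
# remote image URLs. The base URL below is a made-up example server.
import csv

IIIF_BASE = "https://iiif.example.org/iiif/2"  # hypothetical IIIF endpoint


def image_url(identifier: str) -> str:
    # Full-size JPEG per the IIIF Image API URL pattern:
    # {base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}
    return f"{IIIF_BASE}/{identifier}/full/full/0/default.jpg"


def write_manifest(identifiers, fh) -> None:
    """Write a two-column CSV (identifier, image_url) to an open file handle."""
    writer = csv.writer(fh)
    writer.writerow(["identifier", "image_url"])
    for ident in identifiers:
        writer.writerow([ident, image_url(ident)])
```

The CSV would live next to the XML files in the repo, so the transcriptions stay versioned while the images are fetched on demand.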
@PonteIneptique do you have your two cents to give on this?
thank you both for your suggestions. i will upload a small sample of the training data tomorrow. the images are only available from our servers (they are preprocessed and, so to speak, ready to use). external storage would be possible, i guess, but it would again take time to set up.
i have added a sample now. i'm aware that it is not common to store so much data in a git repo, but with lfs the download should also work without problems. i added a specific remark in the github repo on how to clone it.
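For anyone landing on this thread: cloning a repository that stores its data via Git LFS typically looks like the following (the repository URL is a placeholder, and `git-lfs` must be installed on your machine first):

```shell
# One-time setup: enable the Git LFS hooks for your user
git lfs install

# Clone as usual; LFS-tracked files (the large zips) are downloaded automatically
git clone <repository-url>

# If a clone was made without the LFS content, fetch the large files afterwards
cd <repository-directory>
git lfs pull
```

Without `git lfs install`, a clone only retrieves small pointer files in place of the actual data.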
any updates? :-)
@alix-tz has this PR assigned to her, but the sheer mass of text has made it a tough one to review. It is still on her plate (but she is on vacation today, if I am not mistaken). We did not forget you ;)
Hello @pstroe!
Sorry for the late response; I got stuck trying to unzip the archives in the repository (I eventually managed to do it), and it made me forget to get back to you.
Your dataset raises a question that is not simple to answer quickly: we looked at the samples (thank you for not cherry-picking) and found at least 3 or 4 lines that were incomplete (words missing) or "overcomplete" (additional text). As I understand from the description, this makes sense, since the data was produced by automatically aligning the images and the transcriptions.
We think we would need to set up a special tag for this kind of dataset (something like "silver data"?) before we can add it to the catalog. Is it ok for you if this PR remains pending while we find a way to address this kind of situation?
Ok, we finally got to it, welcome to Bullinger :)
integration of new dataset