HTR-United / htr-united

Ground Truth Resources for the HTR of patrimonial documents
https://htr-united.github.io
Creative Commons Zero v1.0 Universal
Create biblia-arabica-example-code.yml #129

Closed nathangibson closed 1 month ago

alix-tz commented 8 months ago

Hello,

Thank you for this submission. Is it a work in progress or are you trying to submit it as is?

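There are several problems which need to be fixed before the entry can be added to the catalog.

* you need to provide a list of authors
* it would be helpful to provide a more precise description of the dataset so that potential re-users can understand what is in the dataset, in particular since the images are not freely available for a portion of the dataset (if I understand your documentation correctly)
* are the transcriptions really spanning from 900 to 1900?
* I think the rules listed in your transcription convention could be pasted in the "transcription guidelines" field (but this is something I can fix).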

That being said, my main issue is actually that I am not able to load the dataset in eScriptorium. I get the following errors when I do (see below), which might be caused by the fact that the value in "fileName" does not match the names of the image files. I tried on two instances of eScriptorium (v0.13.8b and v0.13.4b) with the same result. Did you try to import them into eScriptorium? Which version of eScriptorium did you use to export them? Did you generate them all on the command line with Kraken? If yes, with which version?

Note that some of the errors are expected, since I didn't load all the images.

Import in biblia-arabica
Status: Finished
Queued at: Nov. 10, 2023, 3:23 p.m.
Started at: Nov. 10, 2023, 3:23 p.m.
Ended at: Nov. 10, 2023, 3:23 p.m.
CPU usage: 14.984762666666667
GPU usage: None

No match found for file laud-or-258-unvocalized_013.xml with filename "laud-or-258_013.jpg".
[...]
No match found for file laud-or-258-unvocalized_042.xml with filename "laud-or-258_042.jpg".
Processing the page n°1 from the provided METS file
An exception occurred while processing the page: Invalid URL 'None/laud-or-258_013.xml': No scheme supplied. Perhaps you meant https://None/laud-or-258_013.xml?
Processing the page n°2 from the provided METS file
An exception occurred while processing the page: Invalid URL 'None/laud-or-258_014.xml': No scheme supplied. Perhaps you meant https://None/laud-or-258_014.xml?
Processing the page n°3 from the provided METS file
An exception occurred while processing the page: Invalid URL 'None/laud-or-258_015.xml': No scheme supplied. Perhaps you meant https://None/laud-or-258_015.xml?
Processing the page n°4 from the provided METS file
An exception occurred while processing the page: Invalid URL 'None/laud-or-258_016.xml': No scheme supplied. Perhaps you meant https://None/laud-or-258_016.xml?
Processing the page n°5 from the provided METS file
An exception occurred while processing the page: Invalid URL 'None/laud-or-258_017.xml': No scheme supplied. Perhaps you meant https://None/laud-or-258_017.xml?
Processing the page n°6 from the provided METS file
An exception occurred while processing the page: Invalid URL 'None/laud-or-258_018.xml': No scheme supplied. Perhaps you meant https://None/laud-or-258_018.xml?
Processing the page n°7 from the provided METS file
An exception occurred while processing the page: Invalid URL 'None/laud-or-258_019.xml': No scheme supplied. Perhaps you meant https://None/laud-or-258_019.xml?
[...]
No match found for file OxfordLaudOr258_723.xml with filename "laud-or-258_723.jpg".
No match found for file OxfordLaudOr258_724.xml with filename "laud-or-258_724.jpg".

nathangibson commented 7 months ago

Thanks so much, and apologies that it took me a while before I saw your reply!

Thank you for this submission. Is it a work in progress or are you trying to submit it as is?

We would like to do more work on it but think it is already useful.

There are several problems which need to be fixed before the entry can be added to the catalog.

* you need to provide a list of authors

Done (see the above merges).

* it would be helpful to provide a more precise description of the dataset so that potential re-users can understand what is in the dataset, in particular since the images are not freely available for a portion of the dataset (if I understand your documentation correctly)

Will work on this -- basically explaining the image rights?

* are the transcriptions really spanning from 900 to 1900?

Yes, although there are only a few pages of the later material.

* I think the rules listed in your transcription convention could be pasted in the "transcription guidelines" field (but this is something I can fix).

Done.

That being said, my main issue is actually that I am not able to load the dataset in eScriptorium. I get the following errors when I do (see below), which might be caused by the fact that the value in "fileName" does not match the names of the image files. I tried on two instances of eScriptorium (v0.13.8b and v0.13.4b) with the same result. Did you try to import them into eScriptorium? Which version of eScriptorium did you use to export them? Did you generate them all on the command line with Kraken? If yes, with which version?

Sorry, I think the issue was that I changed the filenames after download, without realizing this would break the METS import. I've corrected this now (e.g. https://github.com/biblia-arabica/academies/tree/main/htr/ground-truth).
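
For anyone who wants to check for this kind of mismatch before importing, here is a minimal sketch. It assumes the transcriptions are ALTO-style XML files that name their image in a `fileName` element and that the images sit in the same folder as the XML files; both are assumptions rather than something confirmed in this thread. The function names (`local_name`, `referenced_image`, `find_mismatches`) are hypothetical helpers, not part of eScriptorium or Kraken.

```python
import xml.etree.ElementTree as ET
from pathlib import Path
from typing import List, Optional, Tuple


def local_name(tag: str) -> str:
    """Reduce a namespaced tag such as '{http://...}fileName' to 'fileName'."""
    return tag.rsplit("}", 1)[-1]


def referenced_image(xml_path: Path) -> Optional[str]:
    """Return the text of the first fileName element, or None if there is none."""
    for element in ET.parse(xml_path).iter():
        if local_name(element.tag) == "fileName" and element.text:
            return element.text.strip()
    return None


def find_mismatches(directory: Path) -> List[Tuple[str, str]]:
    """List (XML file, referenced image) pairs whose image is missing from `directory`.

    XML files without a fileName element (for example a METS file) are skipped.
    """
    present = {p.name for p in directory.iterdir() if p.is_file()}
    mismatches = []
    for xml_path in sorted(directory.glob("*.xml")):
        image = referenced_image(xml_path)
        if image is not None and image not in present:
            mismatches.append((xml_path.name, image))
    return mismatches
```

Calling `find_mismatches(Path("htr/ground-truth"))` on a local checkout (the path is only an example) returns an empty list when every referenced image is present. It only compares names inside one folder, so it does not validate the METS file itself.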

Another main issue was that it wasn't clear where the ground truth was located. I've restructured the repository to make this clearer. If you think https://github.com/biblia-arabica/academies/tree/main/htr/ground-truth is in order, I will do the same for the other manuscripts. Thanks for your input!

PonteIneptique commented 2 months ago

@alix-tz Can you have a look?

alix-tz commented 1 month ago

Ok, we are good now I believe!

I'm sorry, @nathangibson, that this took so long. I hadn't realized that you had updated the metadata because you didn't report the changes in the file attached to this PR. But that shouldn't have blocked us from adding the description to the catalog.

Thank you very much for your contribution!

nathangibson commented 1 month ago

@alix-tz Thanks for this! And apologies, I'm not familiar with the process, so I didn't know about the file attached to the PR. I appreciate your including us!