OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

MetsReader and ImageLoader: support remote URLs #329

Closed bertsky closed 9 months ago

bertsky commented 1 year ago

Since https://github.com/OCR4all/LAREX/commit/453ff15be0af23b3eb78823cd3e14efb438f5135, if the METS contains file references which are true URLs (not local file paths), then the library will not crash, but the respective book will be empty. (I also cannot seem to leave such an empty book afterwards; I have to reload the page entirely.)

It would be really helpful if LAREX was able to manage such remote files in a semi-transparent manner:

bertsky commented 1 year ago

Oh, and I should mention that it would make perfect sense to coordinate this with the upcoming OCR-D feature that multiple FLocat refs are allowed per file. This will enable keeping the original remote presentation links in addition to downloaded local paths (with sane file names), so after processing the temporary local refs can be converted back to public and removed.

So if LAREX supports URL refs, it should also already support ignoring such remote refs if local refs are additionally present.

bertsky commented 1 year ago

> So if LAREX supports URL refs, it should also already support ignoring such remote refs if local refs are additionally present.

Ok, judging by the code, this should currently work already (i.e. http FLocats will be ignored).

I have also tested this successfully.
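To illustrate the behaviour described above (remote FLocats ignored when a local one exists), here is a minimal Python sketch. It is not LAREX's actual code, which is Java; the function name `local_refs`, the sample METS and the example.org URLs are made up for illustration. It assumes that a ref counts as remote if its scheme is http or https.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

NS = {"mets": "http://www.loc.gov/METS/"}
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample: FILE_0001 has a remote and a local ref, FILE_0002 only a remote one.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="FILE_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/images/00000001.tif"/>
        <mets:FLocat LOCTYPE="OTHER" OTHERLOCTYPE="FILE" xlink:href="OCR-D-IMG/FILE_0001.tif"/>
      </mets:file>
      <mets:file ID="FILE_0002" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/images/00000002.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def is_remote(href):
    """Assumption: only http(s) refs count as remote."""
    return urlparse(href).scheme in ("http", "https")


def local_refs(mets_xml):
    """Map each mets:file ID to its first non-remote FLocat href.

    Files that only have remote refs are left out of the result.
    """
    root = ET.fromstring(mets_xml)
    result = {}
    for file_el in root.iterfind(".//mets:file", NS):
        hrefs = [loc.get(XLINK_HREF, "") for loc in file_el.iterfind("mets:FLocat", NS)]
        local = [h for h in hrefs if h and not is_remote(h)]
        if local:
            result[file_el.get("ID")] = local[0]
    return result


print(local_refs(SAMPLE))
```

With this sample, only FILE_0001 ends up in the result, which mirrors the "empty book" effect when every file is remote-only.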

So nothing special needs to be done in LAREX after all – an external program could simply download all files of the required fileGrps and change them to local refs with sane file names (as ocrd workspace find --download does, but keeping the remote refs, as with mm-update).
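A rough sketch of what such an external program might do, in Python: for every file that only has a remote ref, download it and append a second, local FLocat while leaving the remote one in place. This is not how `ocrd workspace find --download` or `mm-update` are implemented. The names `add_local_refs` and `download`, the `<fileGrp USE>/<file ID><ext>` naming scheme and the sample METS are assumptions made for illustration, and the `LOCTYPE="OTHER" OTHERLOCTYPE="FILE"` attributes are assumed to be the OCR-D convention for local paths.

```python
import os
import xml.etree.ElementTree as ET
from urllib.parse import urlparse
from urllib.request import urlretrieve

METS = "http://www.loc.gov/METS/"
NS = {"mets": METS}
HREF = "{http://www.w3.org/1999/xlink}href"

# Hypothetical sample with a single remote-only file.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/" xmlns:xlink="http://www.w3.org/1999/xlink">
  <mets:fileSec>
    <mets:fileGrp USE="OCR-D-IMG">
      <mets:file ID="FILE_0001" MIMETYPE="image/tiff">
        <mets:FLocat LOCTYPE="URL" xlink:href="https://example.org/images/00000001.tif"/>
      </mets:file>
    </mets:fileGrp>
  </mets:fileSec>
</mets:mets>"""


def download(url, target):
    """Default fetcher: create the target directory, then download."""
    os.makedirs(os.path.dirname(target), exist_ok=True)
    urlretrieve(url, target)


def add_local_refs(root, workdir, fetch=download):
    """Add a local FLocat next to each remote-only one; remote refs are kept."""
    for grp in root.iterfind(".//mets:fileGrp", NS):
        for file_el in grp.iterfind("mets:file", NS):
            hrefs = [loc.get(HREF, "") for loc in file_el.iterfind("mets:FLocat", NS)]
            remote = [h for h in hrefs if urlparse(h).scheme in ("http", "https")]
            if not remote or len(remote) < len(hrefs):
                continue  # nothing remote, or a non-remote ref already exists
            # Assumed "sane" name: <fileGrp USE>/<file ID><extension of the URL path>
            ext = os.path.splitext(urlparse(remote[0]).path)[1]
            rel = f"{grp.get('USE', 'files')}/{file_el.get('ID')}{ext}"
            fetch(remote[0], os.path.join(workdir, rel))
            local = ET.SubElement(file_el, f"{{{METS}}}FLocat")
            local.set("LOCTYPE", "OTHER")
            local.set("OTHERLOCTYPE", "FILE")
            local.set(HREF, rel)
    return root
```

Passing a custom `fetch` callable (for instance one that only logs the URL and target path) allows a dry run without any network access. The modified tree can then be written back with `ET.ElementTree(root).write(...)`.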

What remains to be done is to instruct users how to do so. (Currently, they'll simply be surprised to get an empty fileGrp list if everything is remote URLs.)

Should we leave this open as a documentation issue?

bertsky commented 1 year ago

Alas, it does not work on dev anymore!

If a file has a secondary remote FLocat, then it will not show up as a page in the editor (despite the fact that it was activated in the library dialog). So if all files are formatted this way, then no pages are shown.

My guess is that this change is responsible.

maxnth commented 1 year ago

> My guess is that this change is responsible.

Argh, that's annoying and shouldn't have happened. We'll try to find some time to look into this issue (and some of the others like #240) in the following days / weeks.

bertsky commented 1 year ago

Any news on this? I'd really like to switch to the newest dev version because of the other fixes, but this breaking change is a show-stopper for me.

maxnth commented 1 year ago

Still on our backlog (I promise), but sadly we haven't gotten to it yet. Will update as soon as we find some time (hopefully sooner rather than later).

M3ssman commented 11 months ago

@maxnth Have you considered using a dedicated component for METS handling, like mets-model?

maxnth commented 11 months ago

> My guess is that this change is responsible.

Finally got to looking into it. This indeed messed with loading annotations in METS projects. Starting from 70be72 this now works for me again (and I cautiously hope for others as well, otherwise I'll look into it again), while also allowing loading annotations from files with certain special characters in the file name (which the "fix" above was intended to solve).

maxnth commented 9 months ago

I'm gonna mark this as fixed. In case I missed something, feel free to reopen this issue.