OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

Unclear: localsave proper configuration #260

Closed M3ssman closed 3 years ago

M3ssman commented 3 years ago

Issue description

In my interpretation of the regular standalone LAREX-setup, it is possible to configure a local path for storing the data by adjusting the larex.config. I therefore set localsave to bookpath, but no PAGE file is actually written to the specified location; instead, the result is offered for download when the SAVE RESULT button is pressed.

Steps to reproduce the issue

  1. Edit larex.config and set bookpath to a local path such as bookpath:/home/hartwig/Dokumente/larex-testdaten/books
  2. Start Webapp
  3. Confirm data is loaded from configured bookpath: ok
  4. Do something, like correct some text, press Enter, see text goes green: ok
  5. Press SAVE RESULT

What's the expected result?

What's the actual result?

Platform

maxnth commented 3 years ago

In my interpretation of the regular standalone LAREX-setup, it is possible to configure a local path for storing the data by adjusting the larex.config

This is correct and your described setup looks perfectly good as well.

Regarding the download popup: In case you haven't explicitly set websave:false in larex.config yet, this might fix it as the default value for it is true.

Regarding LAREX not writing the result XMLs to bookpath: Does LAREX throw any error (either in the frontend or backend)? Does the webapp / tomcat have write permissions for the bookpath?
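The two checks above (the websave default and write permissions on bookpath) can be sketched as a small script. This is only an illustrative sketch: it assumes larex.config consists of plain key:value lines with # comments, and the helper names parse_larex_config and check_save_settings are hypothetical, not part of LAREX.

```python
import os
from pathlib import Path


def parse_larex_config(text: str) -> dict:
    """Parse 'key:value' lines, skipping blank lines and '#' comments (assumed format)."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values may contain colons themselves.
        key, sep, value = line.partition(":")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def check_save_settings(settings: dict) -> list:
    """Return a list of likely problems; an empty list means nothing obvious was found."""
    problems = []
    # Per the thread, websave defaults to true, which causes the download popup.
    if settings.get("websave", "true").lower() != "false":
        problems.append("websave is not explicitly false: results are offered for download")
    if settings.get("localsave") != "bookpath":
        problems.append("localsave is not set to bookpath")
    bookpath = settings.get("bookpath")
    if not bookpath:
        problems.append("bookpath is not set")
    elif not Path(bookpath).is_dir():
        problems.append(f"bookpath does not exist: {bookpath}")
    elif not os.access(bookpath, os.W_OK):
        problems.append(f"no write permission for bookpath: {bookpath}")
    return problems
```

Note that os.access tests the permissions of the user running the script, so it is only meaningful when run as the same user as the webapp / tomcat.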

If neither is the cause of the problem / leads to anything, could you try spinning up a LAREX docker container (e.g. https://github.com/maxnth/LAREX_Docker or the development container in the dev branch)? We had some weird behavior with >Java 1.8 in the past so this way we could rule something related to this out.

M3ssman commented 3 years ago

Thanks for your advice!

With websave:false set, the popup goes away. Now I find myself with a file called pixel.xml inside the book's path (I'm not quite sure whether this file was really absent before), where the original PAGE file resides, still untouched. The pixel.xml has the contents I'd expect to be stored in the original PAGE, though.

There seem to be no frontend or backend errors, as far as I can see. Does the app contain Logging? The browser console just says:

request:/file/export/annotations - start communicator.js:11:30
request:/file/export/annotations - success communicator.js:13:13

when SAVE RESULT is pressed.

@docker I'll try later

maxnth commented 3 years ago

Now I find myself with a file called pixel.xml

Is it possible that the value for @imageFilename for the Page element inside the original XML file is e.g. pixel.png or something similar? LAREX currently takes the value at @imageFilename as the "real" basename.
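To check a PAGE file for this mismatch, a minimal sketch along the following lines may help. It assumes the output name is simply the stem of @imageFilename plus ".xml", which is inferred from the pixel / pixel.xml behaviour in this thread rather than from LAREX's code; the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def page_image_filename(page_xml: str):
    """Return the imageFilename attribute of the Page element, or None if absent."""
    root = ET.parse(page_xml).getroot()
    # '{*}' matches any (or no) namespace (Python 3.8+), since PAGE schema versions differ.
    page = root.find("{*}Page")
    return None if page is None else page.get("imageFilename")


def expected_output_name(page_xml: str):
    """Assumed rule: the result is written as '<stem of imageFilename>.xml'."""
    image = page_image_filename(page_xml)
    return None if not image else Path(image).stem + ".xml"


def basename_matches(page_xml: str) -> bool:
    """True if the assumed output name equals the name of the PAGE file itself."""
    return expected_output_name(page_xml) == Path(page_xml).name
```

If basename_matches returns False for a file in bookpath, saving would presumably create a second XML file next to the original instead of updating it.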

Does the app contain Logging?

Backend logging is very bare-bones at the moment, but everything available should get logged to wherever tomcat writes logs (CATALINA_HOME/logs or journalctl depending on the OS / setup).

M3ssman commented 3 years ago

catalina.out is rather sparse at the moment; it only prints something like LAREX context has been loaded/reloaded.

@pixel: You're absolutely right:

<Page imageFilename="pixel" imageHeight="10104" imageWidth="6814">

I wonder how this issue arises. The original is an ALTO file from Tesseract that doesn't contain any fileSourceInfo at all; it was then converted from ALTO to PAGE using git@github.com:UB-Mannheim/ocr-fileformat.git.

M3ssman commented 3 years ago

You're really awesome! I renamed the imageFilename as expected, and then it works very well! Ok, it kicks off any Word elements, but that's how it's supposed to work, right?

I really do believe this sort of magical matching should be documented. I can take care of this if you don't mind.
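For reference, the manual rename of imageFilename described above could be scripted roughly as follows. This is a hedged sketch using only Python's standard library; set_image_filename is a hypothetical helper, and writing to a separate output path is a deliberate choice so the original PAGE file stays untouched.

```python
import xml.etree.ElementTree as ET


def set_image_filename(page_xml: str, image_name: str, out_path: str) -> None:
    """Copy a PAGE file to out_path with Page/@imageFilename set to image_name."""
    tree = ET.parse(page_xml)
    root = tree.getroot()
    # Re-use the document's own namespace as the default one when serializing,
    # so elements are not rewritten with generated 'ns0:' prefixes.
    if root.tag.startswith("{"):
        ET.register_namespace("", root.tag[1:].split("}", 1)[0])
    page = root.find("{*}Page")  # any PAGE schema version (Python 3.8+)
    if page is None:
        raise ValueError("no Page element found in " + page_xml)
    page.set("imageFilename", image_name)
    tree.write(out_path, encoding="utf-8", xml_declaration=True)
```

A call such as set_image_filename("0001.xml", "0001.png", "0001.fixed.xml") would be a typical use; the file names here are made up. ElementTree drops XML comments by default, so the result should be reviewed before replacing the original.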

maxnth commented 3 years ago

Ok, it kicks off any Word elements, but that's how it's supposed to work, right?

Not really how it's supposed to work in the long run but sadly how it works in the current master :sweat_smile: But we're already working on changing this (see https://github.com/OCR4all/LAREX/issues/214 and the /refactor/PAGEXMLio branch).

I really do believe this sort of magical matching should be documented. I can take care of this if you don't mind.

I wouldn't mind at all.

bertsky commented 3 years ago

I wonder how this issue arises. The original is an ALTO file from Tesseract that doesn't contain any fileSourceInfo at all; it was then converted from ALTO to PAGE using git@github.com:UB-Mannheim/ocr-fileformat.git.

@M3ssman Tesseract currently does not (cannot) output a useful fileSourceInfo, because of the internal structure of its renderers. (It has to be able to cope with multi-page input like file lists and multi-page TIFF, so it throws the image filename away in between.) The PAGE conversion transform cannot reconstruct that.

M3ssman commented 3 years ago

@bertsky Right, so we are back at https://github.com/tesseract-ocr/tesseract/issues/2700. The issue itself seems to arise within PRIMALabs PageConverter, which is used internally by ocr-fileformat.

maxnth commented 3 years ago

Closed with #261