OCR4all / LAREX

A semi-automatic open-source tool for Layout Analysis and Region EXtraction on early printed books.
MIT License

existing segmentation does not load #280

Open bertsky opened 3 years ago

bertsky commented 3 years ago

I believe I found a regression in the current version (compared to 0.6-RC1, 27dc5bc): the page's existing PAGE-XML does not load. A brief warning appears (saying that segments could not be loaded), then LAREX autosegments. There is no related error in the logs / stdout.

(I did see an error message with the following stack-trace the other day, but it does not seem related, time-wise:)

```
29-Aug-2021 05:58:45.789 INFO [http-nio-8080-exec-1] org.apache.coyote.http11.Http11Processor.service Error parsing HTTP request header
 Note: further occurrences of HTTP request parsing errors will be logged at DEBUG level.
java.lang.IllegalArgumentException: Invalid character found in method name [0x030x000x00/*0xe00x000x000x000x000x00Cookie: ]. HTTP method names must be tokens
        at org.apache.coyote.http11.Http11InputBuffer.parseRequestLine(Http11InputBuffer.java:419)
        at org.apache.coyote.http11.Http11Processor.service(Http11Processor.java:269)
        at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65)
        at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:893)
        at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1723)
        at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
        at java.lang.Thread.run(Thread.java:748)
```

chaddy314 commented 3 years ago

Does this happen in mets or legacy mode? If it happens in mets mode: in which fileGrp (with mimeType) does it happen?

bertsky commented 3 years ago

In legacy mode (library shows flat).

Sorry, forgot to attach example data. Here it is: 179-heizkostenabrechnung_01.06.2018-31.05.2019-page01.zip

chaddy314 commented 3 years ago

Due to the `.` in the filename, LAREX treats everything after the first dot as a SubExtension and cuts it off when determining the filename for its PAGE-XML.

A quick workaround for this problem would be to use a different character in dates (or to rename the old PAGE-XML to match LAREX's expected format).
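
To make the effect concrete, here is a minimal sketch of first-dot truncation. It is not LAREX's actual code: the class, the helper `baseBeforeFirstDot`, and the `.png` image extension are assumptions made for illustration only.

```java
// Illustration only: NOT LAREX's implementation, just the behaviour described above.
class FirstDotDemo {

    // Hypothetical helper: keep only the part of the name before the first dot.
    static String baseBeforeFirstDot(String fileName) {
        int dot = fileName.indexOf('.');
        return dot < 0 ? fileName : fileName.substring(0, dot);
    }

    public static void main(String[] args) {
        // Image name with dots in the dates (the ".png" extension is assumed).
        String dotted = "179-heizkostenabrechnung_01.06.2018-31.05.2019-page01.png";
        // Same name with the workaround applied: no dots inside the dates.
        String renamed = "179-heizkostenabrechnung_01-06-2018_31-05-2019-page01.png";

        // Looks for "179-heizkostenabrechnung_01.xml", which does not exist.
        System.out.println(baseBeforeFirstDot(dotted) + ".xml");
        // Looks for the full base name plus ".xml", as intended.
        System.out.println(baseBeforeFirstDot(renamed) + ".xml");
    }
}
```

With the dotted name, the lookup ends at `_01`, so an existing PAGE-XML that carries the full name is never found.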

bertsky commented 3 years ago

Ah, many thanks – did not notice that crucial difference to all the other files (which were fine). Indeed, the workaround is trivial.

bertsky commented 3 years ago

Still, it would be great if LAREX was smarter on suffix detection. I often have files imported from PDF or multi-page TIFF which discerns pages via a NAME.PAGE.tif scheme...

bertsky commented 3 years ago

> Still, it would be great if LAREX was smarter on suffix detection. I often have files imported from PDF or multi-page TIFF which discerns pages via a NAME.PAGE.tif scheme...

The old version already covered this case IIRC.
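
One possible reading of "smarter" here, sketched under assumptions: strip only the final extension, so the page counter in a NAME.PAGE.tif name stays part of the base name. The class and `stripLastExtension` are hypothetical names, not taken from LAREX (old or new).

```java
// Sketch of one possible approach; not taken from any LAREX version.
class LastDotDemo {

    // Hypothetical helper: remove only the last ".ext", keep earlier dots.
    static String stripLastExtension(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // dot <= 0 covers names without a dot and names that start with one.
        return dot <= 0 ? fileName : fileName.substring(0, dot);
    }

    public static void main(String[] args) {
        // The page counter survives, so each page maps to its own PAGE-XML.
        System.out.println(stripLastExtension("NAME.0000.tif")); // NAME.0000
        System.out.println(stripLastExtension("NAME.0003.tif")); // NAME.0003
    }
}
```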

maxnth commented 3 years ago

> Still, it would be great if LAREX was smarter on suffix detection.

I fully agree: it's more restrictive and "complex" than it needs to be. We'll definitely look into this.

bertsky commented 3 years ago

Oh, and to make matters worse: commas are not allowed either, even in METS mode. The open dialog looks good but does not succeed, because the directory gets split along `,` and only the last part survives. For a path `Nachrichten_aus_der_Bruder-Gemeine,_1819,_No._01` and fileGrp `TEXT`, this yields:

```
java.io.FileNotFoundException: /usr/local/tomcat/_No._01/TEXT/TEXT_0001.xml (No such file or directory)
        at java.io.FileInputStream.open0(Native Method)
        at java.io.FileInputStream.open(FileInputStream.java:195)
        at java.io.FileInputStream.<init>(FileInputStream.java:138)
        at java.io.FileInputStream.<init>(FileInputStream.java:93)
        at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
        at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
        at com.sun.org.apache.xerces.internal.impl.XMLEntityManager.setupCurrentEntity(XMLEntityManager.java:623)
        at com.sun.org.apache.xerces.internal.impl.XMLVersionDetector.determineDocVersion(XMLVersionDetector.java:148)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:806)
        at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:771)
        at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:141)
        at com.sun.org.apache.xerces.internal.parsers.DOMParser.parse(DOMParser.java:243)
        at com.sun.org.apache.xerces.internal.jaxp.DocumentBuilderImpl.parse(DocumentBuilderImpl.java:339)
        at javax.xml.parsers.DocumentBuilder.parse(DocumentBuilder.java:177)
        at de.uniwue.web.io.MetsReader.parseXML(MetsReader.java:95)
        at de.uniwue.web.io.MetsReader.getImagePathFromPage(MetsReader.java:102)
        at de.uniwue.web.controller.ViewerController.direct(ViewerController.java:119)
```
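
The `_No._01` in that path is exactly what remains when the directory name is split on commas and only the last element is kept. The sketch below reproduces just that symptom; where the split actually happens inside LAREX is not shown in the trace, and `lastCommaPart` is a hypothetical helper.

```java
// Reproduces the symptom only; the real split location in LAREX is unknown here.
class CommaSplitDemo {

    // Hypothetical helper: split on commas and keep the last element.
    static String lastCommaPart(String value) {
        String[] parts = value.split(",");
        return parts[parts.length - 1];
    }

    public static void main(String[] args) {
        String bookDir = "Nachrichten_aus_der_Bruder-Gemeine,_1819,_No._01";
        // Prints "_No._01", the directory seen in the FileNotFoundException.
        System.out.println(lastCommaPart(bookDir));
    }
}
```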

bertsky commented 2 years ago

Another effect of the additional dots/commas in the filenames, besides the segmentation not loading (now fixed?) or the open dialog not concluding (commas in bookdir?), is that only the last page of each subset shows up (e.g. only `*.0003.tif` if you actually have `*.0000.tif` up to `*.0003.tif`).
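
One mechanism that would produce this symptom, stated as an assumption rather than a confirmed diagnosis: if pages are indexed by the name truncated at the first dot, all pages of a subset share one key and each overwrites the previous one. The class, `indexByFirstDot`, and the `scan.*` file names are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Assumed mechanism for illustration; not verified against LAREX's code.
class PageCollapseDemo {

    // Hypothetical index: key is the file name up to the first dot.
    static Map<String, String> indexByFirstDot(List<String> fileNames) {
        Map<String, String> pages = new LinkedHashMap<>();
        for (String name : fileNames) {
            int dot = name.indexOf('.');
            String key = dot < 0 ? name : name.substring(0, dot);
            pages.put(key, name); // a later file with the same key replaces the earlier one
        }
        return pages;
    }

    public static void main(String[] args) {
        List<String> files = Arrays.asList(
                "scan.0000.tif", "scan.0001.tif", "scan.0002.tif", "scan.0003.tif");
        // Prints {scan=scan.0003.tif}: four files, one visible page.
        System.out.println(indexByFirstDot(files));
    }
}
```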

bertsky commented 1 year ago

@maxnth low priority, really? Goobi and Kitodo for example produce paths like `AlbuRounC_1666480371_04150_tif/jpegs/00000001.tif.small.jpg` all the time. These nested suffixes still break here.