cisocrgroup / ocrd-postcorrection

OCR-D profiler-based post-correction of historical OCR
MIT License

cannot run with relative path for METS #13

Open bertsky opened 7 months ago

bertsky commented 7 months ago

In OCR-D, long ago we moved away from absolute filenames and file:// refs in FLocat.

When calling de.lmu.cis.ocrd.cli.PostCorrectionCommand with an absolute path to the METS, it runs through, but produces output FLocats with absolute paths, which is (now) incorrect.

But when calling with just mets.xml inside the workspace directory, the postprocessor crashes:

22:32:03.614 DEBUG cis.PostCorrectionCommand - loading page
java.lang.NullPointerException
    at de.lmu.cis.ocrd.pagexml.METS$File.openLocalPath(METS.java:175)
    at de.lmu.cis.ocrd.pagexml.METS$File.openInputStream(METS.java:161)
    at de.lmu.cis.ocrd.pagexml.METSFileGroupReader.getPages(METSFileGroupReader.java:41)
    at de.lmu.cis.ocrd.pagexml.METSFileGroupReader.eachWord(METSFileGroupReader.java:54)
    at de.lmu.cis.ocrd.pagexml.METSFileGroupReader.getBaseOCRTokenReader(METSFileGroupReader.java:77)
    at de.lmu.cis.ocrd.pagexml.Workspace.getBaseOCRTokenReader(Workspace.java:33)
    at de.lmu.cis.ocrd.cli.ParametersCommand.getProfile(ParametersCommand.java:92)
    at de.lmu.cis.ocrd.cli.ParametersCommand.getProfile(ParametersCommand.java:61)
    at de.lmu.cis.ocrd.cli.PostCorrectionCommand.predictRankings(PostCorrectionCommand.java:96)
    at de.lmu.cis.ocrd.cli.PostCorrectionCommand.postCorrect(PostCorrectionCommand.java:61)
    at de.lmu.cis.ocrd.cli.PostCorrectionCommand.execute(PostCorrectionCommand.java:37)
    at de.lmu.cis.ocrd.cli.Main.run(Main.java:33)
    at de.lmu.cis.ocrd.cli.Main.main(Main.java:9)

The reason is simply that when input files are opened via METS.File.openLocalPath, the first reference in https://github.com/cisocrgroup/ocrd-postcorrection/blob/49decc4b9b2f38a16c49ff3b3be36a708a4d5077/src/main/java/de/lmu/cis/ocrd/pagexml/METS.java#L175 is null. The file instance gets created in https://github.com/cisocrgroup/ocrd-postcorrection/blob/49decc4b9b2f38a16c49ff3b3be36a708a4d5077/src/main/java/de/lmu/cis/ocrd/pagexml/METS.java#L69, where the parent of the relative path mets.xml evaluates to null.
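To illustrate the null parent, here is a minimal, self-contained sketch. It is not taken from the project's METS.java: the class and method names (`RelativeParentDemo`, `parentOf`) are hypothetical, and it assumes the workspace directory is derived by calling `getParent()` on the METS path. `java.io.File.getParent()` behaves the same way for a bare file name.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class RelativeParentDemo {
    // Hypothetical helper: returns the parent directory of a path string,
    // or null if the path has no parent component.
    static Path parentOf(String path) {
        return Paths.get(path).getParent();
    }

    public static void main(String[] args) {
        // A bare file name has no parent component, so getParent() is null.
        System.out.println(parentOf("mets.xml"));          // prints "null"
        // An absolute path does have a parent directory.
        System.out.println(parentOf("/data/ws/mets.xml") != null);  // prints "true"
    }
}
```

Any later call on that null parent, such as resolving a file name against it, would then throw a NullPointerException like the one in the trace.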

So IMO the best fix would be to replace https://github.com/cisocrgroup/ocrd-postcorrection/blob/49decc4b9b2f38a16c49ff3b3be36a708a4d5077/src/main/java/de/lmu/cis/ocrd/pagexml/METS.java#L102 with the current working directory if workspace is indeed empty.
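A rough sketch of what such a fallback could look like, under stated assumptions: `WorkspaceDir`, `workspaceOf` and `resolveInWorkspace` are hypothetical names, not the project's API, and the empty path `Paths.get("")` is used to stand for the current working directory so that resolved file references stay relative.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class WorkspaceDir {
    // Hypothetical helper: the directory containing the METS file.
    // Falls back to the empty path (interpreted as the current working
    // directory on file access) when the METS path has no parent.
    static Path workspaceOf(Path metsPath) {
        Path parent = metsPath.getParent();
        return parent != null ? parent : Paths.get("");
    }

    // Hypothetical helper: resolve a file reference relative to the workspace.
    static Path resolveInWorkspace(Path metsPath, String href) {
        return workspaceOf(metsPath).resolve(href);
    }

    public static void main(String[] args) {
        // Relative METS path: no NullPointerException, result stays relative.
        System.out.println(resolveInWorkspace(Paths.get("mets.xml"), "OCR-D-GT/page.xml"));
        // Absolute METS path: resolved against the METS file's directory.
        System.out.println(resolveInWorkspace(Paths.get("/data/ws/mets.xml"), "OCR-D-GT/page.xml"));
    }
}
```

Using `Paths.get("").toAbsolutePath()` instead would also avoid the crash, but it would produce absolute paths again, which is what the first part of this report calls incorrect for output FLocats.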

bertsky commented 7 months ago

Oh, the original problem (the incorrect output FLocats) seems to be not about absolute paths but about the missing @LOCTYPE="OTHER" and @OTHERLOCTYPE="FILE" attributes.