PRImA-Research-Lab / prima-page-viewer

Java based viewer for PAGE XML files (layout + text content). Also supports ALTO XML, FineReader XML, and HOCR.
Apache License 2.0

Support absolute file paths in imageFilename #11

Closed: mikegerber closed this issue 4 years ago

mikegerber commented 4 years ago

Given this PAGE-XML with an absolute image filename:

<?xml version="1.0" encoding="UTF-8"?>
<PcGts xmlns:xsl="http://www.w3.org/1999/XSL/Transform#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
  <Metadata>
    <Creator>OCR-D/core 2.4.2</Creator>
    <Created>2020-03-05T16:35:21</Created>
    <LastChange>2020-03-05T16:35:21</LastChange>
    <MetadataItem type="processingStep" name="preprocessing/optimization/binarization" value="ocrd-olena-binarize">
      <Labels>
        <Label value="101" type="win-size"/>
        <Label value="sauvola-ms-split" type="impl"/>
        <Label value="0.34" type="k"/>
      </Labels>
    </MetadataItem>
  </Metadata>
  <Page imageFilename="/srv/digisam_images/sbb/PPN719671574/00000420.tif" imageWidth="1479" imageHeight="2232" type="content">
    <AlternativeImage filename="OCR-D-IMG-BIN/FILE_0420_OCR-D-IMG-BINPAGE-BIN_sauvola-ms-split.png" comments="binarized"/>
  </Page>
</PcGts>

PAGE Viewer does not open the image given in imageFilename. I also tested prepending file:// with the same result. It would be nice if absolute filenames were supported :)

(I also copied the file to foo.tif and changed the PAGE XML to imageFilename="foo.tif"; this worked.)

chris1010010 commented 4 years ago

Hi, the filename in the XML is intended to hold just the name (incl. file extension), not the full path. That's why it's not supported. I guess we could add a check for a full path anyway. But you do have the option to pass the image (or the folder to look in) as a command-line parameter.
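
For illustration, a check for a full path could look roughly like the Java sketch below. This is not PAGE Viewer's actual code: the class `ImagePathResolver`, its `resolve` methods, and the fallback order (absolute path first, then relative to the folder containing the PAGE XML) are all assumptions made for this example.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;
import java.util.function.Predicate;

/** Hypothetical helper for this example; not part of PAGE Viewer. */
public class ImagePathResolver {

    /**
     * Tries to turn the value of the imageFilename attribute into a usable path.
     *
     * @param imageFilename value of Page/@imageFilename (may be a bare name,
     *                      a relative path, or an absolute path)
     * @param pageXmlDir    folder that contains the PAGE XML file
     * @param exists        test deciding whether a candidate path is usable
     */
    public static Optional<Path> resolve(String imageFilename, Path pageXmlDir,
                                         Predicate<Path> exists) {
        if (imageFilename == null || imageFilename.isEmpty()) {
            return Optional.empty();
        }
        // Note: Paths.get throws InvalidPathException for malformed input.
        Path candidate = Paths.get(imageFilename);

        // Case 1: absolute path, use it as given.
        if (candidate.isAbsolute()) {
            return exists.test(candidate) ? Optional.of(candidate) : Optional.empty();
        }

        // Case 2: bare name or relative path, look next to the PAGE XML file.
        Path relative = pageXmlDir.resolve(candidate).normalize();
        return exists.test(relative) ? Optional.of(relative) : Optional.empty();
    }

    /** Convenience overload that checks the real file system. */
    public static Optional<Path> resolve(String imageFilename, Path pageXmlDir) {
        return resolve(imageFilename, pageXmlDir, Files::isRegularFile);
    }
}
```

A caller would pass the attribute value and the PAGE file's folder, e.g. `ImagePathResolver.resolve(imageFilename, pageFile.getParent())`, and fall back to the image or folder given on the command line when the result is empty. The `exists` parameter is only there so the lookup can be exercised without touching the disk.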

mikegerber commented 4 years ago

I'm not sure why it's defined/supported as just the filename rather than an absolute/relative file path, but I can explain what my use case is:

For me, it's great if the PAGE file can just point to the correct image file so that the document opens right away in PAGE Viewer. I work with PAGE files and PAGE Viewer a lot, so this is a big time saver and greatly improves usability. In this case, the PAGE file points to one of the five million source images on our library's file system.

Relative paths work fine, e.g. "TEST/foo.tif".

(I know about --resolv-dir, but that doesn't help in this case.)

chris1010010 commented 4 years ago

I made a change. Have a look to see if it works (only tested on Windows).

mikegerber commented 4 years ago

Wonderful, version 1.4.04 works :) Thanks for the update!

I tested the 64-bit Linux version.