Page count for TIF/PDF in request for info.json

cantaloupe-project / cantaloupe

High-performance dynamic image server in Java

https://cantaloupe-project.github.io/

Page count for TIF/PDF in request for info.json #344

Closed barser closed 3 years ago

barser commented 4 years ago

Hello, team!

Please consider adding a section to info.json with the page count for multi-page formats such as TIF/PDF.

I believe it could be very useful in some scenarios.

Thank you!

adolski commented 4 years ago

Hi @barser,

Most of the information in info.json is page-specific. Different pages may have different dimensions, tile sizes, etc. So, adding a pageCount key wouldn't really be "correct."

I think it would be OK to add the page count to the delegate object context. You could then expose it in info.json using the extra_iiif_information_response_keys() delegate method. I'd worry about clients becoming dependent on that, though.

For a more correct solution, maybe Cantaloupe could do something like what it does with scale constraints and recognize a special suffix on the identifier to indicate a page number, instead of using a page query argument. That wouldn't give you the page count, but it would ensure that all of the pages' info.jsons are correct. The page count could be exposed as described above.

But what does a client do with pageCount? It would need to know that the identifier can be manipulated in a certain way to get a different page. Maybe instead there should be links to all of the other pages in every page's info.json.

I would be interested to know if IIIF has offered guidance on this use case. I encourage you and others with the same use case to voice it to the designers of the Image API.

cmhdave commented 4 years ago

In IIIF 3.0 there is a partOf linking property which I presume would link to a Presentation API manifest which contains all the pages of the TIF/PDF. I assume that is the direction this would take (?)

https://iiif.io/api/image/3.0/#58-linking-properties

DiegoPino commented 4 years ago

@adolski we assume Cantaloupe deals only with the Image API, which is great and more than good enough. So for the PDF use case (and the cool fact that we can request pages of PDFs via this server) we use pdfinfo to extract the number of pages upfront for the file served and store it side by side in our metadata. Using that, we build our IIIF Presentation API manifest, appending the page number as an argument to the same file name. It's quite simple really; I'm pretty sure most people using Cantaloupe's PDF capabilities do something similar.
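A minimal sketch of the pdfinfo step described above, assuming the Poppler pdfinfo command-line tool is on the PATH and prints a "Pages:" line. The function names and the base URL passed in are invented for illustration; this is not the code the commenter uses.

```python
import re
import subprocess
from typing import List


def parse_page_count(pdfinfo_output: str) -> int:
    """Extract the page count from the text that pdfinfo prints."""
    match = re.search(r"^Pages:\s+(\d+)\s*$", pdfinfo_output, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no 'Pages:' line found in pdfinfo output")
    return int(match.group(1))


def pdf_page_count(path: str) -> int:
    """Run pdfinfo on a file and return its page count."""
    result = subprocess.run(
        ["pdfinfo", path], capture_output=True, text=True, check=True
    )
    return parse_page_count(result.stdout)


def page_image_urls(base_url: str, page_count: int) -> List[str]:
    """Build one full-size image URL per page, using the ?page=N argument.

    base_url is the image's base URI (hypothetical here), without a
    trailing slash.
    """
    return [
        f"{base_url}/full/full/0/default.jpg?page={n}"
        for n in range(1, page_count + 1)
    ]
```

The page count would be computed once, when the file is ingested, and stored with the rest of the metadata, so that building the manifest does not have to open the PDF again.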

mitring commented 4 years ago

Hi @adolski,

For a more correct solution, maybe Cantaloupe could ... recognize a special suffix on the identifier to indicate a page number, instead of using a page query argument

It would be very useful! Moreover, such a compound identifier would not violate the IIIF Image API identifier specification, unlike the page GET parameter.

If I place a link https://{server}/iiif/2/{id}?page={n} to page n of the PDF with ID id in the IIIF Presentation API manifest, then viewers like Mirador or UniversalViewer generate requests for the page's metadata like GET https://{server}/iiif/2/{id}?page={n}/info.json and for the page's tiles like GET https://{server}/iiif/2/{id}?page={n}/full/,165/0/default.jpg, which is completely wrong.

But if the link to page n of the PDF with ID id looked like, for example, https://{server}/iiif/2/{id}_p{n} or https://{server}/iiif/2/{id}_page_{n}, then it would satisfy the IIIF Image API and viewers would generate correct links.
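To illustrate the problem, here is a rough sketch of what a viewer effectively does with a service @id: it appends path segments to the string as given. The host name and identifiers are placeholders, and the _p{n} suffix is only the hypothetical form suggested above, not something Cantaloupe is known to recognize.

```python
from urllib.parse import urlsplit


def info_json_url(service_id: str) -> str:
    """Derive the information URL the way Image API clients do:
    by appending '/info.json' to the image's base URI."""
    return service_id.rstrip("/") + "/info.json"


# Page number passed as a query argument.
with_query = info_json_url("https://example.org/iiif/2/doc.pdf?page=3")
# Page number folded into the identifier (hypothetical suffix form).
with_suffix = info_json_url("https://example.org/iiif/2/doc.pdf_p3")

query_parts = urlsplit(with_query)
suffix_parts = urlsplit(with_suffix)

# Show which part of each URL the server would see as path and as query.
print(query_parts.path, "|", query_parts.query)
print(suffix_parts.path, "|", suffix_parts.query)
```

In the first case the path no longer ends in info.json and the page argument arrives mangled as page=3/info.json; in the second, info.json stays in the path where the URI template expects it.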

mitring commented 4 years ago

Hi @DiegoPino,

... we build our IIIF Presentation API Manifest, appending to the same file name the page number as argument

Could you share an example of your Presentation API manifest, please? When I create a manifest with page numbers as GET parameters, viewers can't process the links correctly; see the example in this post.

giancarlobi commented 4 years ago

@mitring I'm not as skilled as @DiegoPino, but our Archipelago Presentation API manifest probably works because the resource id includes the page number as a GET argument, like this:

"canvases": [
  {
    "@id": "http://archipelago.byterfly.eu/node/29/iiif/b14b588e-c335-4df7-ae6d-3ba2a831c714/canvas/p1",
    "@type": "sc:Canvas",
    "label": "p. 1",
    "width": 3,
    "height": 4,
    "images": [{
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "resource": {
        "@id": "http://archipelago.byterfly.eu/iiif-server/iiif/2/9d8%2Fapplication-conf16-selectedpapers-11-ceregato-et-al-b14b588e-c335-4df7-ae6d-3ba2a831c714.pdf/full/full/0/default.jpg?page=1",
        "@type": "dctypes:Image",
        "format": "image/jpeg",

mitring commented 4 years ago

Hi @giancarlobi,

Thank you for your reply! But in your example you specify a link to a concrete full-size image in the @id attribute. I don't claim that it's wrong, but I want to specify a link from which Image API-compatible viewers can derive links to different tiles, possibly rotated or greyscaled. For example:

{
    "@id": "http://{server}/iiif/2/2722/canvas/page09",
    "@type": "sc:Canvas",
    "label": "Page 9",
    "width": 1240,
    "height": 1754,
    "images": [
        {
            "@id": "http://{server}/iiif/2/2722/annotation/page09",
            "@type": "oa:Annotation",
            "motivation": "sc:painting",
            "resource": {
                "@id": "http://{server}/iiif/2/2722?page=9",
                "@type": "dctypes:Image",
                "format": "image/jpeg",
                "width": 1240,
                "height": 1754,
                "service": {
                    "@context": "http://iiif.io/api/image/2/context.json",
                    "@id": "http://{server}/iiif/2/2722?page=9",
                    "profile": "http://iiif.io/api/image/2/level2.json"
                }
            },
            "on": "http://{server}/iiif/2/2722/canvas/page09"
        }
    ]
}

And in that case the page parameter breaks compatibility with the IIIF Image API specification.

giancarlobi commented 4 years ago

@mitring you are right; maybe @DiegoPino has some more notes to add. IMHO your idea of a suffix style like {id}_page_{n} would be very useful, as @adolski also noted in his post.

DiegoPino commented 4 years ago

@mitring sorry, late to the party here. Time zone difference!

As @giancarlobi correctly said, we do generate the IIIF manifests for PDFs via this property. But as you have clearly noticed too, the API specs on one side, and the pretty liberal interpretation of them by each client (or viewer) on the other, make this quite complex to handle. I can not remember where in the specs (if you can point me to it it would be great), it says that URL arguments are not allowed. I was under the impression that, given the original nature of the manifests (in v2.x it was clearly JSON-LD; now in 3.0 it is more "it is up to you how to interpret it, pure JSON or not"), any @id just needs to be a valid IRI. One problem was that a few versions ago (fixed now by @adolski!) GET arguments were not passed to the info.json, so page increments would not deliver a new size and actual image URL with every ?page argument change. That is fixed in the latest version in the 4.1.x series here.

For that reason and others (clients and viewers each doing things differently), we decided to build different types of dynamically generated IIIF manifests (v3) depending on the need. Specifically, the PDF one that uses the page argument serves images directly, without a service definition, to avoid this whole problem. So it is still spec compliant but, yes, no black-and-white or rotation is possible.

But, that said, we have another ongoing discussion, with some boilerplate code that is specific to our needs but can really be applied to any local solution: URL wrapper logic around this to make API clients happy and, of course, Cantaloupe too. Here is the comment. NOTE: see in the same issue discussion also how Mirador 3 has fixed the lack of static image support!

Basically we (I) have planned a few proxy endpoints/URLs that wrap the Cantaloupe ones and do exactly what @adolski suggests: move arguments into IDs. Locally those are then split and processed, an internal call to Cantaloupe is made, and the resulting JSON is altered to get a correct info.json that is capable of what you need.
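The splitting step of such a proxy could look roughly like the sketch below. It assumes the hypothetical _p{n} identifier suffix floated earlier in this thread and an /iiif/2/ prefix; the function name is invented, and altering the returned info.json is left out.

```python
import re
from urllib.parse import urlencode

# Hypothetical public form: /iiif/2/{id}_p{n}[/{rest of the Image API path}]
_PAGE_SUFFIX = re.compile(
    r"^(?P<prefix>/iiif/2/)(?P<id>[^/]+?)_p(?P<page>\d+)(?P<rest>/.*)?$"
)


def to_upstream(path: str) -> str:
    """Rewrite a public path with a page suffix into the
    ?page=N form that is forwarded to the image server."""
    match = _PAGE_SUFFIX.match(path)
    if match is None:
        return path  # no page suffix: forward unchanged
    rest = match.group("rest") or ""
    query = urlencode({"page": int(match.group("page"))})
    return f"{match.group('prefix')}{match.group('id')}{rest}?{query}"
```

For example, to_upstream("/iiif/2/doc.pdf_p3/info.json") returns "/iiif/2/doc.pdf/info.json?page=3", so viewers only ever see path-based URIs while the image server still receives the query argument it understands.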

I know it sounds like a hack, but on the other hand it gives us (or, put differently, will give us once it is done) more control over this and any other arguments we might need to pass in the ID.

I feel a good way of doing almost the same directly in Cantaloupe, if you don't want your own proxy preprocessing Cantaloupe's endpoints, would be a request preprocessor built on the same delegate system (Ruby): a way of processing/splitting IDs before the actual call is made, and then calling Cantaloupe again from inside the delegate. Not sure if I explain myself: something like a pre-handler for the request. That way the ID and how it is formed (with an extra ?page or whatever you want) can be customized by each implementation and then routed back to a normal Cantaloupe endpoint (e.g. with a ?page at the end of the URI).

Side note: I feel there is a larger issue in how the specs expect properties like size (width/height) to always be present, versus the fact that they also depend on the info.json/service, given that the latter can provide them (or their proportions). That already makes our dynamic IIIF manifest generation quite processing-heavy, and me not happy.

mitring commented 4 years ago

Hi @DiegoPino,

Thanks for your detailed answer! Here are my "five pennies" on some statements.

I can not remember where in the specs (if you can point me to it it would be great), it says that URL arguments are not allowed

There is no strict prohibition, but it's clear from the context of Chapter 2 of the IIIF Image API. Here are some examples:

The IIIF Image API can be called in two ways ... Both convey the request’s information in the path segments of the URI, rather than as query parameters.

... image’s base URI ... constructed according to the following URI Template: {scheme}://{server}{/prefix}/{identifier}

The IIIF Image API URI for requesting an image MUST conform to the following URI Template: {scheme}://{server}{/prefix}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}

The URI for requesting image information MUST conform to the following URI Template: {scheme}://{server}{/prefix}/{identifier}/info.json.

So there is simply no place for GET parameters, and a URI like http://{server}/iiif/2/{id}?page={n} is treated by viewers (and the Image API specs) as a link to an image with identifier {id}?page={n}. So viewers, following the Image API specs, correctly try to retrieve information about the image with a GET http://{server}/iiif/2/{id}?page={n}/info.json request, which fails.

One problem was that a few versions ago (fixed now by @adolski!) GET arguments were not passed to the info.json, so page increments would not deliver a new size and actual image URL with every ?page argument change. That is fixed in the latest version in the 4.1.x series here.

Yes, I also noticed that the page argument does not affect the content of info.json, but that fix is in version 4.1.6, which isn't released yet.

An URL wrapper logic around this to make API client happy and of course also Cantaloupe ... Basically we (i) have planned for a few proxy endpoint/URLS that wrap cantaloupe ones and do exactly what @adolski suggests ... I know it sounds like a hack ...

I am also working in this direction now, trying to set up URL rewriting in Nginx, which proxies requests to Cantaloupe. But you are right, this is a hack :)

Not sure if i explain myself, like a pre handler for the request.

I got the idea, thank you. It sounds cool. Moreover, we already have something like that: ScriptLookupStrategy for sources. For example, S3Source with ScriptLookupStrategy converts the id from the URI into a bucket and object key in S3 storage. If we could add information about the page number to the result of the converter's method call, the issue would be resolved.

DiegoPino commented 4 years ago

Hi @mitring, thanks. Yes, it's pretty much the same use case we have.

I feel this statement here, which is the one that really complicates the issue, is wrong in terms of how a URI, its arguments and the protocol work (RFC specs):

So there is simply no place for GET parameters, and a URI like http://{server}/iiif/2/{id}?page={n} is treated by viewers (and the Image API specs) as a link to an image with identifier {id}?page={n}. So viewers, following the Image API specs, correctly try to retrieve information about the image with a GET http://{server}/iiif/2/{id}?page={n}/info.json request, which fails.

My take is that ?page is a GET argument and can not/should never be made part of the ID. The ID is part of the path and is processed via a pattern. Even in cases where you have servers set up (like we do in PHP) to convert GET arguments into slash-separated path segments, everything after the ? should either be processed differently or, worst case, discarded. Web servers do that, Nginx will do that, even JS would do that, so why would a spec not do that? What I am saying is that either viewers are getting this wrong or the spec is not explicit enough.

I got the idea, thank you. It sounds cool. Moreover, we already have something like that: ScriptLookupStrategy for sources. For example, S3Source with ScriptLookupStrategy converts the id from the URI into a bucket and object key in S3 storage. If we could add information about the page number to the result of the converter's method call, the issue would be resolved.

Yes, I agree. I wish there were other options, but they could be complex to enforce in simple code implementations (where calling a URI excels). For instance, instead of GET arguments we could use HTTP headers, but then there is no way to pass headers from a manifest! That is another issue with a purely URI-based document, which also means we need to be able to use GET. We have used headers many times when we needed backend authentication to retrieve images from their source, but never from inside a manifest, of course.

My conclusion is: it's a little bit complex to demand this change here, and I'm not even sure I could make the case (like asking "please!") without also asking client writers/viewer implementers and the IIIF API specs committee to clarify what place GET has in their API.

cmhdave commented 4 years ago

I've been doing a little research into this particular issue. If I may, I would like to suggest following the kind of naming convention defined for URNs: https://en.wikipedia.org/wiki/Uniform_Resource_Name

A URN is defined as urn:<NID>:<NSS>. We would not necessarily prefix it with urn: or, in our case, with <NID>, since that doesn't necessarily make sense. But the <NSS> is our identifier, with a sub-delim separating the page number. The sub-delims are defined as: sub-delims = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "=". The two that make the most sense to me are , or ;.
An identifier could be /iiif/2/item.pdf,3/full/full/0/default.jpg or /iiif/2/item.pdf;3/full/full/0/default.jpg

Thoughts?

hrvoj3e commented 4 years ago

@mitring

And in such case page parameter breaks the compatibility with IIIF Image API specification.

My solution was to use Level 0 support in the viewer (Mirador 3 in my case). A resource without a service does not break IIIF API compatibility.

{
  "@id": "https:\/\/my.server\/ein3ft\/ri-138226\/canvas-st16410385-page-3",
  "@type": "sc:Canvas",
  "height": 1600,
  "width": 1600,
  "label": "compressed.tracemonkey-pldi-09.pdf",
  "thumbnail": {
    "@id": "https:\/\/my.server\/image416\/iiif\/2\/fgfg5h4f%2Fmain%2Fr%2Fop%2F71u%2Frop71u2n42gj.pdf\/full\/,100\/0\/default.jpg?page=3"
  },
  "images": [
    {
      "@type": "oa:Annotation",
      "motivation": "sc:painting",
      "on": "https:\/\/my.server\/ein3ft\/ri-138226\/canvas-st16410385-page-3",
      "resource": {
        "@id": "https:\/\/my.server\/image416\/iiif\/2\/fgfg5h4f%2Fmain%2Fr%2Fop%2F71u%2Frop71u2n42gj.pdf\/full\/full\/0\/default.jpg?page=3",
        "@type": "dctypes:Image",
        "format": "pdf",
        "height": 1600,
        "width": 1600
      }
    }
  ]
}

cmhdave commented 3 years ago

I'm tinkering with something for this in my fork that will allow a user to set a property called page_number.delimiter in the cantaloupe.properties file. If they set this property, the PublicResource abstract class will look for the page number in the Identifier URI path component rather than from a query string parameter. The goal is to allow you to do something like this:

page_number.delimiter = ;p

And you would reference page numbers in a PDF via this format: /iiif/2/filename.pdf;p12/full/full/0/default.jpg

If you leave the property in cantaloupe.properties blank, it works just as it does today with ?page=12

I still have some test failures to work through and I want to test more scenarios, but I think it will work. I based most of the code on the ScaleConstraint code, which works in pretty much the same way. It should work just fine with both settings.

adolski commented 3 years ago

@cmhdave, it looks like we started working on this around the same time. 😄

The identifier path component needs to support three things, currently: an identifier, a page number, and/or a scale constraint, and the image server needs to be able to transform it not only from its component parts, but also to them (in order to support generating URIs).

Version 4.1 already supports a scale constraint suffixed to an identifier. I'm thinking that I will phase out the "suffix" terminology and replace it with the concept of a "meta-identifier" which consists of those components. So, it can be said that the "identifier path component" may contain either an identifier or a meta-identifier.

As for how a meta-identifier is formatted, there are two main options, configurable via a meta_identifier.transformer key:

- The StandardMetaIdentifierTransformer suffixes a page number and/or scale constraint to the identifier, similar to how the scale constraint works now. This transformer supports a meta_identifier.transformer.StandardMetaIdentifierTransformer.delimiter configuration key, with which the delimiter/separator is configurable. By default, the meta-identifier of page 3 of a PDF would look like: document.pdf;3 (props to @cmhdave for the idea of URN-compliant identifiers)
  - I also like @cmhdave's idea of little prefixes to the non-identifier components (like file.pdf;p3;s1:2) in case any more components come up in the future that would introduce ambiguity. I haven't implemented that yet, though.
- The DelegateMetaIdentifierTransformer enables full control over the transformation via two new delegate methods: deserialize_meta_identifier(String) and serialize_meta_identifier(Hash<String,Object>). There is also a new page_number key in the delegate context to accompany the identifier and scale_constraint keys that were already there. This transformer is sort of based on @DiegoPino's idea above.
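To make the default format concrete, here is a rough sketch of splitting and re-joining a meta-identifier. It is not Cantaloupe's implementation, and it is written in Python rather than as a delegate script. The function names simply mirror the delegate methods named above; the order of the suffixes (page number first, then a scale constraint written as numerator:denominator) is an assumption for illustration.

```python
import re
from typing import Any, Dict

DELIMITER = ";"  # assumed default; the delimiter is described as configurable

# identifier, then an optional page number, then an optional scale constraint
_PATTERN = re.compile(
    r"^(?P<identifier>.+?)"
    rf"(?:{re.escape(DELIMITER)}(?P<page>\d+))?"
    rf"(?:{re.escape(DELIMITER)}(?P<scale>\d+:\d+))?$"
)


def deserialize_meta_identifier(meta_identifier: str) -> Dict[str, Any]:
    """Split a meta-identifier into identifier, page number and scale constraint."""
    match = _PATTERN.match(meta_identifier)
    if match is None:
        raise ValueError(f"not a valid meta-identifier: {meta_identifier!r}")
    page = match.group("page")
    scale = match.group("scale")
    return {
        "identifier": match.group("identifier"),
        "page_number": int(page) if page else None,
        "scale_constraint": [int(n) for n in scale.split(":")] if scale else None,
    }


def serialize_meta_identifier(parts: Dict[str, Any]) -> str:
    """Join the components back into a single path-component string."""
    result = str(parts["identifier"])
    if parts.get("page_number") is not None:
        result += f"{DELIMITER}{parts['page_number']}"
    if parts.get("scale_constraint") is not None:
        numerator, denominator = parts["scale_constraint"]
        result += f"{DELIMITER}{numerator}:{denominator}"
    return result
```

Both directions are needed because the server has to parse incoming identifier path components and also generate URIs from their parts. Note that without prefixes a lone numeric suffix is always read as a page number here, which is the kind of ambiguity the p/s prefix idea would remove.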

That is the meat of it, I think. I'm open to feedback on this approach. I tried to come up with a solution that is simple out-of-the-box but offers precise control when needed.

cmhdave commented 3 years ago

I'm honestly happy that you are tackling this @adolski. Even though what I have is working I wasn't confident that I didn't break anything with the scale constraint. (The tests I wrote with the different variations passed but who knows what I might have broken in a live environment). That and I couldn't shake the feeling that the way I was doing it was kind of "hacky" and yours sounds like a more robust solution. I look forward to trying yours out!

adolski commented 3 years ago

The meta-identifier feature is on develop now. I hope it works. 😰 Good luck!

No worries @cmhdave, I felt the same way as I was working on this. The "scale constraint suffix" stuff was hacky to begin with. Hopefully what's in place now is a little bit better. Also, in the end, it was a lot more work than I thought it would be.

adolski commented 3 years ago

This thread was originally talking about a page count in information responses. I don't want to add that by default, but there is now a page_count key in the delegate context. You can implement extra_iiif_information_response_keys() and do whatever you want with it.

I'm going to close this issue as I think it's done now, more or less, but feel free to reopen if you find otherwise.