YaleDHLab / pix-plot

A WebGL viewer for UMAP or TSNE-clustered images
MIT License

IIIF Import: First Test #174

Closed rodighiero closed 3 years ago

rodighiero commented 3 years ago

I started to play with the IIIF import using a sample of 1,000 items, and I have one issue and one remark:

I include the URL list; all the records I checked seem correct. To be completely transparent, these are the oldest records of the HAM dataset, so their accuracy may be approximate.

Happy to double-check if it is useful!

Thanks, Dario

IIIF1000.txt

duhaime commented 3 years ago

@rodighiero Many thanks for reaching out, and sorry it's taken us a little while to get back to you!

The issue you have run into is interesting. It looks like some of these links present their images in a format we're not yet parsing...

To demystify what's going on here a little: If one provides a IIIF input to PixPlot, we process that file using iiif_downloader.Manifest( manifest_url ).save_images(limit=1). That function will just grab the first sequence in the specified manifest url, then grab the first canvas in that sequence, then grab the first image on that canvas, then fetch that image's resource.@id property to determine the image's url.

The catch is that some of the manifests in your sample file don't have any canvases inside of the sequences attribute. The first manifest in your file, for example, doesn't have any canvases. So the downloader has nothing to download!
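A minimal sketch of that lookup, for illustration only. This is not the actual iiif_downloader code: the function name `first_image_url` and the two sample manifests are hypothetical, and it assumes the IIIF Presentation 2.x layout (sequences, then canvases, then images, then resource.@id).

```python
from typing import Optional


def first_image_url(manifest: dict) -> Optional[str]:
    """Return the first image resource @id found in a IIIF Presentation 2.x
    manifest, or None when no sequence contains a canvas with an image."""
    for sequence in manifest.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            for image in canvas.get("images", []):
                url = image.get("resource", {}).get("@id")
                if url:
                    return url
    return None


# Hypothetical manifest with one canvas and one image.
with_canvas = {
    "sequences": [
        {
            "canvases": [
                {
                    "images": [
                        {"resource": {"@id": "https://example.org/iiif/abc/full/full/0/default.jpg"}}
                    ]
                }
            ]
        }
    ]
}

# Hypothetical manifest whose sequence has no canvases, like the case described above.
no_canvas = {"sequences": [{"canvases": []}]}

print(first_image_url(with_canvas))
print(first_image_url(no_canvas))
```

Returning None instead of raising lets a caller skip manifests without canvases and keep processing the rest of the list.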

This is what I'm confused about. The rendering.@id attribute in that manifest links to this page, which shows the image. The image that's displayed on that page is from url https://ids.lib.harvard.edu/ids/iiif/14103659/full/256,/0/default.jpg. But I can't tell how they get from that 1412 id to the image url.

Investigating the page source that renders the image, I see:

<div class="osd" id="osd_14103659" ...></div>

Is that transformation from id 1412 to 14103659 happening on a web server? Or is there a way to get to the latter id using just the manifest data?
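For reference, the image url above follows the IIIF Image API pattern {identifier}/{region}/{size}/{rotation}/{quality}.{format}. The sketch below splits a url of that shape into its parts; the helper name `parse_iiif_image_url` is hypothetical, and it assumes the last five path segments are exactly those components. It shows that 14103659 is the image identifier, which is the value that can't be derived from the manifest.

```python
from urllib.parse import urlparse


def parse_iiif_image_url(url: str) -> dict:
    """Split a IIIF Image API url into its components, assuming the path ends
    with identifier/region/size/rotation/quality.format."""
    parts = urlparse(url).path.rstrip("/").split("/")
    if len(parts) < 5:
        raise ValueError("not enough path segments for a IIIF image url")
    identifier, region, size, rotation, filename = parts[-5:]
    quality, _, fmt = filename.rpartition(".")
    return {
        "identifier": identifier,
        "region": region,
        "size": size,
        "rotation": rotation,
        "quality": quality,
        "format": fmt,
    }


parsed = parse_iiif_image_url(
    "https://ids.lib.harvard.edu/ids/iiif/14103659/full/256,/0/default.jpg"
)
print(parsed["identifier"])  # 14103659
print(parsed["size"])        # 256,
```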

Looking at the IIIF docs on the rendering attribute (I don't see it documented in the 2.0 Presentation API Harvard is using, only in the 3.0 documentation), it seems like the rendering property designates a link to an image that isn't rendered by IIIF?

@jeffsteward or @rsinghal if either of you (or other team members!) might have a sense as to how the IIIF manifest discussed above resolves to the viewing page discussed above, I would be super grateful for any insights you can offer!

duhaime commented 3 years ago

@rodighiero I wanted to follow up briefly on your second note above as well. Parsing the metadata from the IIIF images is a really great idea! The catch we've encountered so far is the fact that those metadata fields are essentially schemaless--the IIIF spec places little restriction on the data that can be articulated within the metadata for an image, so it's a little hard to know how best to parse that metadata in a way that would be useful across various collections...

Right now our recommendation is that users parse those manifests and collect their metadata in CSV format for ingestion into a plot. I can imagine a world in which we fetch the first manifest in a list of manifests, examine the metadata keys associated with that manifest, and prompt the user to identify e.g. which of the metadata keys corresponds to a date so we can create a temporal layout of the images. We haven't built this out yet but are certainly open to pull requests if you think this sounds like a promising path forward?
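As a rough sketch of that manual step, the code below flattens the metadata label/value pairs of already-loaded IIIF Presentation 2.x manifests into CSV text. Everything here is an assumption for illustration: the function names, the `manifest_id` column, and the choice to join multiple values with "; ". It does not produce the columns PixPlot expects, so the output would still need to be mapped onto PixPlot's own metadata format.

```python
import csv
import io


def _as_text(value) -> str:
    """Flatten a IIIF label/value (string, language map, or list) to plain text."""
    if isinstance(value, dict):
        return str(value.get("@value", ""))
    if isinstance(value, list):
        return "; ".join(_as_text(v) for v in value)
    return "" if value is None else str(value)


def metadata_row(manifest: dict) -> dict:
    """Build one row from a manifest's metadata list, keyed by label."""
    row = {"manifest_id": manifest.get("@id", "")}
    for entry in manifest.get("metadata", []):
        label = _as_text(entry.get("label"))
        if label:
            row[label] = _as_text(entry.get("value"))
    return row


def manifests_to_csv(manifests: list) -> str:
    """Write one row per manifest, using the union of all labels as columns."""
    rows = [metadata_row(m) for m in manifests]
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Because the metadata is schemaless, the columns are simply the union of whatever labels appear; a manifest that lacks a label gets an empty cell in that column.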

jeffsteward commented 3 years ago

@duhaime I can answer questions about HAM's services. Hopefully this will make sense.

The HAM IIIF manifest service is merely a transformation of data in our primary API. Our primary API includes every bit of data about our collection and is used to render the pages linked to in rendering.@id. Our primary API has a thin layer of access control, which means we allow access to rights-restricted image URLs on HAM's public website but strip those URLs out of our services for general users. That is why you'll encounter manifests with no canvases.

duhaime commented 3 years ago

@jeffsteward Thanks very much for your follow-up! This unlocks the mystery!

If a user like @rodighiero were to join a Harvard VPN, would they have access to those rights-restricted images via the IIIF canvases? Or is there another way for a community member up there to gain access to the full dataset via IIIF endpoints?

rodighiero commented 3 years ago

Hi @jeffsteward, I think that @duhaime identified the problem. There is no canvas for some of the images, which seems to be a problem only for the very first records. Can you confirm that?

Thank you both

duhaime commented 3 years ago

I think this is correct @rodighiero! My hunch is that if you were to meet with Jeff, he could help you gain access to those IIIF images on the server that require authentication.

@rodighiero Shall we close out this issue and reopen it if there are other issues with the IIIF processing?

rodighiero commented 3 years ago

Sure, thanks!