Serve Images directly from a ZIP file

DiegoPino commented 4 years ago

This is probably a weird use case. In our digital repository we want to allow uploads of large(ish) ZIP files, each containing a set of image files, and ideally treat each one as a package without expanding it (imagine a book of 2,000 pages). Reading the file names from the ZIP file and serving a single image on request (e.g. in PHP) is much faster, and simpler in terms of metadata and mobility of our data, than our current alternative of extracting all images into a local folder and treating them as individual, first-level media sources.

So I am wondering: given a ZIP file and a filename, would writing a ZIP processor for Cantaloupe (almost a middleware type of plugin) be a very heavy task? Would performance degrade heavily? I imagine something like the PDF page access, where you pass a page number; in this case it would be a filename instead. Cantaloupe would fetch the image from the ZIP file on the fly and then follow its normal workflow. Handling a direct info.json request without any argument could probably be an issue.

I am asking in case this is something that I/we could implement without hacking this wonderful system in a bad way.

adolski commented 4 years ago

Hi @DiegoPino,

This is a weird use case for sure, but also interesting. Let me just think out loud about it.

The URI contains an identifier which locates the zip file. (We can infer that it's a zip file the same way we infer any other format, so that's not a problem.)

Additionally, we need to know the name of the file to extract from within the archive, and provide it to the Source somehow. Yeah, it could come from a query argument, but... I don't like query arguments. :smile: (page and time are only there to work around limitations of the Image API.)

Another possibility is that the filename is specified in the identifier, like archive.zip-image01.jpg. Then if you were using FilesystemSource, you could implement filesystemsource_pathname() to split on the hyphen in order to provide both parts to FilesystemSource. (Same idea for the other Sources.)
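As a rough illustration of the identifier-splitting idea, here is a minimal sketch. It is written in Python for brevity rather than as Cantaloupe delegate code; the helper name `split_identifier` and the `.zip-` marker are assumptions made for this example. A bare hyphen is ambiguous because file names often contain hyphens, so the sketch splits on the first `.zip-` instead.

```python
def split_identifier(identifier, marker=".zip-"):
    """Split 'archive.zip-image01.jpg' into ('archive.zip', 'image01.jpg').

    Hypothetical helper: splits at the first occurrence of ".zip-", so
    hyphens elsewhere in the archive or entry name do not cause a split.
    """
    head, sep, entry = identifier.partition(marker)
    if not sep or not entry:
        raise ValueError(f"no archive/entry pair in {identifier!r}")
    # Put the ".zip" extension back on the archive name; drop the hyphen.
    return head + marker[:-1], entry
```

An entry name that itself contains `.zip-` would still be handled, since only the first occurrence is used, but an archive name containing it would not. A real scheme would need a delimiter that cannot appear in the archive name.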

The Source now has both names. There are some classes built into the JDK for decompressing data from zip archives and I know there are third-party libraries for handling other archive formats. Then it would just be a matter of the Source creating the plumbing to read from this stream of decompressed data instead of directly from the file/object. None of the rest of the system would have to know that the image is stored in a zip.
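The "read from a stream of decompressed data" step could look roughly like the sketch below. It uses Python's standard `zipfile` module as a stand-in for the JDK classes mentioned above; `open_entry` is a hypothetical name, and this is not how a Cantaloupe Source is actually structured.

```python
import io
import zipfile

def open_entry(archive_path, entry_name):
    """Return the decompressed bytes of one archive entry as a stream.

    Only the requested entry is decompressed; nothing is written to disk.
    Note that this sketch loads the whole entry into memory, which a real
    implementation would likely want to avoid for very large images.
    """
    with zipfile.ZipFile(archive_path) as archive:
        with archive.open(entry_name) as member:
            return io.BytesIO(member.read())
```

The returned object behaves like any other readable byte stream, so downstream code would not need to know the image came from an archive.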

All in all, it doesn't seem too bad.

One thing to be aware of is that when the image data resides in a compressed archive, it's not possible to seek around in it, and so there should be a severe performance penalty for image formats that would benefit from seeking, like JP2 or pyramidal+tiled TIFF. Maybe this could be avoided when using an uncompressed archive, depending on how those are structured (I don't know).
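On the open question about uncompressed archives: in the ZIP format, an entry written with the "stored" method is kept as a contiguous run of raw bytes after its local file header, so in principle it can be addressed by offset in the archive file. The sketch below (Python, with the hypothetical helper `stored_entry_range`, and assuming an unencrypted archive) computes that byte range and returns `None` for compressed entries.

```python
import struct
import zipfile

def stored_entry_range(archive_path, entry_name):
    """Return (offset, length) of a stored entry's raw bytes in the archive,
    or None if the entry is compressed and cannot be sliced directly."""
    with zipfile.ZipFile(archive_path) as archive:
        info = archive.getinfo(entry_name)
    if info.compress_type != zipfile.ZIP_STORED:
        return None
    with open(archive_path, "rb") as f:
        # The fixed part of the local file header is 30 bytes; its last
        # four bytes give the lengths of the variable-size file name and
        # extra field that sit between the header and the entry data.
        f.seek(info.header_offset)
        header = f.read(30)
    name_len, extra_len = struct.unpack("<HH", header[26:30])
    return info.header_offset + 30 + name_len + extra_len, info.file_size
```

With that range, a reader could seek to `offset + n` in the archive file to reach byte `n` of the image, which is what tiled or pyramidal formats need. Whether this is worth wiring into a Source is a separate question.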

Finally, setting aside all of the above (and also the question of whether serving images out of zip files is considered good practice :smile:) there is the question about whether this is a common enough use case to justify supporting it.

DiegoPino commented 4 years ago

Hi @adolski, thank you for such a fast and positive reply.

I totally agree with everything you wrote (including that this is a weird use case), but it's also one that makes a lot of sense for us locally, for several reasons: smaller storage, fewer and simpler files, and simpler operations on related digital assets (book pages). It also makes sense outside of our own repository environment, e.g. for natively compressed packages like https://github.com/frictionlessdata/specs (datapackages), where things are always deposited/stored as ZIPs and could eventually need to provide access to media.

I also think that passing extra arguments is not the best option, since it kind of breaks the formally defined API specs (I do like your PDF option; I feel it's a necessity, and we can live with that exception). Using a combined ID of ZIP file name + internal file name makes sense. It's actually not so different from what we have been doing on our older repos and your server, where the ID we pass is a fully encoded URI. E.g. in PHP you can access a file like

zip://archive.zip#dir/bigimage1.jp2

That notation shows pretty well how a file could be accessed from a single ID. From there, as you say, it seems the task could be delegated to either something in the JDK or even Ruby, or whatever plumbing you feel is more advantageous.
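To make the single-ID idea concrete, here is a minimal sketch of parsing an identifier written in the `zip://archive.zip#dir/bigimage1.jp2` notation. It is in Python and uses hypothetical helper names; it is not an existing Cantaloupe feature. Because `#` and `/` have special meaning in URLs, such an identifier would need to be percent-encoded before being placed in an image request, which the second helper shows.

```python
from urllib.parse import quote, unquote

ZIP_PREFIX = "zip://"

def parse_zip_identifier(identifier):
    """Split 'zip://archive.zip#dir/file.jp2' into (archive, entry)."""
    if not identifier.startswith(ZIP_PREFIX):
        raise ValueError(f"not a zip identifier: {identifier!r}")
    # Everything before the first '#' is the archive, the rest the entry.
    archive, sep, entry = identifier[len(ZIP_PREFIX):].partition("#")
    if not archive or not entry:
        raise ValueError(f"missing archive or entry in {identifier!r}")
    return archive, entry

def encode_identifier(identifier):
    """Percent-encode every reserved character, including '/' and '#'."""
    return quote(identifier, safe="")

def decode_identifier(encoded):
    """Reverse encode_identifier()."""
    return unquote(encoded)
```

A server-side hook would decode the identifier first and then call `parse_zip_identifier` to obtain the two names.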

It's a real use case and I would love to help implement this. It makes sense to us, so why not? 😅

I'm not really sure about the good practice part. In our case, having ways of checking a whole package (checksumming, state changes in, e.g., S3 storage policies, batch PRONOM processing), and also storing extra metadata inside the same package (in a manifest), makes things easier, especially in repositories where the number of files can reach a few million or more. It would be quite opaque to the end user, and your system has a very good (quite perfect) cache system, which in fact (I could have read the code wrong) means a source cache could get around the problem of seeking, as with JP2 or pyramidal+tiled TIFF? That is, the first time an image is served, the extracted file would be cached, and from then on, for a period of time, it would be served from the cache, which means direct, seekable access to the file, right?
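The extract-once-then-serve-from-cache idea can be sketched as follows. This is a generic illustration in Python with a hypothetical helper, `cached_entry_path`; it does not reflect how Cantaloupe's source cache is actually implemented, and it leaves out expiry entirely.

```python
import hashlib
import os
import shutil
import zipfile

def cached_entry_path(archive_path, entry_name, cache_dir):
    """Extract one entry into cache_dir on first use; return the cached path.

    Later calls for the same archive and entry reuse the extracted file,
    which is an ordinary file on disk and therefore fully seekable.
    """
    key = hashlib.sha256(f"{archive_path}#{entry_name}".encode()).hexdigest()
    target = os.path.join(cache_dir, key + os.path.splitext(entry_name)[1])
    if not os.path.exists(target):
        os.makedirs(cache_dir, exist_ok=True)
        tmp = target + ".part"
        with zipfile.ZipFile(archive_path) as archive, \
                archive.open(entry_name) as src, open(tmp, "wb") as dst:
            shutil.copyfileobj(src, dst)  # stream in chunks, not all at once
        os.replace(tmp, target)  # make the file visible only when complete
    return target
```

The cost of decompression is paid on the first request only; subsequent tile requests can seek within the cached file as they would for any image stored directly on disk.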

I hope someone else finds this use case interesting. Please let me know if there is anything I could do to get this started, if there is enough interest of course.

Thanks again!

cantaloupe-project / cantaloupe

Serve Images directly from a ZIP file #345