URL format for linking to resources inside ZIP files

hubgit commented 10 years ago

I have a situation where some resources (e.g. a PDF file) are in the same directory as the datapackage.json file, and some other files (e.g. the source files used to generate the PDF file) are in a nearby ZIP archive.

I would like the resources section of the data package description file to describe resources contained in the ZIP archive, so that a preview can be shown without having to fetch and extract the ZIP file (which may be large) to get to a data package description file inside it.

To do this, the resource URLs need to link to locations inside the ZIP file. There are a few existing approaches to this, using URL fragments or similar:

1. http://example.org/package.zip#path=path/to/the/file: similar to the way individual pages of PDF files can be referenced, but using a path parameter name.
2. http://example.org/package.zip#path/to/the/file: as I can't think of any other use for a fragment identifier on a ZIP file URL other than to link to a path within it, this seems cleaner than using the path parameter. This is also how PHP's stream wrappers allow paths within ZIP files to be accessed.
3. jar:package.zip!path/to/the/file: the JAR URL scheme, which is said to work "for any ZIP based file".

Is there any preference for which of these formats should be used in the resources section of Data Package JSON?
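For comparison, here is a minimal sketch of how a reader might split each of the three formats listed above into the archive URL and the path inside the archive. The function name split_zip_reference is hypothetical, and the handling of the jar form follows the example as written in this thread rather than any formal definition of the scheme.

```python
from urllib.parse import parse_qs, urldefrag


def split_zip_reference(url):
    """Return (archive_url, inner_path) for the three candidate formats.

    Hypothetical helper for illustration; not part of any Data Package tool.
    """
    if url.startswith("jar:"):
        # jar:<archive>!<entry>: the entry is everything after the last "!"
        archive, separator, entry = url[len("jar:"):].rpartition("!")
        if not separator:
            raise ValueError("jar URL has no '!' separator: " + url)
        return archive, entry.lstrip("/")

    # Fragment-based formats: everything after "#" names the inner file
    archive, fragment = urldefrag(url)
    if not fragment:
        raise ValueError("no path inside the archive: " + url)
    if fragment.startswith("path="):
        # Format 1: the fragment carries a "path" parameter
        return archive, parse_qs(fragment)["path"][0]
    # Format 2: the fragment is the path itself
    return archive, fragment
```

All three example URLs resolve to the same inner path, path/to/the/file, so the choice between them is mostly about how clients that do not know the convention will behave.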

rufuspollock commented 10 years ago

Great question and use case. I'd have to say I'd incline to option 2, namely:

http://example.org/package.zip#path/to/the/file

But happy to hear any other ideas and comments.

@paulfitz @jpmckinney any thoughts?

jpmckinney commented 10 years ago

jar has the benefit of being an IANA-registered URI scheme.
To match PHP, the URI scheme would have to be zip not http.

However, it depends on whether or not we want "dumb" clients to just download the ZIP (multiple times, since they're ignoring the URI fragment) if they try to download all the resources in a package - in which case the scheme should be http - or if we want to support only "smart" clients that correctly parse the URI - in which case the jar or zip schemes should be used (preferably jar, since zip is just one programming language's opinion).

paulfitz commented 10 years ago

If I understand the use-case correctly, this feature requires server-side support, since the goal is to avoid fetching the entire zip file. In that case, a http scheme with whatever URL arrangement strikes the fancy of the hosting site seems the way to go. There's no need to agree on anything. The URI fragment proposal seems fine in that case, though maybe awkward on old-school servers where something like http://example.org/package.zip?path=path/to/the/file or http://example.org/package.zip/path/to/the/file might be easier to route (also, if the file in the zip happens to be a html file, you may want to reserve the fragment for its classical use to point to somewhere within that file).

jar: seems nifty but requires client support (so there will be tools that can't read your uris) and on a quick look I don't see any special magic that would avoid fetching the entire file. I could be entirely wrong on that of course, just basing it on the single line jar:<url>!/[<entry>] which feels like a go-fetch-from-web-then-rummage kind of thing.

hubgit commented 10 years ago

The server which serves the ZIP file doesn't need to understand this URL syntax, it just needs to return the ZIP file as normal (it'll never see the fragment identifier, in fact).

This syntax is to enable clients to extract the specified file from the ZIP file once it has been fetched.
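A minimal sketch of that client-side behaviour, assuming the fragment form (option 2) and using only the Python standard library. The names read_member and fetch_resource are hypothetical. Note that the fragment is stripped before the request is made, so the server only ever sees the plain ZIP URL.

```python
import io
import zipfile
from urllib.parse import urldefrag
from urllib.request import urlopen


def read_member(zip_bytes, inner_path):
    """Extract one file from a ZIP archive that is already in memory."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(inner_path)


def fetch_resource(url):
    """Fetch the whole ZIP, then pull out the file named by the fragment."""
    # Split "http://host/package.zip#path/to/file" into URL and inner path;
    # only archive_url is sent to the server.
    archive_url, inner_path = urldefrag(url)
    with urlopen(archive_url) as response:
        return read_member(response.read(), inner_path)
```

This downloads the entire archive for every resource, which is the simple behaviour described here; a real client would probably cache the ZIP between resources.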

jpmckinney commented 10 years ago

@paulfitz Indeed, the proposal is not that publishers should implement unzipping magic on their servers. The proposal is for readers of datapackage.json files to be able to intelligently extract the resources described in the package when those resources are within remote ZIP files.

If we expect those readers to be "smart", we can use a more appropriate, non-HTTP scheme. If we want to support dumb readers that throw any encountered URI at an HTTP client, then we should use an HTTP scheme.

hubgit commented 10 years ago

I can actually see that there could be problems with any of these approaches for clients that aren't expecting an unusual URL (either fetching the zip file multiple times and thinking that they've got the real file, or not understanding the jar protocol). Perhaps it would be best to use a different property, or even a separate section, distinct from resources?

jpmckinney commented 10 years ago

Fetching a ZIP multiple times is not bad. Thinking they've got the file when they've only got the ZIP could be bad, as this would be a silent failure; I think that's a point in favor of using the jar protocol, because then their client will blow up (loud failure) until they implement it; they can easily add exception handling in the meantime so that their client recovers and skips the problematic URI.
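As an illustration of the "loud failure, then recover and skip" idea, here is a small sketch using Python's urllib, which raises URLError for a scheme it has no handler for, such as jar. The function name fetch_all is hypothetical, and other HTTP clients may report unknown schemes differently.

```python
from urllib.error import URLError
from urllib.request import urlopen


def fetch_all(urls):
    """Fetch every URL, recording the ones the client cannot handle."""
    fetched, skipped = {}, []
    for url in urls:
        try:
            with urlopen(url) as response:
                fetched[url] = response.read()
        except URLError as error:
            # Unknown schemes such as "jar:" end up here instead of
            # silently returning the wrong bytes.
            skipped.append((url, str(error.reason)))
    return fetched, skipped
```

With an http URL plus fragment, the same client would succeed and hand back the ZIP bytes as if they were the resource, which is the silent failure described above.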

paulfitz commented 10 years ago

Thanks @hubgit @jpmckinney I think I get it now. I put my mental parentheses in the wrong place when parsing this:

so that a preview can be shown without having to fetch and extract the ZIP file (which may be large)

If fetching is ok but not extracting, then jar: would do the job or fail in an obvious way (just as important). Tools often return useful error messages when hitting an unrecognized scheme, and there is often a plugin mechanism for adding new ones.

paulfitz commented 10 years ago

One other possibility is a data-pipes solution, where a third party site like http://datapipes.okfnlabs.org/ gives you a uri that does the work of extracting what you want from the data source of interest. Then no need for smart clients or magic servers - but you end up reliant on a third party.

jpmckinney commented 10 years ago

I'd rather not introduce a dependency on an external service into datapackages. Publishers of individual datapackages can do what they want, but the recommendation should be to use jar (or whatever URI format we decide on), not to run all ZIP URIs through an external service like datapipes.

hubgit commented 10 years ago

Related: the pyremotezip package fetches a) the 64kb directory from the end of the zip file, then b) a specific file from inside the remote zip, without fetching the whole file (as long as the server supports Range requests).
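To show what step (a) involves, here is a rough sketch of locating the central directory from the tail of a ZIP file. It is not pyremotezip's actual code or API: the names tail_range_header and locate_central_directory are hypothetical, the 64 KB tail size is taken from the description above, and Zip64 archives are not handled.

```python
import struct

EOCD_SIGNATURE = b"PK\x05\x06"  # End of Central Directory record marker
EOCD_FORMAT = "<4s4H2LH"        # fixed 22-byte part of that record
TAIL_SIZE = 64 * 1024           # assumed size of the tail to request


def tail_range_header(size=TAIL_SIZE):
    """HTTP header asking the server for only the last `size` bytes."""
    return {"Range": "bytes=-%d" % size}


def locate_central_directory(tail):
    """Return (entry_count, directory_size, directory_offset) from the tail.

    The offset is measured from the start of the whole archive, so it can
    be used to build the next Range request.
    """
    position = tail.rfind(EOCD_SIGNATURE)
    if position < 0:
        raise ValueError("End of Central Directory record not found")
    fields = struct.unpack_from(EOCD_FORMAT, tail, position)
    # fields: signature, disk numbers (2), entries on disk, total entries,
    # directory size, directory offset, comment length
    return fields[4], fields[5], fields[6]
```

Once the directory offset and size are known, a client can read the directory entries to find the byte range of a single member and request just that range, provided the server honours Range requests.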

Stiivi commented 10 years ago

I think that datapackage should not deal with data access at all. That is a very deep and dark rabbit hole we can fall into. Once we allow zip why not git? And then why not both git as git: and git+http? Once we are there, why not some public postgresql://host:port/database?table=TABLENAME?

I think that datapackage is just metadata and it should stay metadata. The rest should be left to the package readers/writers. They may or may not be able to read certain kinds of URLs, which is perfectly fine if documented.

I would love to see support for ZIP files myself, but I have no clear idea of the best way to implement it before actually trying to work with a couple dozen such packages in the first place.

Please, don't make the standard more complicated before it is even a standard. Wait until people start using it and then gather the knowledge how it was used in various situations.

jpmckinney commented 10 years ago

@Stiivi I don't think it's a slippery slope like you describe. ZIP is an extremely common format, especially on open data catalogs. On several catalogs, it is the most common. I have not yet seen any open data catalog distribute datasets as Git or PostgreSQL. datapackage can add a special case for files contained in ZIP, and it need not support any other containers, for the simple reason that no others are as popular as ZIP.

With respect to standardization processes, typically, once something is a standard, it no longer changes for several years. That's because you can't have people standardize on a moving target (with a few exceptions for special cases like HTML - but datapackages has nowhere near the priority of HTML). A standard needs to be stable and unchanging over long periods of time in order for the standard to acquire strong implementation support. If we want to make changes to datapackages, those changes should be done prior to any standardization effort, not after.

Stiivi commented 10 years ago

Speaking about git - I would not underestimate its importance for the future. And the postgres example was to show possible future ways of storing that ... replace that with any API that might provide data-set like objects. Even OKFN is building a git repository of master data... I see the "master data" as the most useful and the most reusable kind of open data to be wrapped in the "data package" format.

Solving ZIP out of context of other possible stores will lead to a standard that will be hard to manage in the future, since ZIP is just another way of storing the data.

"Open data catalogs" are just part of the data ecosystem. If we ignore corporate data and the possibility of using the data package within corporations, we are ignoring quite a huge amount of resources and blocking incentives for releasing organizational data as open data. You would not believe how useful something like "datapackage" might be for corporations internally.

A standard should be based on existing uses and their variance. We should let the idea generators and evolution do their work first. This feature should be postponed and solved in a more generic way, if we want it to be really useful and don't want to end up with a compatibility and consistency mess in the future.

Stiivi commented 10 years ago

p.s.1: It has to be a moving target in the early stages, since our assumptions are based on our experience only. Unless we see experiences of others, it is pointless to standardize anything.

p.s.2: Data package has the potential to be as important for data as HTML is for documents.

jpmckinney commented 10 years ago

@Stiivi datapackages is not a standard. It's a working draft, and drafts are indeed moving targets. At some point, it will stop moving / it will be decided that it should stop moving, at which point it can work towards becoming a standard. I don't think we're at that point yet. That standardization process typically involves:

1. A broad call for feedback and changes to the draft as a result of that feedback.
2. Once there are no more substantive changes due to feedback, there is a call for implementations. There should be at least two independent implementations of each feature as a way to test the standard and to ensure its documentation is clear and not interpreted in different ways by different implementers.
3. If any features are not implemented, either they can be cut or more time can be taken to get implementations. Once every feature is implemented, the draft can be frozen as "version 1.0".

Right now, we're not even at step 1 of the standardization process (which is fine!). We're still in draft, as prominently noted at the top of the spec, developing the spec, figuring things out, testing assumptions, getting others' experience, etc.

sabas commented 10 years ago

The 7zip file manager uses a simple path to reference folders, so why not use a URL like

folder/archive.zip/internal_folder/file.txt

and leave it to parsers to decide whether it's inside a ZIP file or not?

hubgit commented 10 years ago

@sabas If that was a URL, e.g. http://example.org/folder/archive.zip/internal_folder/file.txt, then any client trying to fetch it would get a '404 Not Found' error, as the actual URL of the containing zip file would be http://example.org/folder/archive.zip.

sabas commented 10 years ago

@hubgit uh, true that..

pwalsh commented 8 years ago

@rgrp is this worth opening up again?

roll commented 8 years ago

Sounds like a real over-complication for the spec implementations.

rufuspollock commented 8 years ago

WONTFIX. Closing as wontfix for now.

I think that:

This is perhaps more a part of the work on a "compression" model for data packages #132.
Generally: I echo some of @Stiivi concerns re complexity.
Probably a "pattern / FAQ": I can see the use case but think this is best as a "pattern" proposal for now. I'm marking it as a pattern or FAQ and, though closing, think this should be migrated to the pattern / FAQ section when that is up.

frictionlessdata / datapackage

URL format for linking to resources inside ZIP files #137