Closed hubgit closed 8 years ago
Great question and use case. I'd have to say I'd incline to option 2, namely:
http://example.org/package.zip#path/to/the/file
But happy to hear any other ideas and comments.
@paulfitz @jpmckinney any thoughts?
zip not http. However, it depends on whether or not we want "dumb" clients to just download the ZIP (multiple times, since they're ignoring the URI fragment) if they try to download all the resources in a package - in which case the scheme should be http - or if we want to support only "smart" clients that correctly parse the URI - in which case the jar or zip schemes should be used (preferably jar, since zip is just one programming language's opinion).
If I understand the use-case correctly, this feature requires server-side support, since the goal is to avoid fetching the entire zip file. In that case, an http scheme with whatever URL arrangement strikes the fancy of the hosting site seems the way to go. There's no need to agree on anything. The URI fragment proposal seems fine in that case, though maybe awkward on old-school servers where something like http://example.org/package.zip?path=path/to/the/file or http://example.org/package.zip/path/to/the/file might be easier to route (also, if the file in the zip happens to be an HTML file, you may want to reserve the fragment for its classical use to point to somewhere within that file).
jar: seems nifty but requires client support (so there will be tools that can't read your URIs) and on a quick look I don't see any special magic that would avoid fetching the entire file. I could be entirely wrong on that of course, just basing it on the single line jar:<url>!/[<entry>] which feels like a go-fetch-from-web-then-rummage kind of thing.
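For illustration, here is a minimal sketch of how a client might split a URI of the form jar:<url>!/[<entry>] into the archive URL and the entry path. The function name parse_jar_uri is hypothetical, and the sketch assumes the "!/" separator from the syntax line quoted above; it does not fetch anything.

```python
def parse_jar_uri(uri):
    """Split 'jar:<url>!/<entry>' into (url, entry); the entry may be empty."""
    if not uri.startswith("jar:"):
        raise ValueError("not a jar: URI: %r" % uri)
    # "!/" separates the URL of the archive from the path inside it.
    url, separator, entry = uri[len("jar:"):].partition("!/")
    if not separator:
        raise ValueError("missing '!/' separator: %r" % uri)
    return url, entry


archive_url, entry = parse_jar_uri(
    "jar:http://example.org/package.zip!/path/to/the/file"
)
```

The client would still have to fetch archive_url in full (or by some other means) before it could read the entry.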
The server which serves the ZIP file doesn't need to understand this URL syntax; it just needs to return the ZIP file as normal (it'll never see the fragment identifier, in fact).
This syntax is to enable clients to extract the specified file from the ZIP file once it has been fetched.
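As a rough sketch of that client-side behaviour, assuming the fragment is simply the path of the member inside the archive: the client strips the fragment, fetches the ZIP from the remaining URL, and reads the named member. The helper names are hypothetical, and an in-memory archive stands in for the downloaded package.zip so that the example runs without a network.

```python
import io
import zipfile
from urllib.parse import urldefrag


def split_zip_resource(uri):
    """Split 'http://host/package.zip#path/in/zip' into (zip_url, member_path)."""
    zip_url, member = urldefrag(uri)
    return zip_url, member


def extract_member(zip_bytes, member):
    """Return the bytes of one member of an already-fetched ZIP archive."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        return archive.read(member)


# Stand-in for the bytes a client would get by fetching zip_url.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("path/to/the/file", "hello from inside the zip")

zip_url, member = split_zip_resource(
    "http://example.org/package.zip#path/to/the/file"
)
content = extract_member(buffer.getvalue(), member)
```

Only zip_url is ever sent to the server; the fragment stays on the client.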
@paulfitz Indeed, the proposal is not that publishers should implement unzipping magic on their servers. The proposal is for readers of datapackage.json files to be able to intelligently extract the resources described in the package when those resources are within remote ZIP files.
If we expect those readers to be "smart", we can use a more appropriate, non-HTTP scheme. If we want to support dumb readers that throw any encountered URI at an HTTP client, then we should use an HTTP scheme.
I can actually see that there could be problems with any of these approaches for clients that aren't expecting an unusual URL (either fetching the zip file multiple times and thinking that they've got the real file, or not understanding the jar protocol). Perhaps it would be best to use a different property, or even a separate section, distinct from resources?
Fetching a ZIP multiple times is not bad. Thinking they've got the file when they've only got the ZIP could be bad, as this would be a silent failure; I think that's a point in favor of using the jar protocol, because then their client will blow up (loud failure) until they implement it; they can easily add exception handling in the meantime so that their client recovers and skips the problematic URI.
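A minimal sketch of that "fail loudly, then skip" pattern, assuming a hypothetical client that only understands http and https; the names SUPPORTED_SCHEMES, UnsupportedSchemeError, check_resource and readable_resources are made up for illustration.

```python
from urllib.parse import urlsplit

# Assumption: the schemes this hypothetical client knows how to fetch.
SUPPORTED_SCHEMES = {"http", "https"}


class UnsupportedSchemeError(ValueError):
    """Raised when a resource URI uses a scheme the client cannot handle."""


def check_resource(uri):
    """Return the URI unchanged, or raise loudly if its scheme is unknown."""
    scheme = urlsplit(uri).scheme
    if scheme not in SUPPORTED_SCHEMES:
        raise UnsupportedSchemeError("unsupported scheme %r in %r" % (scheme, uri))
    return uri


def readable_resources(uris):
    """Split URIs into those the client can read and those it skips."""
    readable, skipped = [], []
    for uri in uris:
        try:
            readable.append(check_resource(uri))
        except UnsupportedSchemeError:
            skipped.append(uri)  # recover and move on to the next resource
    return readable, skipped


readable, skipped = readable_resources([
    "http://example.org/data.csv",
    "jar:http://example.org/package.zip!/path/to/the/file",
])
```

With an http URI plus fragment, the same client would have accepted the URI and downloaded the whole ZIP without noticing anything was wrong.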
Thanks @hubgit @jpmckinney I think I get it now. I put my mental parentheses in the wrong place when parsing this:
so that a preview can be shown without having to fetch and extract the ZIP file (which may be large)
If fetching is ok but not extracting, then jar: would do the job or fail in an obvious way (just as important). Tools often return useful error messages when hitting an unrecognized scheme, and there is often a plugin mechanism for adding new ones.
One other possibility is a data-pipes solution, where a third party site like http://datapipes.okfnlabs.org/ gives you a uri that does the work of extracting what you want from the data source of interest. Then no need for smart clients or magic servers - but you end up reliant on a third party.
I'd rather not introduce a dependency on an external service into datapackages. Publishers of individual datapackages can do what they want, but the recommendation should be to use jar (or whatever URI format we decide on), not to run all ZIP URIs through an external service like datapipes.
Related: the pyremotezip package fetches a) the 64kb directory from the end of the zip file, then b) a specific file from inside the remote zip, without fetching the whole file (as long as the server supports Range requests).
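To make the general technique concrete (this is a sketch of the idea, not pyremotezip's actual code): the tail of a ZIP ends with an "end of central directory" record that says where the directory starts and how large it is, so a client can ask for just that byte range next. The function names are hypothetical, ZIP64 archives and signature bytes appearing inside an archive comment are not handled, and an in-memory archive stands in for the remote file.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
# signature, 4 x uint16, directory size, directory offset, comment length
EOCD_FORMAT = "<4s4H2LH"
EOCD_SIZE = struct.calcsize(EOCD_FORMAT)  # 22 bytes


def locate_central_directory(tail):
    """Given the last bytes of a ZIP, return (entry_count, cd_size, cd_offset)."""
    position = tail.rfind(EOCD_SIGNATURE)
    if position < 0 or position + EOCD_SIZE > len(tail):
        raise ValueError("end of central directory record not found")
    fields = struct.unpack_from(EOCD_FORMAT, tail, position)
    _, _, _, _, entry_count, cd_size, cd_offset, _ = fields
    return entry_count, cd_size, cd_offset


def range_header(offset, size):
    """Build an HTTP Range header for `size` bytes at `offset` (inclusive end)."""
    return {"Range": "bytes=%d-%d" % (offset, offset + size - 1)}


# Stand-in for a remote archive; a real client would fetch only the tail.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("a.csv", "x,y\n1,2\n")
    archive.writestr("path/to/the/file", "hello")
data = buffer.getvalue()

tail = data[-65536:]  # roughly the last 64 KB, as described above
entry_count, cd_size, cd_offset = locate_central_directory(tail)
header = range_header(cd_offset, cd_size)
```

The directory entries fetched with that header give each member's offset, which is what a second Range request for a single file would use.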
I think that datapackage should not deal with data access at all. That is a very deep and dark rabbit hole we can fall into. Once we allow zip, why not git? And then why not both git as git: and git+http? Once we are there, why not some public postgresql://host:port/database?table=TABLENAME?
I think that datapackage is just metadata and it should stay metadata. The rest should be left to the package readers/writers. They may or may not be able to read certain kinds of URLs, which is perfectly fine if documented.
I would love to see support for ZIP files myself, but I have no clear idea what the best way of implementing it would be before actually trying to work with a couple dozen such packages in the first place.
Please, don't make the standard more complicated before it is even a standard. Wait until people start using it and then gather the knowledge how it was used in various situations.
@Stiivi I don't think it's a slippery slope like you describe. ZIP is an extremely common format, especially on open data catalogs. On several catalogs, it is the most common. I have not yet seen any open data catalog distribute datasets as Git or PostgreSQL. datapackage can add a special case for files contained in ZIP, and it need not support any other containers, for the simple reason that no others are as popular as ZIP.
With respect to standardization processes, typically, once something is a standard, it no longer changes for several years. That's because you can't have people standardize on a moving target (with a few exceptions for special cases like HTML - but datapackages has nowhere near the priority of HTML). A standard needs to be stable and unchanging over long periods of time in order for the standard to acquire strong implementation support. If we want to make changes to datapackages, those changes should be done prior to any standardization effort, not after.
Speaking about git - I would not underestimate its importance for the future. And the postgres example was to show possible future ways of storing that ... replace that with any API that might provide data-set like objects. Even OKFN is building a git repository of master data... I see the "master data" as the most useful and the most reusable kind of open data to be wrapped under "data package" format.
Solving ZIP out of context of other possible stores will lead to a standard that will be hard to manage in the future, since ZIP is just another way of storing the data.
"Open data catalogs" are just part of the data ecosystem. If we ignore corporate data and the possibility of using the data package within corporations, we are ignoring quite a huge amount of resources and blocking incentives for releasing organizational data as open data. You would not believe how useful something like "datapackage" might be for corporations internally.
A standard should be based on existing uses and their variance. We should let the idea generators and evolution do their work first. This feature should be postponed and solved in a more generic way, if we want it to be really useful and don't want to end up with a compatibility and consistency mess in the future.
p.s.1: It has to be a moving target in the early stages, since our assumptions are based on our experience only. Unless we see experiences of others, it is pointless to standardize anything.
p.s.2: Data package has the potential to be as important for data as HTML is for documents.
@Stiivi datapackages is not a standard. It's a working draft, and drafts are indeed moving targets. At some point, it will stop moving / it will be decided that it should stop moving, at which point it can work towards becoming a standard. I don't think we're at that point yet. That standardization process typically involves:
Right now, we're not even at step 1 of the standardization process (which is fine!). We're still in draft, as prominently noted at the top of the spec, developing the spec, figuring things out, testing assumptions, getting others' experience, etc.
The 7zip file manager uses a simple path to reference folders, so why not use a URL like folder/archive.zip/internal_folder/file.txt and leave it to parsers to decide whether it's inside a zip file or not?
@sabas If that was a URL, e.g. http://example.org/folder/archive.zip/internal_folder/file.txt, then any client trying to fetch it would get a '404 Not Found' error, as the actual URL of the containing zip file would be http://example.org/folder/archive.zip.
@hubgit uh, true that..
@rgrp is this worth opening up again?
Sounds like a real over-complication for the spec implementations.
WONTFIX. Closing as wontfix for now.
I think that:
I have a situation where some resources (e.g. a PDF file) are in the same directory as the datapackage.json file, and some other files (e.g. the source files used to generate the PDF file) are in a nearby ZIP archive.
I would like the resources section of the data package description file to describe resources contained in the ZIP archive, so that a preview can be shown without having to fetch and extract the ZIP file (which may be large) to get to a data package description file inside it.
To do this, the resource URLs need to link to locations inside the ZIP file. There are a few existing approaches to this, using URL fragments or similar:
1. http://example.org/package.zip#path=path/to/the/file: similar to the way individual pages of PDF files can be referenced, but using a path parameter name.
2. http://example.org/package.zip#path/to/the/file: as I can't think of any other use for a fragment identifier on a ZIP file URL other than to link to a path within it, this seems cleaner than using the path parameter. This is also how PHP's stream wrappers allow paths with ZIP files to be accessed.
3. jar:package.zip!path/to/the/file: the JAR URL scheme, which is said to work "for any ZIP based file".
Is there any preference for which of these formats should be used in the resources section of Data Package JSON?