Possibly Packaging on the Web could be applied here (although this is still in draft stage).
I don't think that Packaging on the Web covers the scenario where you only want to download what's changed (like diff files). It seems like a good idea, but it perhaps belongs in the HTTP space, using ETags the way git uses commit hashes to compute diffs, etc.
my use-case is actually simpler: the data we're dealing with is already a stream of events and so you don't need to compute a diff; you just need to send the events from version M to version N.
Isn't that application-level functionality? Your server could offer query string params like: http://.../dataset?from=M&to=N
'Dat' is designed for incremental updates to keep your local copy in sync with the remote copy. There are some similarities with git, but designed for tabular data (CSV or JSON).
To the original example: the way `git clone` and `git fetch` work is by using pre-generated (binary) index files and the “dataset” (the trees and blobs) to make decisions about which discrete pieces of the dataset they need to download (by checking all the objects in both repositories are connected in the same way).
Pre-generated indexes and liberal walking of the dataset seem like a pretty portable pattern to me (and I've seen it in other places, like CPAN), but it's probably not something for a standard?
There appear to be three parts to this pattern: a pre-generated index, the dataset itself, and client logic that walks both to decide which pieces to fetch.
In the case you're describing, @philandstuff, the “index” is the list of changes to the dataset over time. That's a problem that Atom and RSS already (try to) solve. They may not be as efficient as a custom binary format, like that of git's indexes, but one of them is at least an IETF standard. (And if you wanted a binary format, you could always serialise to something like BSON or MessagePack.)
Neither RSS nor Atom natively offers “give me all changes from this point forth” (though neither do git's `index` and `refs`), and I think that's probably better handled by the client. It wouldn't be too complicated to put something together along the lines @maxf suggests, though maybe that should be an extension of the base functionality rather than the core system?
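A minimal sketch of that client-side handling, assuming a hypothetical Atom feed of dataset changes and the feedparser library (the feed URL and the idea of tracking a last-seen timestamp are assumptions, not anything defined above):

```python
# Sketch only: client-side "give me changes since X" over an Atom feed.
# The feed URL is hypothetical; each entry is assumed to carry <updated>.
from calendar import timegm

import feedparser

FEED_URL = "https://example.org/dataset/changes.atom"  # hypothetical

def entries_since(last_seen_epoch):
    """Return feed entries newer than the given UTC epoch timestamp."""
    feed = feedparser.parse(FEED_URL)
    # Atom has no "from this point forth" query, so filter locally.
    return [
        entry for entry in feed.entries
        if entry.get("updated_parsed")
        and timegm(entry.updated_parsed) > last_seen_epoch
    ]
```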
(As an aside, I think calling this discussion “download and streaming” is misleading: what you're talking about is probably more clearly described as “replay of remote events”? “download and streaming” implies, to me at least, multiple ways to consume a single blob of data, either as a batch download or while it's being downloaded.)
Based on feedback here and elsewhere, I agree this isn't so much about "streaming" as it is about downloading a patch to get up to date, so I've updated the issue title to reflect this.
@SteveMarshall those are excellent thoughts. We've been considering either Atom or query params (as @maxf suggests), or both. It's also good that you point out that part of this discussion is asking how much work is done by the client and how much by the server.
Well, even if you want to leave the client doing the absolutely smallest amount of work, you'd still need a way for it to say “I haz this much”, which, in the case of streaming data, typically means a timestamp or some other sequential id, right?
So you're looking at a client that has to:

1. work out the most recent timestamp/id it already has; and
2. request `/path/to/ap?start=<aforementioned-identifier>`.

If the stream dies or is killed before you get everything, you just need to go back to step 1. Even in a setup as complex as git, you're still sending `i haz this hash: <hash>` (unless I'm totally wrong about this, I just use it, I didn't write it).
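A minimal sketch of that loop, assuming a hypothetical endpoint that accepts a `start` query parameter and returns newline-delimited JSON events, each carrying a sequential `id`:

```python
# Sketch only: the "I haz this much" client loop, against a hypothetical
# endpoint that accepts ?start=<id> and streams newline-delimited JSON events.
import json

import requests

ENDPOINT = "https://example.org/path/to/api"  # hypothetical

def sync(last_id, handle_event):
    """Replay events after last_id, retrying from the last seen id on failure."""
    while True:
        try:
            with requests.get(ENDPOINT, params={"start": last_id},
                              stream=True, timeout=30) as resp:
                resp.raise_for_status()
                for line in resp.iter_lines():
                    if not line:
                        continue
                    event = json.loads(line)
                    handle_event(event)
                    last_id = event["id"]  # remember how much we have
            return last_id  # stream finished cleanly
        except (requests.RequestException, ValueError):
            # Stream died part-way: go back to "step 1" with the last id seen.
            continue
```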
I suppose an alternative would be a server that keeps track of the client and how much data was already sent, so then the client would say `i haz this session-id`, which the server would link to a log of data sent and then go about returning the data as-yet-unsent. But this sounds like an error-prone nightmare to me.
The really complex stuff comes when the data you have isn't sequential, so you need to say something like `i haz these bits, gimme the rest`, but from what I'm reading here, this isn't your situation.
@danielquinn Yeah, having the server keep track of things worries me: how long do you maintain that session? How do you allow clients with an expired session to continue? You end up storing lots of stateful data, having to persist that, and so on, and then having to solve the stateless problem anyway.
Thinking more about the general problem of getting only the bits of the index the client wants, one option might be to use the HTTP `If-Modified-Since` and `Accept` headers.
For example, a client might say “I have everything up to this date”, and the server can reply with `304 Not Modified` and no data if there's been no change, or (in the simplest mode of operation) `200 OK` and the entire dataset. That places no burden on the server to support dynamic responses, and is supported by lots of already-existing software using little more than file modification dates.
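A minimal sketch of that conditional-GET behaviour, assuming a hypothetical dataset URL and a stored HTTP-date from the last successful fetch:

```python
# Sketch only: conditional GET with If-Modified-Since against a hypothetical
# dataset URL. 304 means "nothing new"; 200 (simplest mode) is the full dataset.
import requests

DATASET_URL = "https://example.org/dataset"  # hypothetical

def fetch_if_changed(last_fetched_http_date):
    """last_fetched_http_date is an HTTP-date, e.g. 'Wed, 21 Oct 2015 07:28:00 GMT'."""
    resp = requests.get(
        DATASET_URL,
        headers={"If-Modified-Since": last_fetched_http_date},
    )
    if resp.status_code == 304:
        return None  # nothing has changed since that date
    resp.raise_for_status()
    return resp.content  # the full (updated) dataset
```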
You could then extend that with something like (making things up using MIME Content-Type parameters) `Accept: application/atom+xml;changes-only=true` if the client supports the give-me-just-the-changes mode of operation. To allow servers that don't support that mode to respond more usefully, the clients should probably actually send something like `Accept: application/atom+xml;changes-only=true, application/atom+xml;q=0.4`, then the servers can send them the full current dataset if they can't send a replay.
To my mind, that behaviour is pretty HTTP-native, idempotent, predictable, and supportive of all levels of client and server.
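For illustration, that request might look something like this (the `changes-only=true` parameter is the made-up one from above, and the URL is hypothetical); the response's Content-Type then tells the client which mode the server chose:

```python
# Sketch only: offer the made-up changes-only variant preferentially, but
# still accept the full feed from servers that don't support it.
import requests

FEED_URL = "https://example.org/dataset/changes.atom"  # hypothetical

resp = requests.get(
    FEED_URL,
    headers={
        "Accept": "application/atom+xml;changes-only=true, "
                  "application/atom+xml;q=0.4",
    },
)
# The server's Content-Type indicates whether we got a replay or the full feed.
served_as = resp.headers.get("Content-Type", "")
```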
@SteveMarshall if you wanted to make it even more HTTP-native, you could maybe use the HTTP `Range` header.
@philandstuff Yeah! I always forget `Range` is useful for more than just bytes! :ok_hand:
@SteveMarshall except that no one has bothered registering any other units :( http://www.iana.org/assignments/http-parameters/http-parameters.xhtml#range-units
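For illustration, a request using a hypothetical (unregistered) `events` range unit might look like the sketch below; per RFC 7233, a server that doesn't recognise the unit just ignores the header and returns the full representation:

```python
# Sketch only: a Range request with a hypothetical, unregistered "events" unit.
import requests

resp = requests.get(
    "https://example.org/dataset",       # hypothetical
    headers={"Range": "events=1042-"},   # everything after event 1042
)
if resp.status_code == 206:
    changes = resp.content  # partial content: just the requested events
else:
    changes = resp.content  # server ignored the unit and sent everything (200)
```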
In the absence of a defined user-need, I'm closing this. This could be an excellent candidate for a standard, if we can find the right use-case.
Are there any standards around bulk download and streaming of datasets? I'm thinking about `git clone` and `git pull` style operations for a dataset so I can download a dataset then, at a later time, fetch everything that's changed since.