bluesky / tiled

API to structured data
https://blueskyproject.io/tiled
BSD 3-Clause "New" or "Revised" License

Add a transcoding queue for large transcoding jobs #125

Open danielballan opened 2 years ago

danielballan commented 2 years ago

For datasets up to some byte size, GET /array/full/some_array simply returns the data directly. For datasets above that size, we reach one or two limitations:

Therefore, we must transcode it to some persistent storage (e.g. POSIX filesystem, blob storage) and, later, spool it from there. The artifact can be retained for some time to make subsequent requests go faster, as long as space allows.

The client-side experience will be much like when you download a directory of files from Google Drive. Google packs up the files into a zip archive and then, when that's complete, gives you a download link. The proposed flow is:

GET /array/full/very_large_array -> 202 Accepted Location: /queue/{job_id}
GET /queue/{job_id} -> 200 OK (body contains stats like how long it has been running)
Later...
GET /queue/{job_id} -> 303 See Other Location: /download/{job_id}
GET /download/{job_id} -> 200 OK (body is large data)
Later...
GET /download/{job_id} -> 410 Gone (artifact has been garbage collected)

This flow is informed by https://farazdagi.com/2014/rest-and-long-running-jobs/ which has disappeared but is still available at https://web.archive.org/web/20200511220257/https://farazdagi.com/2014/rest-and-long-running-jobs/.
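
To make the 202 / 303 flow above concrete, here is a minimal client-side sketch. It is not Tiled's client code. The names `fetch_large_array`, `send`, and `make_fake_server`, and the job id `abc`, are hypothetical and exist only for illustration. `send` is assumed to perform a single GET without following redirects and to return a `(status, headers, body)` tuple, so the logic can be exercised without a real server.

```python
import time
from typing import Callable, Dict, Tuple

# (status_code, headers, body): a deliberately tiny stand-in for an HTTP response.
Response = Tuple[int, Dict[str, str], bytes]


def fetch_large_array(
    send: Callable[[str], Response],
    url: str,
    poll_interval: float = 1.0,
    max_polls: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Follow the proposed 202 -> 303 -> 200 flow and return the payload.

    `send` (hypothetical) performs one GET and does NOT follow redirects.
    """
    status, headers, body = send(url)
    if status == 200:
        # Small dataset: the server returned the data directly.
        return body
    if status != 202:
        raise RuntimeError(f"unexpected status {status} from {url}")

    queue_url = headers["Location"]
    for _ in range(max_polls):
        status, headers, body = send(queue_url)
        if status == 303:
            # The job has finished; the artifact is ready for download.
            download_url = headers["Location"]
            status, headers, body = send(download_url)
            if status == 410:
                raise RuntimeError("artifact was garbage collected before download")
            if status != 200:
                raise RuntimeError(f"unexpected status {status} from {download_url}")
            return body
        if status != 200:
            raise RuntimeError(f"unexpected status {status} from {queue_url}")
        # 200 OK from the queue means "still running"; wait and poll again.
        sleep(poll_interval)
    raise TimeoutError(f"job at {queue_url} not ready after {max_polls} polls")


def make_fake_server(polls_until_ready: int, payload: bytes) -> Callable[[str], Response]:
    """In-memory stand-in for the server, only to exercise the client logic."""
    state = {"polls": 0}

    def send(url: str) -> Response:
        if url == "/array/full/very_large_array":
            return 202, {"Location": "/queue/abc"}, b""
        if url == "/queue/abc":
            state["polls"] += 1
            if state["polls"] > polls_until_ready:
                return 303, {"Location": "/download/abc"}, b""
            return 200, {}, b'{"elapsed_seconds": 1}'
        if url == "/download/abc":
            return 200, {}, payload
        return 404, {}, b""

    return send
```

Calling `fetch_large_array(make_fake_server(2, b"data"), "/array/full/very_large_array", sleep=lambda s: None)` walks through two "still running" polls, then the redirect, then the download. With a real HTTP library, `send` would wrap a GET with automatic redirect following turned off, so that the 303 stays visible to the polling loop.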

We made a very rough prototype of this flow in https://github.com/bluesky/suitcase-server/blob/master/suitcase_server/handlers.py. It has a lot of problems and never got as far as being used but it may be a useful reference.

Requirements:

This is a prerequisite for #43 because trees are likely to be large --- i.e. "Download everything below this node as an HDF5 file".

dylanmcreynolds commented 2 years ago

I think this is great for a lot of reasons, including the ability to grab large data sets in a batch style mode to produce training sets for AI models...something that @kleinhenz has talked about.

dylanmcreynolds commented 2 years ago

For notifying the client that the package is ready, what about supporting something that can generate events and be consumed by https://developer.mozilla.org/en-US/docs/Web/API/EventSource?

danielballan commented 2 years ago

Oh, cool, I wasn't aware of that feature. At first glance, it looks like this would require separate client capability. For example, there are Python libraries for it but httpx and requests do not encompass it. I guess it also isn't something you could use from curl.

How would you feel about starting with the lo-fi 202 / 303 implementation and then adding the EventSource option in a follow-up?

dylanmcreynolds commented 2 years ago

Sure. The EventSource mechanism seems like a handy way to get one-way notification from the server back to clients. The main difference, it seems to me, is that 202/303 requires the client to poll regularly, whereas EventSource takes advantage of a long-lasting connection. In both cases, the client is going to have to do something in the background, which I suspect is the real new thing.

Again, the spec is new to me, but I suspect that we're going to want to have done EventSource if/when someone gets serious about developing browser-based apps that talk to Tiled.
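
For reference, the wire format behind EventSource (`text/event-stream`) is simple enough that a client without built-in support can parse it by hand. The sketch below is a minimal parser for that format, handling only the `event` and `data` fields. The event name `job_ready` and the idea of sending the download path as the event data are assumptions for illustration, not anything Tiled has defined.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from the lines of a text/event-stream body.

    Minimal sketch: handles `event:` and `data:` fields and comment lines;
    ignores `id:` and `retry:`.
    """
    event = "message"  # default event type when no `event:` field is given
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the event; dispatch if any data was seen.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            # Comment line, often used by servers as a keep-alive.
            continue
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)
```

A server that pushed a hypothetical `job_ready` event would let the client skip polling `/queue/{job_id}` and go straight to the download once the event arrives. The client still has to hold the connection open in the background, as noted above.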

danielballan commented 2 years ago

I agree. And I could be persuaded to flip the order. I guess I'm just advocating that we lay track for doing both, because I think we will eventually want both.

dylanmcreynolds commented 2 years ago

I agree with that. And on both the client and server, I'll bet that the two implementations share a lot of track.

danielballan commented 2 years ago

Circling back to think about this almost a year later…

danielballan commented 2 years ago

Also, at some point we discussed reusing artifacts across repeated requests, but this is complicated by permissions, i.e. the way different users have different views into a node. We would also have to handle the possibility of mutation/modification.

Each asset should be bound to a specific job to start.
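
One way to picture "one artifact per job" together with the 410 Gone step from the proposed flow is a small registry keyed by job id. This is a rough in-memory sketch under stated assumptions: `Artifact`, `ArtifactStore`, and the fixed time-to-live policy are all hypothetical, and a real implementation would also delete the file or blob when collecting.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Artifact:
    job_id: str        # each artifact belongs to exactly one job; no sharing
    path: str          # where the transcoded data was written
    expires_at: float  # time after which it may be garbage collected


class ArtifactStore:
    """Hypothetical registry mapping job ids to their single artifact."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._live: Dict[str, Artifact] = {}
        self._collected: Set[str] = set()

    def register(self, job_id: str, path: str, now: Optional[float] = None) -> Artifact:
        now = time.time() if now is None else now
        artifact = Artifact(job_id, path, expires_at=now + self._ttl)
        self._live[job_id] = artifact
        return artifact

    def collect_garbage(self, now: Optional[float] = None) -> int:
        """Forget expired artifacts and return how many were collected."""
        now = time.time() if now is None else now
        expired = [j for j, a in self._live.items() if a.expires_at <= now]
        for job_id in expired:
            # A real store would also remove the underlying file or blob here.
            del self._live[job_id]
            self._collected.add(job_id)
        return len(expired)

    def status_for(self, job_id: str, now: Optional[float] = None) -> int:
        """HTTP status that GET /download/{job_id} would return."""
        self.collect_garbage(now)
        if job_id in self._live:
            return 200
        if job_id in self._collected:
            return 410  # existed once, has since been garbage collected
        return 404
```

Because the key is the job id and not the requested node, two users requesting the same node get separate artifacts, which sidesteps the permission and mutation questions at the cost of duplicated storage.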

danielballan commented 2 years ago

Summarizing discussion with @dylanmcreynolds