CRREL / GRiD-API


Allow developers to create docker-based tasks through the API #21

Open hobu opened 7 years ago

hobu commented 7 years ago

There are many situations where GRiD's API for fetching data is simply not going to be good enough. The most common is that people want to run things over very large volumes of data when the pipes to transmit the data are not big enough. Consider the following scenarios:

1) Alpha organization wants to run an automated feature extraction algorithm, tuned with their own parameters and settings, over a city-sized region. At the moment, the only way for them to achieve that task is to fetch a city-sized region from GRiD, which is often hundreds of GB in size, and then run it themselves on their own servers. Any immediacy requirements in the mix mean they end up replicating GRiD's entire data holdings to achieve these tasks rather than offloading them to GRiD as intended.

2) Beta algorithm researcher wants to test variants of her algorithm on the same 125 GB patch. After a week of iterating using GRiD's API by continually pushing new iterations of the tool, the algorithm is verified to do what it says on the label, and the researcher now wants to allow other GRiD users to run her algorithm over their own 100+ GB patches of data.

3) Gamma GRiD developer was tasked with integrating a one-off TDA for a small group of GRiD users.

I would like to propose the following additions to GRiD's APIs:

I would propose that our first implementation support only API use and consumption. API consumers would be on their own to manage the business logic of the contents of their JSON arguments. The pattern I'm proposing seems rather obvious, and I wonder whether there are existing implementations of it that have been thought through with more care.
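To make the "JSON arguments" idea a bit more concrete, here is a minimal sketch of what a client-side task submission might look like. Everything in it is an assumption for illustration only: the `build_task_request` helper, the `image`, `command`, and `arguments` field names, and the example image name are all hypothetical and not part of any existing GRiD API. The only point is that GRiD would treat `arguments` as an opaque JSON object and hand it to the container untouched.

```python
import json


def build_task_request(image, arguments, command=None):
    """Assemble a hypothetical task-submission payload.

    `image` names the docker image to run, `arguments` is an arbitrary
    JSON-serializable object that only the container knows how to
    interpret, and `command` optionally overrides the image entrypoint.
    None of these field names come from a real GRiD endpoint.
    """
    if not isinstance(image, str) or not image:
        raise ValueError("image must be a non-empty string")
    # Serialize once up front so non-JSON arguments fail on the client,
    # before anything is sent to the server.
    json.dumps(arguments)
    payload = {"image": image, "arguments": arguments}
    if command is not None:
        payload["command"] = list(command)
    return payload


# Hypothetical request body for the feature-extraction scenario.
request_body = json.dumps(
    build_task_request(
        image="example.org/alpha/feature-extract:1.0",  # made-up image name
        arguments={
            "aoi": "POLYGON((0 0, 0 1, 1 1, 1 0, 0 0))",
            "threshold": 0.5,
        },
    ),
    sort_keys=True,
)
print(request_body)
```

In a scheme like this, the server would only need to validate the envelope (image name, optional command) and could leave the meaning of `arguments` entirely to the task author.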

An implementation of this mechanism would benefit GRiD in a number of important ways. It would further enhance the capabilities of self-service consumers of GRiD data. It would make it even more convenient for the GRiD team to integrate the typical "fetch data, do stuff to it, output data to user" TDA-like tasks that we are often asked to integrate. Finally, it would open up GRiD's access, task management, and cloud resources to a much wider audience.

I know there are tons of gotchas here, but I'm interested in hearing credible technical arguments for or against reorienting some of our architecture to support this mode of operation.

chambbj commented 7 years ago

:+1: to providing this in some form.

FWIW, you can find some documentation on how DigitalGlobe does this with GBDX here. Not that you'd want to replicate exactly, but there may be some additional considerations.

chambbj commented 7 years ago

I think your third scenario is the one I'd always imagined, although I don't think it has to be a "one-off TDA" or a "small group of GRiD users". I see this as just another means of allowing external developers to provide processing capabilities without requiring them to stand up their own external services. You are also no longer forcing them into the PDAL box for integration into the GRiD export workflows.

AlexMountain commented 7 years ago

What do you guys think about https://github.com/hydroshare/django_docker_processes ?

It looks like they've tackled a lot of the overhead in dealing with third-party docker containers, and it uses an architecture similar to what GRiD already has. It does appear to do a bit more than we need, but would you guys say it's similar to what we're looking for here?