bioimage-io / collection

Maintains the resources displayed on bioimage.io (Successor to collection-bioimage-io)
https://bioimage-io.github.io/collection/

Architecture of backend "API" #10

Closed: jmetz closed this issue 2 months ago

jmetz commented 4 months ago

Overview: Present situation

At the moment we don't have an "API" for the backend, or at least only a very minimal one.

A typical workflow from a client when submitting a new model is:

  1. Upload zip to hypha-S3, via the Imjoy library.
  2. Notify CI - currently uses a Netlify function and passes the package-zip-url and resource_id.
  3. Poll status (as we don't have websockets; a rough sketch follows this list):
     a. Query versions.json to get version info.
     b. Build the URL to the correct details.json and other resource-submission-specific files using a hard-coded format, e.g. "${resource_id}/staged/{version_number}/details.json".
     c. Poll this URL to update the client with status info and other details.
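As a rough sketch of what step 3 looks like from a Python client today (the URL layout follows the description above; the base URL, JSON field names and terminal statuses are assumptions for illustration only):

```python
import time

import requests

# Assumed public base URL of the S3 bucket; placeholder for illustration only.
BASE_URL = "https://example-s3-host/bioimageio-collection"


def poll_staged_status(resource_id: str, interval_s: float = 5.0) -> dict:
    """Poll the latest staged submission of `resource_id` until it reaches a terminal status."""
    # a. query versions.json for the versions of this resource (assumed JSON shape)
    versions = requests.get(f"{BASE_URL}/{resource_id}/versions.json").json()
    version_number = max(v["version"] for v in versions.get("staged", []))

    # b. build the hard-coded URL to the submission-specific details.json
    details_url = f"{BASE_URL}/{resource_id}/staged/{version_number}/details.json"

    # c. poll that URL and surface the status to the user
    while True:
        details = requests.get(details_url).json()
        print(details.get("status"))
        if details.get("status") in ("accepted", "rejected", "error"):  # assumed terminal states
            return details
        time.sleep(interval_s)
```

The hard-coded URL format and the JSON field names in this sketch are exactly the kind of "internal" detail that every client currently has to know about.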

A workflow for review would look similar, except that we will have to use Netlify functions for the updates, so that we can write them back to S3.

Problems with this solution

We will in principle eventually have more than one client for this (e.g. web, a dedicated desktop app, napari, fiji, etc.).

This means that if we make even small "internal" changes to how the backend stores things, all clients need to propagate these changes or be broken.

Proposal

We should present a clean, versioned API (REST?) for these operations. This allows us to continue to make changes behind the scenes as we wish, and even when we do make API changes, a versioned API (like many solid APIs!) means we don't risk breaking things.

We should also have a simple mechanism that alerts clients, so that we handle well the times when we do decide on truly breaking changes. This includes a generous transition period during which clients using the old API get a "warning message" with each API request.
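As a purely illustrative sketch of such a mechanism (the route layout, version names and response shape are assumptions, not an existing implementation): a versioned endpoint can keep serving old clients while attaching a warning to every response on a deprecated version.

```python
import json

SUPPORTED_VERSIONS = {"v1", "v2"}   # assumed set of published API versions
DEPRECATED_VERSIONS = {"v1"}        # v1 keeps working during the transition period, but warns


def handle_request(path: str, backend_result: dict) -> dict:
    """Illustrative dispatcher for paths like '/api/v1/resources/<id>/status'."""
    version = path.split("/")[2]    # '/api/<version>/...'
    if version not in SUPPORTED_VERSIONS:
        return {"statusCode": 404, "body": json.dumps({"error": f"unknown API version {version}"})}

    body = dict(backend_result)
    if version in DEPRECATED_VERSIONS:
        # warn old clients with each request instead of breaking them outright
        body["warning"] = "API v1 is deprecated; please migrate to v2."
    return {"statusCode": 200, "body": json.dumps(body)}
```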

MVP Implementation

At the moment, the simplest and quickest way to implement the above is with Netlify functions.

Note that Netlify isn't set up for proper REST API construction, but relatively trivial ways of achieving this exist, e.g. https://netlifpress.netlify.app/ (source: https://github.com/TravelingTechGuy/netlifpress).

oeway commented 4 months ago

I can see that there might be options to make this better, and we should keep an eye on them.

How about we get the upload ready, deployed, and then we collect user/developer feedback before building the next version?

In general, I would prefer a user-driven decision, not a technical one. For example, I am not sure the issue you mentioned is a real issue: upload only happens when the user wants to publish the model, and we can always redirect the user to our upload web portal. It's nice to have a programmatic way of uploading models, but that can easily be part of bioimageio.core, which we maintain. I don't see a future in which every language or client wants to implement its own upload, at least not before completely solving other things like the model training spec, pre/post processing, workflows, and model packaging. Also consider that most of the training and model production happens in Python, so most likely having users just use our core library is good enough to enable programmatic upload.

We also need to separate the model file upload from the status checking/review process. Even if a Java or Rust client wants to implement upload (assuming they had tackled the model training part), they might just print a URL to tell the user to go to our status page. It's unlikely they want to implement the full status display, review and chat flow in a native app; why not simply use the browser for it?

IMO, the actual difference between the Netlify and S3 APIs is a mental one. S3 is also a large key-value store -- we need to design the S3 file paths and JSON format and use those as the public "API". There is no difference between a Netlify endpoint which returns a JSON status and an S3 file.
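To make that concrete (the endpoint, bucket host and path layout below are made up for illustration): from a client's point of view the two options differ only in which URL is fetched, so the design work is the same either way - fixing the path convention and the JSON schema.

```python
import requests

resource_id = "some-resource-id"  # placeholder

# Option A: a Netlify function that returns the status as JSON (hypothetical endpoint)
status_a = requests.get(
    f"https://example.netlify.app/.netlify/functions/status?resource_id={resource_id}"
).json()

# Option B: a plain JSON file at an agreed-upon S3 key (hypothetical path layout)
status_b = requests.get(
    f"https://example-s3-host/bioimageio-collection/{resource_id}/staged/1/details.json"
).json()

# In both cases, the contract clients depend on is the path convention plus the JSON schema.
```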

It's not an urgent decision, is it? We can always create a wrapper API later, e.g. on Netlify, which simplifies the process and makes it more stable if we need to change things. It's a bit like our bioengine model runner API, which is a bit ugly at the moment but serves as a more hackable prototyping interface; it gives us time to collect feedback and design a proper API.

PS: I just found out that S3 also supports select operations, like a SQL database, for JSON and other files: https://github.com/minio/minio/blob/master/docs/select/README.md
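For reference, an S3 Select query against a JSON file looks roughly like this with boto3 (bucket, key and the SQL expression are placeholders that depend on our actual layout, and the feature has to be enabled on the S3/MinIO deployment):

```python
import boto3

s3 = boto3.client("s3")  # assumes credentials are configured in the environment

response = s3.select_object_content(
    Bucket="bioimageio-collection",                    # placeholder bucket
    Key="some-resource-id/versions.json",              # placeholder key
    ExpressionType="SQL",
    Expression="SELECT s.version FROM S3Object[*].staged[*] s",  # depends on the real JSON layout
    InputSerialization={"JSON": {"Type": "DOCUMENT"}},
    OutputSerialization={"JSON": {}},
)

# the result comes back as an event stream
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode())
```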

jmetz commented 4 months ago

There are similarities between our S3 structure and e.g. a REST API, yes, but there are also subtle differences:

For this last point, I am in the process of creating a Netlify function to handle that for us, which then essentially acts as a "basic API", so that the client can actually find out which version it just submitted when it uploaded the file - i.e. the kind of thing that would normally be handled by a standard API.

My rationale here is just to come up with a slightly more future-proof design at this point, so that we don't end up with bigger technical debt down the line because of hastily made decisions earlier on.

oeway commented 4 months ago

Ok, I see your point now, let's dive a bit deeper.

> My rationale here is just to come up with a slightly more future-proof design at this point, so that we don't end up with bigger technical debt down the line because of hastily made decisions earlier on.

I agree, but let's explore whether we can avoid the Netlify layer (keep in mind it's a private company with no open-source alternative, unlike S3) but still achieve the same, still future-proof result; I think we can.

Here is my point: whatever you can do inside the Netlify function, do it inside our CI instead, precompute the content, and save it as a file on S3. This approach won't work in general, but for our case it's enough.

This is very similar to a static site generator vs a dynamic web server; as you know, static sites are far more scalable than dynamic ones.

For the steps below, the CI can just do this to produce a latest.json file that contains everything needed:

  1. Query a versions.json file for the versions of a resource/model.
  2. Determine the last submitted staged version from that data.
  3. Then query the details.json from that version.

I don't see the advantage of doing it dynamically here: if you can foresee the queries (and I believe we can), we can always precompute the answers. We can produce all the files needed by the status page; we can even build the html/js files and store them inside the model's status folder, so one can just visit that S3 link.
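A minimal sketch of that CI step, assuming boto3, a placeholder bucket name and the JSON field names used above (all of which would need to match the real layout):

```python
import json

import boto3

s3 = boto3.client("s3")               # assumes CI credentials are configured
BUCKET = "bioimageio-collection"      # placeholder bucket name


def precompute_latest(resource_id: str) -> None:
    """Aggregate versions.json and the newest staged details.json into a single latest.json."""
    versions = json.loads(
        s3.get_object(Bucket=BUCKET, Key=f"{resource_id}/versions.json")["Body"].read()
    )
    latest_staged = max(v["version"] for v in versions.get("staged", []))  # assumed shape

    details = json.loads(
        s3.get_object(Bucket=BUCKET, Key=f"{resource_id}/staged/{latest_staged}/details.json")["Body"].read()
    )

    latest = {"resource_id": resource_id, "staged_version": latest_staged, **details}
    s3.put_object(
        Bucket=BUCKET,
        Key=f"{resource_id}/latest.json",
        Body=json.dumps(latest).encode(),
        ContentType="application/json",
    )
```

A client then needs only a single request to `{resource_id}/latest.json` instead of the two-step query.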

Authentication is out of the discussion here: it's a public model status and the model is meant for the public, so we don't need access control. The only access control, before file upload, is done by the hypha server. You could implement the hypha login logic and access control, plus the interaction with S3, inside Netlify again, but what for? Why not contribute code to a hypha service instead to fix potential issues?

jmetz commented 4 months ago

That's a good point and a good idea for the "status querying" - I agree we should implement it at the upload stage.

For the second point that I mentioned (access control), I'm not sure I understand what you're suggesting. Is it that the review process should happen with Hypha as the "middle-man" to do the access control, and have Hypha then write into other S3s? I guess my confusion here might be because we're using two S3s (and Hypha is connected to the first one - imjoy.io at the moment; we don't touch Hypha on EBI at the moment, just the S3).

As an offshoot discussion though: Netlify and GitHub are both private companies offering free services (with limits); perhaps GitHub would face more public outcry if it suddenly went fully paid or made other drastic changes, but it's by no means safe from such changes, especially small changes that could have big impacts on us (reduction in free CI time, etc.). From that perspective we really should be using GitLab, Gitea, or some other similar service that offers similar limited free services to GitHub, but also open-source, self-hostable options... Supabase would be something similar for the "serverless functions" side of things (as well as a lot more) - they also have a free tier, paid plans, and self-hosting.

FynnBe commented 4 months ago

Great discussion! Here are my two cents "for the interested reader": I believe changing from GH to another deployment solution can be done quite easily at this point (the logic is mostly in Python scripts, not in the GH workflows). We should keep an eye on keeping it that way! The current pytests showcase how the lifecycle of resources is controlled by the backoffice Python package, rather than by GH workflows.

Re precomputed requests: at this point I wouldn't precompute something that needs two requests just yet; it sounds a bit like premature optimization. Maybe it turns out that the status page shouldn't just show 'latest', but rather we want to show an overview of all versions... so we'd rather need an 'all versions' aggregate. I'm happy to add any such additional files if that makes front-end development easier. To keep our DB clean we can also put those behind another prefix and change them along with the UI.