hannahhoward commented 2 years ago

Goals

Build a provider for https://github.com/application-research/autoretrieve/issues/96

Implementation

You can see the outlines of an SQL database structure in the API Spec for https://github.com/application-research/autoretrieve/issues/96

I recommend:

setup a simple SQL database
define a table for RetrievalAttempts
define a table for ProviderRetrieval where the UUID of the retrieval attempt is a foreign key

Model repo to work off for setting up SQL+HTTP API: Dealbot controller -- https://github.com/filecoin-project/dealbot/tree/main/controller

TBD/Implementer choice: Should we copy the dealbot's pattern of using serialized IPLD for any fields that aren't individually queryable / indexed? This is a nice way to effectively treat SQL like a document database.

I do think we should use bindnode instead of generated schemas though!

willscott commented 2 years ago

i am happy to have the api throw json, and have the sql side service just throw it into dynamo or bigquery or similar.

hannahhoward commented 2 years ago

@willscott should we just make something like this: https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-dynamo-db.html

willscott commented 2 years ago

pretty much, and the rows in dynamo can have dynamic schema as well, so we can just pass through the json from auto-retrieve directly

hannahhoward commented 2 years ago

yea alright I'll write it up that way

hannahhoward commented 2 years ago

seems super easy and avoids writing a lot of new code.

kylehuntsman commented 2 years ago

Just talked to Hannah about this and a couple concerns came up. My first concern is that Dynamo is not the best solution for a persistence store that is going to have data dynamically queried every which way as people want to view the data differently. Dynamo is limited to querying records with a given partition key, or ID, and some minor query-like-abilities on a sort key. General properties are not queryable unless using secondary indexes, in which case we'll be duplicating all of the data. Dynamo is a lot better of a solution when the queries are known ahead of time and do not change too much. My go to for this kind of ask would have been a PostgreSQL table. A relational table makes it much easier to query dynamically.

The other concern, and more or less just something we have to look into, is how infrastructure for this is managed by kubernetes. I have experience using Cloudformation, SAM, and Cloud Development Kit (CDK), but not the kubernetes integration. I don't know if this is a scenario where we can do both, or we must set stuff up via kubernetes, etc. CDK is awesome, but may just not work with our infrastructure management.

willscott commented 2 years ago

is there an alternative to google bigquery in the amazon stack? a postgres table also seems reasonable.
it's okay to have this as a separate endpoint that we don't manage in k8s from my perspective

kylehuntsman commented 2 years ago

We had another discussion around resources and model design after there was some concern around there being a few separate processes that took place over the course of a "retrieval". We've named these seperate processes a phase. events take place inside of phases. The immediate goal being to determine which providers are failing and succeeding.

API Endpoint: POST /retrieval-events
Content-Type: application/json
Request Body: {
   retrievalId:         UUID | The ID of the retrieval
   cid:               string | A queryable string of the CID being asked for
   storageProviderId: string | The peer ID of the storage provider
   phase:             string | Must be "(indexer related?)", "query", "retrieval" 
   phaseStartTime:    string | ISO 8601 UTC datetime string of the time this phase started
   event:             string | The name of the current event. Must be "connected", "query-asked", "proposed", "accepted", "first-byte-received", "phase-failed", "succeeded"
   eventTime:         string | ISO 8601 UTC datetime string of the time this event took place
   eventDetails:        json | Misc. JSON data related to the given event. TBD.
}
Responses:
   201: retrieval event was saved successfully
   Body: No body

   400: validation error
   Body: {
      error:                     string | The error code
      errorMessage:              string | The error message
      errorDetails: {[key:string]: any} | Any information related to this particular error
   }

   500: server error
   TBD

@hannahhoward @rvagg

I couldn't remember what the first phase we had discussed was.
I don't have an exact for how we want the event, failed, and success eventDetails to be structured. Feel free to make suggestions for those based on your needs.
I'm unsure as to all of the 500 level status codes that AWS might throw. Error responses can be automated by AWS.
We never discussed if we want authorization or authentication for this API. We could skip it initially and then add it later?

Feel free to make suggestions/comments/etc.

rvagg commented 2 years ago

Msg body looks good and matches my understanding of the discussion and what I'm working toward (nearly there).

phases: query and retrieval for now, later an indexer phase can be added and potentially more - we can hopefully remove query soon though but may want to save it as a placeholder (maybe we have paid retrieval back in the mix at some point?)
eventDetails for failure can be {"error":"msg"}, for retrieval success I'm just doing {"retrievedSize":x} for now, for query success I was had currently left it blank but we could add stuff later if we want details on that stuff (I suspect not if we're going to use the indexer to tell us who is doing free retrieval and therefore skipping query phase).
returns: I was just going to log non-success events, so for now don't worry too much about return messages and status codes, whatever the default is is fine for now.
authorization: good Q, but I vote for none for now just to get this up this week

rvagg commented 2 years ago

We forgot instanceId in here btw, I've added it to the events; we can call it autoretrieveInstanceId if you prefer

rvagg commented 2 years ago

As per my comment @ https://github.com/application-research/autoretrieve/pull/97#issuecomment-1190259407, here's an example line-delimited JSON event log, with 500 events generated by the current version of that code which you should be able to replay against your service instead of spinning up an instance (although that's not hard either): https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1

filecoin-project / tornado-deploy

Build a simple database + Rest API for autoretrieve recording #2

Goals

Implementation