Closed: hannahhoward closed this issue 2 years ago
I am happy to have the API throw JSON, and have the SQL-side service just throw it into Dynamo or BigQuery or similar.
@willscott should we just make something like this: https://docs.aws.amazon.com/apigateway/latest/developerguide/http-api-dynamo-db.html
pretty much, and the rows in dynamo can have dynamic schema as well, so we can just pass through the json from auto-retrieve directly
yea alright I'll write it up that way
seems super easy and avoids writing a lot of new code.
Just talked to Hannah about this and a couple of concerns came up. My first concern is that Dynamo is not the best solution for a persistence store whose data will be queried dynamically every which way as people want to view it differently. Dynamo is limited to querying records by a given partition key, or ID, plus some minor query-like abilities on a sort key. General properties are not queryable unless you add secondary indexes, in which case we'd be duplicating all of the data. Dynamo is a much better fit when the queries are known ahead of time and don't change much. My go-to for this kind of ask would have been a PostgreSQL table; a relational table makes dynamic querying much easier.
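To illustrate the point about dynamic querying, here's a minimal sketch in Go of the kind of ad-hoc filter building a relational store makes trivial. The table name `retrieval_events`, the column names, and the helper itself are hypothetical, not part of any existing codebase; it only builds a parameterized query string in Postgres placeholder style:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildEventQuery is a hypothetical helper showing why a relational table is
// convenient here: any column can appear in the WHERE clause without
// pre-defined partition keys or secondary indexes. It returns a SQL string
// with $1, $2, ... placeholders plus the matching argument slice.
func buildEventQuery(filters map[string]any) (string, []any) {
	query := "SELECT * FROM retrieval_events"
	if len(filters) == 0 {
		return query, nil
	}
	cols := make([]string, 0, len(filters))
	for col := range filters {
		cols = append(cols, col)
	}
	sort.Strings(cols) // deterministic placeholder ordering
	clauses := make([]string, 0, len(cols))
	args := make([]any, 0, len(cols))
	for i, col := range cols {
		clauses = append(clauses, fmt.Sprintf("%s = $%d", col, i+1))
		args = append(args, filters[col])
	}
	return query + " WHERE " + strings.Join(clauses, " AND "), args
}
```

Doing the equivalent in Dynamo for an arbitrary column would require either a full table scan or a dedicated secondary index per access pattern.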
The other concern, which is more or less just something we have to look into, is how infrastructure for this is managed by Kubernetes. I have experience using CloudFormation, SAM, and the Cloud Development Kit (CDK), but not the Kubernetes integration. I don't know whether this is a scenario where we can do both, whether we must set everything up via Kubernetes, etc. CDK is awesome, but it may just not work with our infrastructure management.
We had another discussion around resources and model design after some concern came up about there being a few separate processes that take place over the course of a "retrieval". We've named these separate processes `phase`s; `event`s take place inside of `phase`s. The immediate goal is to determine which providers are failing and succeeding.
API Endpoint: POST /retrieval-events
Content-Type: application/json
Request Body: {
retrievalId: UUID | The ID of the retrieval
cid: string | A queryable string of the CID being asked for
storageProviderId: string | The peer ID of the storage provider
phase: string | Must be "(indexer related?)", "query", "retrieval"
phaseStartTime: string | ISO 8601 UTC datetime string of the time this phase started
event: string | The name of the current event. Must be "connected", "query-asked", "proposed", "accepted", "first-byte-received", "phase-failed", "succeeded"
eventTime: string | ISO 8601 UTC datetime string of the time this event took place
eventDetails: json | Misc. JSON data related to the given event. TBD.
}
Responses:
201: retrieval event was saved successfully
Body: No body
400: validation error
Body: {
error: string | The error code
errorMessage: string | The error message
errorDetails: {[key:string]: any} | Any information related to this particular error
}
500: server error
TBD
@hannahhoward @rvagg: `eventDetails` is still to be structured; feel free to make suggestions for those based on your needs. Feel free to make suggestions/comments/etc. on the rest as well.
Msg body looks good and matches my understanding of the discussion and what I'm working toward (nearly there).

`query` and `retrieval` for now; later an `indexer` phase can be added, and potentially more. We can hopefully remove `query` soon, though we may want to keep it as a placeholder (maybe we have paid retrieval back in the mix at some point?).

`eventDetails` for failure can be `{"error":"msg"}`; for `retrieval` success I'm just doing `{"retrievedSize":x}` for now; for `query` success I've currently left it blank, but we could add stuff later if we want details there (I suspect not, if we're going to use the indexer to tell us who is doing free retrieval and therefore skip the query phase).

We forgot `instanceId` in here btw; I've added it to the events. We can call it `autoretrieveInstanceId` if you prefer.
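The two `eventDetails` shapes mentioned above can be pinned down as small structs. These type names are my own placeholders, but the JSON they produce matches the `{"error":"msg"}` and `{"retrievedSize":x}` examples:

```go
package main

import "encoding/json"

// failureDetails is the eventDetails blob for "phase-failed" events.
type failureDetails struct {
	Error string `json:"error"`
}

// retrievalSuccessDetails is the eventDetails blob for retrieval "succeeded"
// events; query success currently carries no details.
type retrievalSuccessDetails struct {
	RetrievedSize uint64 `json:"retrievedSize"`
}

func failureJSON(msg string) ([]byte, error) {
	return json.Marshal(failureDetails{Error: msg})
}

func successJSON(size uint64) ([]byte, error) {
	return json.Marshal(retrievalSuccessDetails{RetrievedSize: size})
}
```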
As per my comment @ https://github.com/application-research/autoretrieve/pull/97#issuecomment-1190259407, here's an example line-delimited JSON event log, with 500 events generated by the current version of that code which you should be able to replay against your service instead of spinning up an instance (although that's not hard either): https://gist.github.com/rvagg/a9b941f1b423679659812fff1b9bf0a1
Goals
Build a provider for https://github.com/application-research/autoretrieve/issues/96
Implementation
You can see the outlines of an SQL database structure in the API Spec for https://github.com/application-research/autoretrieve/issues/96
I recommend:
Model repo to work off for setting up SQL+HTTP API: Dealbot controller -- https://github.com/filecoin-project/dealbot/tree/main/controller
TBD/Implementer choice: Should we copy the dealbot's pattern of using serialized IPLD for any fields that aren't individually queryable / indexed? This is a nice way to effectively treat SQL like a document database.
I do think we should use bindnode instead of generated schemas though!
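The dealbot-style pattern can be sketched as follows: columns we want to index and query stay real SQL columns, and everything else is serialized into one blob column. This is a simplified illustration using plain `encoding/json`; in the actual implementation, bindnode from go-ipld-prime would replace the marshal step, and the struct/field names here are placeholders:

```go
package main

import "encoding/json"

// eventRow sketches the document-database-in-SQL pattern: queryable fields
// become individual columns, and everything else is serialized into Details.
type eventRow struct {
	RetrievalID       string
	StorageProviderID string
	Phase             string
	Event             string
	Details           []byte // serialized blob of all non-indexed fields
}

// newEventRow splits an event into indexed columns plus a serialized blob.
// Plain JSON stands in here for the IPLD serialization the text describes.
func newEventRow(retrievalID, spID, phase, event string, rest map[string]any) (eventRow, error) {
	blob, err := json.Marshal(rest)
	if err != nil {
		return eventRow{}, err
	}
	return eventRow{
		RetrievalID:       retrievalID,
		StorageProviderID: spID,
		Phase:             phase,
		Event:             event,
		Details:           blob,
	}, nil
}
```

The upside of this split is that adding a new non-queried field never requires a schema migration; the downside is that promoting a blob field to a queryable column later does.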