Swirrl / datahost-prototypes


POST | GET schemas to join to releases #79

Closed RickMoynihan closed 1 year ago

RickMoynihan commented 1 year ago

We should review the scratch schema code and integrate it as REST API routes.

Ultimately we may want to support many schemas being associated with a single release. However, in this iteration of the prototype we can assume just one.

Multiple schemas may not be necessary, but they may more cleanly support letting people incrementally add commitments, or layer other concerns (e.g. URI generation) onto existing datasets (provided all data historically within that release still conforms to the extra schema).


Implement POST / REDIRECT / GET pattern for creating schemas.

Route:

POST /data/:series-slug/release/:release-slug/schemas Accept: application/ld+json

BODY
{"dh:columns" [{"csvw:datatype" "string"
                "csvw:name" "foo_bar"
                "csvw:titles" ["Foo Bar"]}]}

Note the incoming request body should be normalised.

Server responds: 303 See Other with Location: /data/:series-slug/release/:release-slug/schemas/:auto-incrementing-schema-id
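
For concreteness, here's a hypothetical exchange (the series/release slugs and the schema id are invented for illustration):

    POST /data/my-series/release/2023/schemas HTTP/1.1
    Content-Type: application/ld+json

    {"dh:columns": [{"csvw:datatype": "string",
                     "csvw:name": "foo_bar",
                     "csvw:titles": ["Foo Bar"]}]}

    HTTP/1.1 303 See Other
    Location: /data/my-series/release/2023/schemas/1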

Route:

GET /data/:series-slug/release/:release-slug/schemas Accept: application/ld+json

Note the release will also get the inverse triple:

</data/:series-slug/release/:release-slug> datahost:hasSchema </data/:series-slug/release/:releases-slug/schemas/:schema-id>

kiramclean commented 1 year ago

Just to be sure: you can't update a schema, only create one? So in this prototype the idea is that we only allow one schema to be associated with a single release, and schemas apply to all future releases going forward until another schema is added? I.e. there is only ever one schema in effect and it is the latest one? Can you backtrack on a schema? E.g. if there is a mistake in a schema, can you publish a new one to supersede the old one with the correction?

RickMoynihan commented 1 year ago

I think we can build the prototype so that you can replace a schema.

However, we can argue that the only reason you can do that is because, when we're making all these changes through the API, we're operating in one logical draftset (it just doesn't exist yet). We can argue that once we've integrated with drafter and draftsets, immutability would only be enforced "post publication".

Post publication, you can't backtrack on a schema.

We also need to preserve invariants like this:

  1. A user uploads a schema.
  2. A user puts some data into the release that conforms to the schema (we wouldn't let them publish data if it didn't).
  3. A user deletes the schema (still OK, as this is pre-publication). NOTE this implies schemas are optional whilst in a draft at least, and that you should allow data in a release without a schema. We could choose to fail publishes if releases don't contain a schema, or we could allow schemaless data (and display a warning to consumers). Either would be fine, as it would still enforce only growing change on users.
  4. At this point the data is schemaless but still exists.
  5. A user uploads a new schema (note also that steps 3-5 occur during a PUT, as it's just a DELETE and a PUT). At this point we MUST revalidate all the data in the release (including delete commits) against the schema. If and only if the data is valid against that schema should we allow that PUT (see the sketch below).

Hope that makes sense.
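
To make step 5 concrete, here is a minimal sketch of the validation gate in Clojure. The helpers (release-commits, row-valid?, replace-schema!) are hypothetical names for illustration, not existing datahost code:

    ;; Sketch only: the helper functions below are hypothetical placeholders.
    (defn put-schema!
      "Replace the release's schema, but only if every row in every
       commit (including delete commits) conforms to the new schema."
      [db release new-schema]
      (let [rows (mapcat :rows (release-commits db release))]
        (if (every? #(row-valid? new-schema %) rows)
          (do (replace-schema! db release new-schema)
              {:status 200})
          {:status 422
           :body "existing release data does not conform to the new schema"})))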

rosado commented 1 year ago

Did some thinking about this, and while I could just write something that mostly matches the description/comments so far, I'm having doubts as to whether that's the best direction (we can discuss on the ONS catch-up call).

In short: as a fallout of some other ideas, it's probably not the best idea to tie schemas to releases in the way presented above. If we made them standalone entities, that would open up the possibility of other releases re-using schemas (how, if at all, that's presented in the UI is of course out of scope for this project, but at least the option would be there).

rosado commented 1 year ago

The way releases and schemas should work, as outlined in this issue, is OK at first reading, but the more you think about the implementation, the more weaknesses you see.

So here's how it's supposed to work according to the GH issue:

  1. Create a release
  2. Add schema(s) to that release

Here are the issues with this approach:

  1. Releases are supposed to be tied to schema changes: schema change <==> new release. So we create a new release (and do all the associated backend & database activity required) just to create another release immediately after (because we just uploaded a schema, and that means "new release").
  2. What is the state of a freshly created release (before we add a schema)? Is it a release that nobody except the author can access? Is it actually a valid release or a placeholder? This is not captured in the design but comes up in the implementation.
  3. The issue mentions multiple schemas (though the prototype should focus on supporting a single schema only). How would that work? Does the upload of each schema trigger a new release? Or is there some kind of workflow that we're supposed to be doing behind the scenes, finally transitioning to "done" at some point?
  4. To partly answer the previous point: it seems like we should be doing something behind the scenes, namely re-validating all data against the new schema. This will be a relatively long-running operation, so what is the state of the release during that time?

I think most of the questions and unknowns can be dealt with by doing the following:

  1. Since schema change <==> new release, it makes most sense for the PUT /data/:series-slug/release/:release-slug route to accept schemas/schema references in the JSON-LD document (and the backend should disallow updating that piece of data within a release).
  2. If we allow schema references in the input JSON-LD document, we need a place for the actual schemas, hence an additional HTTP endpoint not tied to releases (the schemas could still be accessible under /data/:series-slug/release/:release-slug/schemas/* as a convenience).
  3. The backend should respond with a 202 Accepted status code, because we may have just triggered a re-validation of gigabytes of data, and the release is not actually created if the validation fails. We would also include a URL where the client can check the status of that task (see the sketch after this list). Alternatively, the endpoint is not "create release" but "submit new release request", so we can return 201, and perhaps decouple our URL paths from how data is kept in the DB, as Scott hinted in our meeting.
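
A hypothetical exchange for point 3 (the paths, task id and body shape are invented for illustration):

    PUT /data/my-series/release/2023 HTTP/1.1
    Content-Type: application/ld+json

    {"@id": "2023",
     "schemas": ["http://$HOST/data/schemas/schema1"]}

    HTTP/1.1 202 Accepted
    Location: /tasks/42

    {"status": "pending"}

The client would then poll GET /tasks/42 until the re-validation succeeds (release created) or fails (release rejected).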

Also, if we substitute revisions for releases and files (CSV etc.) for schemas, we see similar issues. But it looks like the pattern outlined above takes care of them, too.

Below is a sketch of what the endpoints and example document fragments would look like:

/data/files
    83bf7fcd913... (hash of file contents)
    fe7f1034d4a... (hash of file contents)

/data/schemas
    schema1
    schema2

/data/:series/release/:release/release1
    {"@id": "release1",
     "schemas": ["http://$HOST/data/schemas/schema1",
                 // inline schema
                 {"dh:columns": [{"csvw:datatype": "string",
                                  "csvw:name": "foo_bar",
                                  "csvw:titles": ["Foo Bar"]}]}]}

/data/:series/release/:release/revision/:revision/r1
    {"@id": "r1",
     "files": ["83bf7fcd913...", "fe7f1034d4a..."]}

Note: don't mind the actual contents of the file entries; the point is that they refer to resources under /data/files. I'm not sure at this point whether the contents should be full URIs or whether we should allow different kinds of references (e.g. full URIs, local references, etc.).

kiramclean commented 1 year ago

For now we only need to support a single schema per release, and we can require that the release is created first. This is meant to be captured in the proposed route (/data/:series-slug/release/:release-slug/schemas): you can't POST to a route for a release that doesn't exist; you need to know the release slug for the release you're creating a schema for.
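
For example (slugs invented), posting a schema against a release that hasn't been created yet should simply fail:

    POST /data/my-series/release/no-such-release/schemas
    => 404 Not Found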

To address your points from above:

  1. "new schema" means "new release" in terms of the commitments about the shape of the data. There's no requirement that they be created simultaneously.
  2. A new release without a schema just means that there are no commitments being made about the data -- it is a valid release it just provides no information about the shape of the data. The standard workflow for a user will be to create a release, then add a schema, then add some data that conforms to the schema. For this prototype we only need to support this simple workflow. We will handle cases where these creates happen out of order if we build a production version of this API later.
  3. No need to worry about multiple schemas, this is not supported. If users need to change the schema, they will create a new release and give it a new schema.
  4. This prototype is limited to very small datasets. If we do need to worry about re-validating data it won't take long enough to worry about the UX implications for the user.