RickMoynihan closed this issue 1 year ago
Just to be sure: You can't update a schema, only create one? So in this prototype the idea is we only allow one schema to be associated with a single release. And schemas apply to all future releases going forward, until another schema is added? i.e. there is only ever one schema in effect and it is the latest one? Can you backtrack on a schema? e.g. say there is a mistake in a schema, can you publish a new one to supersede the old one with the correction?
I think we can build the prototype so that you can replace a schema.
However, we can argue that the only reason you can do that is that, while we're making all these changes through the API, we're operating in one logical draftset (it just doesn't exist yet). We can argue that once we've integrated with drafter and draftsets, immutability would only be enforced "post publication".
Post publication you can't backtrack on a schema.
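As a rough illustration of that rule, here is a minimal Python sketch, assuming a release carries a simple published flag. The `Release` class, the `SchemaImmutableError` name and the `2023-q1` slug are all hypothetical and not part of the prototype; the point is only that replacing a schema is allowed before publication and rejected afterwards.

```python
class SchemaImmutableError(Exception):
    """Raised when a published release's schema would be changed."""


class Release:
    def __init__(self, slug):
        self.slug = slug
        self.schema = None
        self.published = False

    def set_schema(self, schema):
        # Replacing is fine while the release is still (logically) in a draft.
        if self.published:
            raise SchemaImmutableError(
                f"release {self.slug!r} is published; its schema is immutable"
            )
        self.schema = schema

    def publish(self):
        self.published = True


release = Release("2023-q1")  # hypothetical slug
release.set_schema({"dh:columns": [{"csvw:name": "foo_bar"}]})
# A correction before publishing simply replaces the earlier schema.
release.set_schema({"dh:columns": [{"csvw:name": "foo_baz"}]})
release.publish()
```

After `publish()`, any further `set_schema` call raises, so a mistake found post publication would need a new release rather than an edit.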
We also need to preserve invariants like this:
Hope that makes sense.
Did some thinking about this, and while I can just write something that mostly matches the description/comments so far, I'm having doubts as to whether that's going to be the best direction (we can discuss on the ONS catchup call).
In short: as a fallout of some other ideas, it's probably not the best idea to tie the schemas to releases in the way it is presented above. If we made them standalone entities, that would open the possibility of re-using schemas by other releases (how, if at all, that's presented in the UI is of course out of scope of this project, but at least the option would be there).
The way releases and schemas should work, as outlined in this issue, is OK at first reading, but the more you think about the implementation, the more weaknesses you see.
So here's how it's supposed to work according to the GH issue:
Here are the issues with this approach:
schema change <==> new release. So we create a new release (and do all the associated backend and database activity required) just to create another release immediately after (because we just uploaded a schema, and that means "new release").
I think most of the questions and unknowns can be dealt with by doing the following:
schema change <==> new release, it makes most sense for the PUT /data/:series-slug/release/:release-slug route to accept schemas/schema references in the JSON-LD document (and the backend should disallow updating that piece of data within a release).
/data/:series-slug/release/:release-slug/schemas/* as a convenience)
202 Accepted status code, because we may just have triggered a re-validation of gigabytes of data. And the release is not actually created if the validation fails. We would also include a URL where the client can check the status of that task. Alternatively, the endpoint is not a "create release" but a "submit new release request", and we can return 201, and perhaps decouple our URL paths from how data is kept in the DB, as Scott hinted in our meeting.
Also, if we substitute revisions for releases and files (CSV etc.) for schemas, we see similar issues. But it looks like the pattern outlined above takes care of them, too.
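To make the 202 Accepted idea a bit more concrete, here is a small framework-free Python sketch. It is only an illustration: the `/tasks/:id` status path, the in-memory `tasks` dict and the function names are assumptions of mine, not something agreed in this thread.

```python
import itertools

_task_ids = itertools.count(1)
tasks = {}  # task id -> task record (stands in for a real task store)


def submit_release(series_slug, release_slug, document):
    """Handle PUT /data/:series-slug/release/:release-slug (sketch)."""
    task_id = next(_task_ids)
    tasks[task_id] = {
        "state": "pending",
        "release": f"/data/{series_slug}/release/{release_slug}",
        "document": document,
    }
    # 202: validation has only been queued; the release does not exist yet.
    status_url = f"/tasks/{task_id}"  # hypothetical status path
    return 202, {"Location": status_url}, {"status": status_url}


def finish_validation(task_id, valid):
    """Record the outcome; a failed validation means no release is created."""
    tasks[task_id]["state"] = "created" if valid else "rejected"


status, headers, body = submit_release(
    "my-series", "release1", {"@id": "release1"}
)
```

The client would poll the returned status URL until the task leaves the pending state, and only then treat the release as existing.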
Below is a sketch of what the endpoints and example document fragments would look like:
/data/files
83bf7fcd913... (hash of file contents)
fe7f1034d4a... (hash of file contents)
/data/schemas
schema1
schema2
/data/:series/release/:release/release1
{"@id": "release1",
 "schemas": ["http://$HOST/data/schemas/schema1",
             // inline schema
             {"dh:columns": [{"csvw:datatype": "string",
                              "csvw:name": "foo_bar",
                              "csvw:titles": ["Foo Bar"]}]}]}
/data/:series/release/:release/revision/:revision/r1
{"@id": "r1",
 "files": ["83bf7fcd913...", "fe7f1034d4a..."]}
Note: don't mind the actual contents of the file entry; the point is that it refers to the resource under /data/files. I'm not sure at this point whether the contents should be full URIs or whether we should allow different kinds of references (e.g. full URIs, local references etc.).
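For what the "hash of file contents" identifiers could look like in practice, here is a minimal Python sketch. The choice of SHA-256 and the in-memory `files` dict are assumptions for illustration only; the thread does not say which hash function would be used.

```python
import hashlib

files = {}  # stands in for the /data/files collection


def store_file(contents: bytes) -> str:
    """Store contents under their hash and return that hash as the reference."""
    digest = hashlib.sha256(contents).hexdigest()  # SHA-256 is an assumption
    files[digest] = contents  # identical uploads collapse to one entry
    return digest


a = store_file(b"foo_bar\n1\n")
b = store_file(b"foo_bar\n2\n")
# A revision then only needs to list the references.
revision = {"@id": "r1", "files": [a, b]}
```

Because the key is derived from the contents, uploading the same file again returns the same reference, which is what makes sharing files between revisions cheap.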
For now we only need to support a single schema per release, and we can require that the release is created first. This is meant to be captured in the proposed route (/data/:series-slug/release/:release-slug/schemas): you can't post to a route for a release that doesn't exist, and you need to know the release slug for the release you're creating a schema for.
To address your points from above:
We should review the scratch schema code and integrate it as rest API routes.
Ultimately we may want to support many schemas being associated with a single release. However in this iteration of the prototype we can assume just one.
Multiple schemas may not be necessary, but they may more cleanly support people incrementally adding commitments, or layering other concerns (e.g. URI generation) onto existing datasets (provided all data within that release historically still conforms to the extra schema).
Implement POST / REDIRECT / GET pattern for creating schemas.
Route:
Note: normalising the incoming request body should add:
"@type" "dh:TableSchema" as a managed param to the table schema
"@type" "dh:DimensionColumn" to each column
"datahost:appliesToRelease" "</data/:series-slug/release/:release-slug>"
"appropriate-csvw:modeling-of-dialect" "UTF-8,RFC4180"
Server responds:
303 /data/:series-slug/release/:release-slug/schemas/:auto-incrementing-schema-id
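The normalisation step for the incoming schema body could look roughly like the Python sketch below. The function name and the slugs are hypothetical. The dialect entry is deliberately left out, since how it should be modelled in CSVW terms is still open in the notes above.

```python
import copy


def normalise_schema(body, series_slug, release_slug):
    """Return a copy of the request body with the managed params set."""
    schema = copy.deepcopy(body)
    # Managed params overwrite whatever the client sent.
    schema["@type"] = "dh:TableSchema"
    for column in schema.get("dh:columns", []):
        column["@type"] = "dh:DimensionColumn"
    schema["datahost:appliesToRelease"] = (
        f"/data/{series_slug}/release/{release_slug}"
    )
    return schema


incoming = {
    "dh:columns": [
        {
            "csvw:datatype": "string",
            "csvw:name": "foo_bar",
            "csvw:titles": ["Foo Bar"],
        }
    ]
}
normalised = normalise_schema(incoming, "my-series", "release1")
```

Working on a deep copy keeps the raw request body untouched, which may be handy if we want to log or echo back exactly what the client submitted.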
Route:
GET /data/:series-slug/release/:release-slug/schemas Accept: json+ld
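Putting the POST / REDIRECT / GET flow together, a framework-free Python sketch might look like this. The in-memory `schemas` and `releases` dicts, the function names and the single global id counter are assumptions for illustration. It also records the `datahost:hasSchema` link on the release, and refuses to create a schema for a release that does not exist yet.

```python
import itertools

schemas = {}   # schema URI -> stored schema document
releases = {}  # release URI -> release document
_schema_ids = itertools.count(1)  # one global counter, as an assumption


def post_schema(series_slug, release_slug, body):
    """POST .../schemas: store the schema, then redirect to it."""
    release_uri = f"/data/{series_slug}/release/{release_slug}"
    if release_uri not in releases:
        return 404, {}  # the release has to be created first
    schema_uri = f"{release_uri}/schemas/{next(_schema_ids)}"
    schemas[schema_uri] = dict(body)
    # Link recorded on the release, pointing back at the new schema.
    releases[release_uri]["datahost:hasSchema"] = schema_uri
    return 303, {"Location": schema_uri}


def get_schema(schema_uri):
    """GET on the redirect target."""
    if schema_uri not in schemas:
        return 404, None
    return 200, schemas[schema_uri]


releases["/data/my-series/release/release1"] = {"@id": "release1"}
status, headers = post_schema("my-series", "release1", {"dh:columns": []})
```

A client following the 303 issues a plain GET against the `Location` header, so refreshing the page never re-submits the schema.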
Note the release will also get the inverse triple:
</data/:series-slug/release/:release-slug> datahost:hasSchema </data/:series-slug/release/:release-slug/schemas/:schema-id>