Schema editor - Githubissues

amercader commented 2 years ago

Goal

Allow publishers to define the schema of tabular data as part of the resource creation process, internally generating a Table Schema that gets stored as the schema field

Prior work

@roll worked on an initial implementation a few years ago (ancient PR here: #25). It used tableschema-ui to render the UI, and under the hood tableschema-js to infer the data schema and generate a Table Schema object

https://user-images.githubusercontent.com/200230/171604305-a2b55f0b-35ca-4065-8934-d116f0303b76.mp4

Implementation options

UI-wise, it is understood that we need to update the component to use the new version, and that the UI/UX, form design, etc. definitely need to be improved, but we have different options for the schema inferring part.

Option 1: Keep the inferring in the client with tableschema-js

Pros:

Better UX as the schema can be modified before uploading the file
Easier to integrate in CKAN's resource creation flow, ie we use the component to generate a JSON Table Schema that directly gets submitted in the schema field
File size doesn't seem to be a concern: I tested an 800 MB file and the schema was inferred without issue. I assume it parses only a subset of the rows

Cons:

What are the plans for tableschema-js? Can we rely on it long term?
How good is the inferring? I assume most if not all recent work on this area has gone to frictionless-py
Would the schema generated by tableschema-js match the one generated by frictionless-py? Right now this is not important but I can imagine us having to implement some sort of server-side inferring for background jobs, etc, could we find inconsistencies between schemas generated by the two systems?

Option 2: Use frictionless-py for the inferring

This of course requires the file to be uploaded to the server, as I don't think WASM-based solutions are ready for general production use.

Pros:

We focus our efforts on just one Frictionless library (frictionless-py), the one that is arguably better supported

Cons:

2-step process for creating a resource (3 if we count the previous dataset metadata step), file needs to be uploaded first, and then the schema can be returned to the user for tweaking.

Option 2a: Create the resource, infer the schema later

Users would create a resource normally and, once it is created, we would infer the schema, redirect the user to a new step with the schema editor and allow them to tweak it further (but at this stage the inferred schema could already be stored in the created resource)

Option 2b: Upload the file first, infer the schema, create the resource later

This would be difficult to implement because right now uploads are closely tied to the actual resource, but we can imagine an implementation where the file is uploaded first (or linked) and stored somewhere temporary, we run the inferring and return the result to the user, who then proceeds to create the resource, which is somehow linked to the uploaded file

amercader commented 1 year ago

@roll, any thoughts on tableschema-js vs frictionless-py? (see above)

roll commented 1 year ago

Hi @amercader,

I would vote for the frictionless-py way, as tableschema-js is more or less in maintenance-only mode, and realistically OKFN will not be able to support it long-term. Actually, it has already been moved from our core products to the "Universe" and was maintained by Datopian (a little bit).

Technically, my suggestion would be:

creating an endpoint that accepts a url or a file sample and returns a Resource descriptor inferred by frictionless-py.
on CKAN UI we can act on File Upload change event for it. So the user will get Resource/Schema editor during the main resource creation step

I think a one-step arch is more promising as it might be used later to provide types for Data Pusher / Indexer. Although it needs to be investigated regarding compatibility with Excel/etc files

PS. Regarding the UI, I think we need to wait a few months for the new generation of Frictionless Components to be released

amercader commented 1 year ago

@roll sorry, revisiting this after a while. When you say

creating an endpoint that accepts a url or a file sample and returns a Resource descriptor inferred by frictionless-py.

on CKAN UI we can act on File Upload change event for it. So the user will get Resource/Schema editor during the main resource creation step

do you mean the following:

User clicks on "Upload" and selects a file
We listen to the file input event, and if it's a suitable file (ie tabular) we do a background HTTP request sending a sample of the file (or all of it if it's small enough) to an endpoint that takes a sample of tabular data and outputs a Table Schema descriptor
With the returned Table Schema descriptor we render the Schema Editor component

So essentially this is option 2c: upload a sample of the file, infer the schema, create the resource (and upload the file). Conceptually it doesn't seem far from 2b, but without the complexities of refactoring the whole CKAN upload process, so it can be a first step. Any thoughts on how big a sample we should upload to have reliable results?

"Although it needs to be investigated regarding compatibility with Excel/etc files"

What do you mean, that for Excel files we might not be able to get a sample?

roll commented 1 year ago

@amercader Yes, it's a good flow description :+1: By default, frictionless uses quite minimal settings:

buffer size for encoding inference - 10 000 bytes
sample size for dialect/schema inference - 100 lines

In most cases, it works fine, and the user will be able to tweak the results anyway.
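To make the sample-based flow a bit more concrete, here is a rough sketch of inferring a Table Schema descriptor from the first 100 rows of a CSV sample. It uses only the standard library as a stand-in: a real endpoint would call frictionless-py instead, and this is not how frictionless detects types. The names `infer_schema` and `_guess_type` are made up for illustration, and only `integer`, `number` and `string` are handled.

```python
import csv
import io

SAMPLE_SIZE = 100  # rows used for inference, mirroring the default quoted above


def _guess_type(values):
    """Return a Table Schema type name for one column's sampled values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the strictest type first and fall back to the next one on failure.
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for value in non_empty:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(sample_text, sample_size=SAMPLE_SIZE):
    """Build a minimal Table Schema descriptor from a CSV sample (hypothetical helper)."""
    lines = sample_text.splitlines(keepends=True)
    # A byte-limited sample may end mid-row, so drop a trailing partial line.
    # (Assumption: no quoted fields containing newlines.)
    if lines and not lines[-1].endswith("\n"):
        lines = lines[:-1]
    reader = csv.reader(io.StringIO("".join(lines)))
    header = next(reader, None)
    if header is None:
        return {"fields": []}
    # Only the first `sample_size` data rows take part in the inference.
    rows = [row for _, row in zip(range(sample_size), reader)]
    fields = []
    for index, name in enumerate(header):
        column = [row[index] for row in rows if index < len(row)]
        fields.append({"name": name, "type": _guess_type(column)})
    return {"fields": fields}
```

Because only the sampled rows are inspected, a value that breaks the pattern later in the file goes unnoticed, which is why the editor step where the user can correct the result still matters.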

Regarding Excel, I think it will require sending the whole file to the server (or reading it client-side) just because of the format structure (ZIP index written at the end). I guess Excel is not so sensitive to the size problem, as really big data is usually in CSV.
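The ZIP point can be checked with a small standard-library experiment. The sketch below builds a tiny archive laid out like an .xlsx container (it is not a valid workbook, and `build_fake_xlsx` / `can_open_as_zip` are hypothetical helper names) and then tries to open only the first half of its bytes, which is what a "first N bytes" sample would contain.

```python
import io
import zipfile


def build_fake_xlsx():
    """Return the bytes of a small ZIP archive laid out like an .xlsx container."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # Placeholder content only; a real workbook has several XML parts.
        archive.writestr("xl/worksheets/sheet1.xml", "<row/>" * 5000)
    return buffer.getvalue()


def can_open_as_zip(data):
    """Report whether the given bytes can be opened as a ZIP archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            archive.namelist()
        return True
    except zipfile.BadZipFile:
        return False


full = build_fake_xlsx()
head = full[: len(full) // 2]  # a sample taken from the start of the file

print(can_open_as_zip(full))
print(can_open_as_zip(head))
```

The full archive opens, while the truncated one is rejected because the index at the end of the file is missing, so a head-of-file sample is not enough for Excel uploads.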

amercader commented 1 year ago

Revised implementation plan after discussion with @aivuk

When the user selects a tabular file, we upload it in the background using a custom endpoint that:
- Creates a new resource with just the uploaded file
- Infers the schema using the whole uploaded file
- Returns the new resource_id and the inferred schema
The user can keep entering the rest of the fields and when we get the inferred schema, we update the UI to show a preview and the schema editor
When the user clicks "Save" (or "Save and add another") we call another custom endpoint that calls resource_patch on the previously created resource with the rest of the values sent.
If the user clicks "Cancel" (or leaves the page?) we delete the resource and the file
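As a way to reason about the lifecycle above, here is a minimal in-memory sketch of the upload / save / cancel steps. Everything in it is an assumption for illustration: `DraftResourceStore` and its methods are hypothetical, the dictionary stands in for CKAN's storage, and a real implementation would go through CKAN actions (for example `resource_patch` on save) rather than mutating a dict.

```python
import uuid


class DraftResourceStore:
    """In-memory stand-in for the two hypothetical custom endpoints."""

    def __init__(self, infer_schema):
        # infer_schema: callable taking the file content and returning a schema dict
        self._infer_schema = infer_schema
        self.resources = {}

    def upload(self, filename, content):
        """Create a resource holding only the file and return its id and inferred schema."""
        resource_id = str(uuid.uuid4())
        schema = self._infer_schema(content)
        self.resources[resource_id] = {
            "name": filename,
            "upload": content,
            "schema": schema,
            "draft": True,
        }
        return {"resource_id": resource_id, "schema": schema}

    def save(self, resource_id, values):
        """Patch the previously created resource with the rest of the form values."""
        resource = self.resources[resource_id]
        resource.update(values)
        resource["draft"] = False
        return resource

    def cancel(self, resource_id):
        """Delete the resource and its file when the user backs out."""
        self.resources.pop(resource_id, None)
```

One open question this makes visible is the "leaves the page?" case: nothing calls `cancel` there, so draft resources would need some other clean-up, such as a periodic job that removes stale drafts.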

ckan / ckanext-validation

Schema editor #65
