ckan / ckanext-validation

CKAN extension for validating Data Packages using Table Schema.
MIT License

Schema editor #65

Open amercader opened 2 years ago

amercader commented 2 years ago

Goal

Allow publishers to define the schema of tabular data as part of the resource creation process, internally generating a Table Schema that gets stored as the schema field

Prior work

@roll worked on an initial implementation a few years ago (ancient PR here: #25). It used tableschema-ui to render the UI, and under the hood tableschema-js to infer the data schema and generate a Table Schema object

https://user-images.githubusercontent.com/200230/171604305-a2b55f0b-35ca-4065-8934-d116f0303b76.mp4

Implementation options

UI-wise, it is understood that we need to update the component to use the new version, and that the UI/UX, form design, etc. definitely need to be improved, but we have different options for the schema inferring part.

Option 1: Keep the inferring in the client with tableschema-js

Pros:

Cons:

Option 2: Use frictionless-py for the inferring

This of course requires the file to be uploaded to the server, as I don't think WASM-based solutions are ready for general production use.

Pros:

Cons:

Option 2a: Create the resource, infer the schema later

Users would create a resource normally and, once it is created, we would infer the schema, redirect the user to a new step with the schema editor and allow them to tweak it further (but at this stage the inferred schema could already be stored in the created resource).

Option 2b: Upload the file first, infer the schema, create the resource later

This would be difficult to implement because right now uploads are closely tied to the actual resource, but we can imagine an implementation where the file is uploaded first (or linked) and stored somewhere temporary, we run the inferring and return the result to the user, who then proceeds to create the resource, which is somehow linked to the uploaded file.

amercader commented 1 year ago

@roll, any thoughts on tableschema-js vs frictionless-py? (see above)

roll commented 1 year ago

Hi @amercader,

I would vote for the frictionless-py way, as tableschema-js is more or less in maintenance-only mode, and a more realistic understanding of the situation is that OKFN will not be able to support it long-term. Actually, it has already been moved to the "Universe" from our core products and was maintained by Datopian (a little bit).

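Technically, my suggestion would be:

  • creating an endpoint that accepts a url or a file sample and returns a Resource descriptor inferred by frictionless-py.
  • on CKAN UI we can act on File Upload change event for it. So the user will get Resource/Schema editor during the main resource creation step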

I think a one-step arch is more promising as it might be used later to provide types for Data Pusher / Indexer. Although it needs to be investigated regarding compatibility with Excel/etc files

PS. Regarding the UI, I think we need to wait a few months for the new generation of Frictionless Components to be released.

amercader commented 1 year ago

@roll sorry, revisiting this after a while. When you say

  • creating an endpoint that accepts a url or a file sample and returns a Resource descriptor inferred by frictionless-py.
  • on CKAN UI we can act on File Upload change event for it. So the user will get Resource/Schema editor during the main resource creation step

do you mean the following:

  1. User clicks on "Upload" and selects a file
  2. We listen to the file input event, and if it's a suitable file (i.e. tabular) we do a background HTTP request sending a sample of the file (or all of it if it's small enough) to an endpoint that takes a sample of tabular data and outputs a Table Schema descriptor
  3. With the returned Table Schema descriptor we render the Schema Editor component

So essentially it is option 2c: Upload a sample of the file, infer the schema, create the resource (and upload the file). Conceptually it doesn't seem far from 2b, but without the complexities of refactoring the whole CKAN upload process, so it can be an initial step. Any thoughts on how big a sample we should upload to get reliable results?

"Although it needs to be investigated regarding compatibility with Excel/etc files"

What do you mean, that for Excel files we might not be able to get a sample?
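
To make the sample-based flow above more concrete, here is a minimal sketch of what the server side of such an endpoint might do with a CSV sample. It uses only the Python standard library and a deliberately naive type guess, not frictionless-py's actual inference; the function names `infer_type` and `infer_schema_from_sample` are hypothetical. The point it illustrates is that a byte sample can end mid-row, so the trailing partial line is dropped before inferring.

```python
import csv
import io


def infer_type(values):
    """Guess a Table Schema type name that fits every non-empty value."""
    values = [value for value in values if value != ""]
    if not values:
        return "string"
    for type_name, cast in (("integer", int), ("number", float)):
        try:
            for value in values:
                cast(value)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema_from_sample(sample: bytes, complete: bool = False) -> dict:
    """Build a Table Schema-like descriptor from the first bytes of a CSV file.

    Assumption: the sample is UTF-8 and comma-delimited with a header row.
    """
    # A byte sample can cut a multi-byte character in half, so ignore errors.
    text = sample.decode("utf-8", errors="ignore")
    if not complete:
        # The sample may end mid-row: keep only whole lines.
        text = text[: text.rfind("\n") + 1]
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return {"fields": []}
    header, data = rows[0], rows[1:]
    fields = []
    for index, name in enumerate(header):
        column = [row[index] for row in data if index < len(row)]
        fields.append({"name": name, "type": infer_type(column)})
    return {"fields": fields}


# The last row is truncated, as it would be in a partial upload.
sample = b"id,name,price\n1,apple,0.5\n2,pear,1\n3,ban"
print(infer_schema_from_sample(sample))
```

In a real implementation the inference step would presumably be delegated to frictionless-py, and the editor would let the user correct any wrong guesses.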

roll commented 1 year ago

@amercader Yes, it's a good flow description :+1: By default, frictionless uses quite a minimalistic:

In most cases, it works fine, and the user will be able to tweak the results anyway.

Regarding Excel, I think it will require sending the whole file to the server (or reading it client-side) just because of the format structure (ZIP index written at the end). I guess Excel is not so sensitive to the size problem, as really big data is usually in CSV.
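
The ZIP point can be checked with a small standard-library experiment. This is only an illustration: the archive below is a stand-in with one made-up member name, not a real .xlsx file, and the helper `can_open` is hypothetical. It shows that the first half of a ZIP archive cannot be opened, because the index (the central directory) sits at the end of the file.

```python
import io
import zipfile

# Build a small ZIP archive in memory (xlsx files are ZIP archives).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("xl/worksheets/sheet1.xml", "<worksheet/>" * 100)
full = buffer.getvalue()


def can_open(data: bytes) -> bool:
    """Return True if the bytes can be opened as a ZIP archive."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            archive.namelist()
        return True
    except zipfile.BadZipFile:
        return False


# A "sample" made of the first half of the file lacks the trailing index.
head = full[: len(full) // 2]

print(can_open(full))
print(can_open(head))
```

With a CSV file the same truncation only costs the last partial row, which is why sampling works there but not for Excel.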

amercader commented 1 year ago

Revised implementation plan after discussion with @aivuk

  1. When the user selects a tabular file, we upload it in the background using a custom endpoint that:
    • Creates a new resource with just the uploaded file
    • Infers the schema using the whole uploaded file
    • Returns the new resource_id and the inferred schema
  2. The user can keep entering the rest of the fields and when we get the inferred schema, we update the UI to show a preview and the schema editor
  3. When the user clicks "Save" (or "Save and add another") we call another custom endpoint that calls resource_patch on the previously created resource with the rest of the values sent.
  4. If the user clicks "Cancel" (or leaves the page?) we delete the resource and the file
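
The plan above could be sketched roughly as follows. This is not the extension's code: the three function names are hypothetical, and `get_action` and `infer_schema` are passed in as parameters so the sketch stays independent of CKAN and frictionless-py. In a CKAN extension, `get_action` would presumably be `ckan.plugins.toolkit.get_action`, whose actions are called as `action(context, data_dict)`; `resource_create`, `resource_patch` and `resource_delete` are existing CKAN actions.

```python
from typing import Any, Callable, Dict

DataDict = Dict[str, Any]
Action = Callable[[DataDict, DataDict], Any]
GetAction = Callable[[str], Action]


def upload_and_infer(
    get_action: GetAction,
    infer_schema: Callable[[Any], DataDict],
    context: DataDict,
    package_id: str,
    upload: Any,
) -> DataDict:
    """Step 1: create a resource with just the file, then infer its schema."""
    resource = get_action("resource_create")(
        context, {"package_id": package_id, "upload": upload}
    )
    # Assumption: infer_schema wraps whatever inference backend is chosen.
    schema = infer_schema(upload)
    return {"resource_id": resource["id"], "schema": schema}


def save_resource(
    get_action: GetAction, context: DataDict, resource_id: str, form_values: DataDict
) -> DataDict:
    """Step 3: patch the previously created resource with the remaining fields."""
    return get_action("resource_patch")(context, {"id": resource_id, **form_values})


def cancel_resource(get_action: GetAction, context: DataDict, resource_id: str) -> None:
    """Step 4: remove the resource that was created in the background."""
    get_action("resource_delete")(context, {"id": resource_id})
```

One open point this makes visible is step 4: if the user simply leaves the page, nothing calls `cancel_resource`, so some separate cleanup of orphaned resources would probably be needed.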