amercader opened this issue 2 years ago
@roll, any thoughts on tableschema-js vs frictionless-py? (see above)
Hi @amercader,

I would vote for the frictionless-py way, as tableschema-js is more or less in maintenance-only mode, and a realistic assessment of the situation is that OKFN will not be able to support it long-term. It has actually already been moved from our core products to the "Universe", and was maintained by Datopian (a little bit).
Technically, my suggestion would be frictionless-py. I think a one-step architecture is more promising, as it might later be used to provide types for the Data Pusher / Indexer. It still needs to be investigated regarding compatibility with Excel and similar file formats, though.
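For context, the Resource descriptor frictionless-py infers is a plain JSON object whose `schema` key follows the Table Schema spec. A minimal illustrative example (the field names and path here are made up) might look like:

```json
{
  "name": "my-resource",
  "path": "data.csv",
  "schema": {
    "fields": [
      {"name": "id", "type": "integer"},
      {"name": "title", "type": "string"},
      {"name": "created", "type": "date"}
    ]
  }
}
```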
PS. Regarding the UI, I think we need to wait a few months for the new generation of Frictionless Components to be released.
@roll sorry, revisiting this after a while. When you say:
> - creating an endpoint that accepts a url or a file sample and returns a Resource descriptor inferred by frictionless-py.
> - on the CKAN UI we can act on the File Upload change event for it. So the user will get the Resource/Schema editor during the main resource creation step
do you mean the following:
So essentially this is option 2c: upload a sample of the file, infer the schema, then create the resource (and upload the file). Conceptually it doesn't seem far from 2b, but without the complexity of refactoring the whole CKAN upload process, so it could be a first initial step. Any thoughts on how big a sample we should upload to get reliable results?
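To make the sample-size question concrete, here is a rough stdlib-only sketch (not frictionless-py's actual implementation, just a toy stand-in) of inferring column types from only the first N rows of a CSV, showing how a too-small sample can mislabel a column:

```python
import csv
import io

def infer_types(csv_text: str, sample_rows: int = 100) -> dict:
    """Infer a naive type for each column from the first `sample_rows` rows.

    A toy stand-in for schema inference: a real library (e.g. frictionless-py)
    handles many more types, dialects and edge cases.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    types: dict = {}
    for i, row in enumerate(reader):
        if i >= sample_rows:
            break  # only the sample is inspected; later rows may contradict it
        for field, value in row.items():
            guess = "string"
            try:
                int(value)
                guess = "integer"
            except ValueError:
                try:
                    float(value)
                    guess = "number"
                except ValueError:
                    pass
            # widen the type when rows disagree (integer -> number -> string)
            order = ["integer", "number", "string"]
            prev = types.get(field, "integer")
            types[field] = max(prev, guess, key=order.index)
    return types

data = "id,price\n1,10\n2,19.5\n3,n/a\n"
print(infer_types(data, sample_rows=1))    # {'id': 'integer', 'price': 'integer'}
print(infer_types(data, sample_rows=100))  # {'id': 'integer', 'price': 'string'}
```

With a one-row sample the `price` column is wrongly inferred as `integer`; only a larger sample reveals the `19.5` and `n/a` values, which is the tradeoff behind choosing the sample size.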
> Although it needs to be investigated regarding compatibility with Excel/etc files

What do you mean, that for Excel files we might not be able to get a sample?
@amercader Yes, that's a good description of the flow :+1: By default, frictionless uses quite a minimalistic sample. In most cases it works fine, and the user will be able to tweak the results anyway.
Regarding Excel, I think it will require sending the whole file to the server (or reading it client-side), simply because of the format's structure (the ZIP index is written at the end of the file). I guess Excel is not so sensitive to the size problem, as really big data usually comes as CSV.
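The point about the ZIP index can be shown with the stdlib alone: an .xlsx file is a ZIP archive, and ZIP places its central directory and end-of-central-directory record (signature `PK\x05\x06`) at the end of the file, so the member list cannot be read from a leading byte-range sample:

```python
import io
import zipfile

# Build a tiny in-memory ZIP, structurally like an .xlsx container
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("xl/worksheets/sheet1.xml", "<worksheet/>")
data = buf.getvalue()

# Local file headers (PK\x03\x04) come first; the end-of-central-directory
# record (PK\x05\x06) sits at the very end, which is why the first N bytes
# of an xlsx are not enough to even list its contents.
assert data.index(b"PK\x03\x04") == 0
assert data.rindex(b"PK\x05\x06") == len(data) - 22  # EOCD record is 22 bytes
```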
Revised implementation plan after discussion with @aivuk

Goal

Allow publishers to define the schema of tabular data as part of the resource creation process, internally generating a Table Schema that gets stored as the `schema` field.

Prior work
@roll worked on an initial implementation a few years ago (ancient PR here: #25). It used tableschema-ui to render the UI and, under the hood, tableschema-js to infer the data schema and generate a Table Schema object.
https://user-images.githubusercontent.com/200230/171604305-a2b55f0b-35ca-4065-8934-d116f0303b76.mp4
Implementation options
UI-wise, it is understood that we need to update the component to use the new version, and that the UI/UX, form design, etc. definitely need to be improved, but we have different options for the schema-inferring part.
Option 1: Keep the inferring in the client with tableschema-js
Pros:
- `schema` field

Cons:
Option 2: Use frictionless-py for the inferring
This of course requires the file to be uploaded to the server, as I don't think WASM-based solutions are ready for general production use.
Pros:
Cons:
Option 2a: Create the resource, infer the schema later
Users would create a resource normally, and once it is created we would infer the schema, redirect the user to a new step with the schema editor, and allow them to tweak it further (though at this stage the inferred schema could already be stored in the created resource).
Option 2b: Upload the file first, infer the schema, create the resource later
This would be difficult to implement because right now uploads are closely tied to the actual resource. But we can imagine an implementation where the file is uploaded first (or linked) and stored somewhere temporary, we run the inferring and return the result to the user, and the user then proceeds to create the resource, which is somehow linked to the uploaded file.
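As a sketch of what option 2b's server-side flow could look like (every name here is hypothetical, the dict stands in for real temporary blob storage, and the inference step is a toy placeholder for a frictionless-py call):

```python
import uuid

TEMP_UPLOADS: dict = {}  # stand-in for a temporary upload store

def upload_temp_file(content: bytes) -> str:
    """Step 1: store the file before any resource exists; return a handle."""
    key = str(uuid.uuid4())
    TEMP_UPLOADS[key] = content
    return key

def infer_schema(key: str) -> dict:
    """Step 2: run inference on the stored file.

    Toy version that only reads header names; a real implementation
    would call frictionless-py here and return a full Table Schema.
    """
    header = TEMP_UPLOADS[key].decode().splitlines()[0]
    return {"fields": [{"name": name, "type": "any"} for name in header.split(",")]}

def create_resource(key: str, schema: dict) -> dict:
    """Step 3: create the resource, linked to the already-uploaded file."""
    return {"upload_key": key, "schema": schema}

key = upload_temp_file(b"id,title\n1,hello\n")
resource = create_resource(key, infer_schema(key))
print(resource["schema"]["fields"])
# [{'name': 'id', 'type': 'any'}, {'name': 'title', 'type': 'any'}]
```

The key design point is that the upload handle, not the resource, is what the inference step operates on, so the resource record is only created after the user has reviewed the schema.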