frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Proposal for CLI redesign #1162

Open vlcinsky opened 2 years ago

vlcinsky commented 2 years ago

I have to admit that this started as a request for a CLI option to validate only a selected resource defined in a package (and to perform similar sub-resource-oriented actions), but it turned into a request to redesign the CLI, which is not a small thing.

Please bear with me. Frictionless Data seems to be a dream come true for me (I had been searching for something like it for years); on the other hand, the CLI has confused me too many times. As I have been building CLI tools for many years, I have tried to describe an alternative that could remove some of the problems I have experienced.

Real world example: validate single resource from a package

I have 24 CSV files, some of them rather long, and want to describe them as a package. The structure is complex: there are constraints, primary keys, and foreign keys.

Running frictionless describe *.CSV > package.yaml creates one large package descriptor, but to fine-tune the resource definitions I need to validate them repeatedly.

Today it is only possible to validate the whole package, which takes time. It would be very helpful if I could validate only a selected resource defined within the package, e.g. with: frictionless validate --resource countries package.yaml

Note that because the current CLI design tries to assume or detect many things, it is not always clear what is going to happen, e.g.:

Gallery of similar scenarios

Aspects of CLI call

Each CLI call must answer, explicitly or implicitly, a set of questions:

Personally, I have been surprised a few times:

I understand the intention to provide an "easy to use and intuitive tool", but in practice auto-detection can cause confusion, which ends up making things more complicated and less predictable.

Technical options for CLI

The first option is the click library, which is already in use. click allows nested sub-commands with no real limit on nesting depth. This should be expressive enough to capture all the required input information. Another advantage of more specific (sub)commands is that they can be stricter about the input they accept and can report errors more specifically, pointing directly at the thing to fix.
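To make the nested sub-command idea concrete, here is a minimal sketch using click groups. The command layout (`resource validate`), the `--name` option, and the echoed message are hypothetical illustrations, not the existing frictionless CLI or a layout agreed on in this issue; the body is a stub that does not call into frictionless.

```python
import click


@click.group()
def cli():
    """Hypothetical redesigned entry point (not the actual frictionless CLI)."""


@cli.group()
def resource():
    """Commands that act on a single resource (hypothetical grouping)."""


@resource.command()
@click.argument("descriptor", type=click.Path())
@click.option("--name", required=True, help="Resource name inside the package.")
def validate(descriptor, name):
    """Validate one named resource from a package descriptor."""
    # Stub only: a real implementation would call into frictionless here.
    click.echo(f"would validate resource {name!r} from {descriptor}")


if __name__ == "__main__":
    cli()
```

Because `--name` is declared as required on this specific sub-command, click itself rejects a call that omits it and prints a usage error for that sub-command, which is the kind of strict, targeted complaint described above.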

Another option is to follow the current frictionless transform, where a pipeline can state very specifically what shall be done. However, I am afraid this approach would be less user-friendly, as it requires learning how to define the pipeline.

The last option is a (sub)resource addressing scheme, similar to how pytest specifies which test to run, e.g. pytest test_mod.py::test_func. A similar approach could be used to address a tabular resource defined within a tabular data package.
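As an illustration of the pytest-like addressing idea, the sketch below splits an address such as `package.yaml::countries` into a descriptor path and an optional resource name. The `::` separator is borrowed from pytest; the `Target` type and `parse_target` function are hypothetical names introduced here, not part of frictionless.

```python
from typing import NamedTuple, Optional


class Target(NamedTuple):
    """A descriptor path plus an optional resource name (hypothetical type)."""

    descriptor: str
    resource: Optional[str]


def parse_target(address: str) -> Target:
    """Split 'package.yaml::countries' into its descriptor and resource parts."""
    descriptor, separator, resource = address.partition("::")
    if not descriptor:
        raise ValueError(f"missing descriptor path in {address!r}")
    if separator and not resource:
        # 'package.yaml::' is rejected rather than silently meaning "everything".
        raise ValueError(f"missing resource name after '::' in {address!r}")
    return Target(descriptor, resource or None)


if __name__ == "__main__":
    print(parse_target("package.yaml::countries"))
    print(parse_target("package.yaml"))
```

An address without `::` refers to the whole package, while a trailing `::` with no name is treated as an error, so the tool never has to guess what the user meant.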

Some CLI examples

Here are some examples of an alternative CLI design. It builds on:

The infer variants would become really format specific:

What to do next

The proposal above is definitely not complete (api, summary and transform are missing), but it should be enough for a first evaluation of whether the approach seems reasonable.

If you agree, I could contribute a stub click command implementation to demonstrate it; I think that would be very instructive for users.

roll commented 2 years ago

Hi @vlcinsky,

Thanks for a great and detailed issue! I don't think we can introduce such a breaking change for the whole CLI, so might an alternative CLI runner package be an option?

vlcinsky commented 2 years ago

Do you mean a separate Python package (installed independently of frictionless-py) or an alternative CLI within the existing package?

You are right, the change is extensive. In the long term, I could imagine that we start with an alternative CLI within frictionless-py, keep it marked as experimental or beta for a while until it matures, and finally deprecate the existing CLI to keep things manageable.

roll commented 2 years ago

Yes, I am thinking of an additional CLI runner package, similar to JS projects like webpack-cli. It might be a good first step to test the idea.

fjuniorr commented 1 year ago

Today it is only possible to validate the whole package, which takes time. It would be very helpful if I could validate only a selected resource defined within the package, e.g. with: frictionless validate --resource countries package.yaml

Just wanted to point out that after https://github.com/frictionlessdata/framework/pull/1112 it's possible to validate a single resource from a data package with

frictionless validate --json --resource-name foo datapackage.json

This will correctly identify any validation errors arising from foreign key constraints.