datahq / dataflows

DataFlows is a simple, intuitive lightweight framework for building data processing flows in python.
https://dataflows.org
MIT License

Improve validate processor #171

Open pwalsh opened 2 years ago

pwalsh commented 2 years ago

DF.validate() does some basic checks but doesn't validate everything that is possible based on Table Schema. In particular, it does not validate primary keys, and we have noted that this creates other, currently untraced bugs (e.g.: load from a package with invalid primary keys and try to dump it again; the resulting package will be incomplete).
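To make the failure mode concrete, here is a plain-Python sketch (not dataflows code, and only one plausible mechanism for the untraced bug) of how a dumper that keys output rows by primary key would silently lose rows when a loaded package contains duplicate keys:

```python
def dump_rows(rows, primary_key):
    """Illustrative only: a dumper that indexes rows by their primary key
    silently overwrites earlier rows on key collision, so a package with
    invalid (duplicate) primary keys round-trips as an incomplete package."""
    out = {}
    for row in rows:
        key = tuple(row[field] for field in primary_key)
        out[key] = row  # a duplicate key replaces the previous row
    return list(out.values())

rows = [
    {"id": 1, "value": "a"},
    {"id": 1, "value": "b"},  # duplicate primary key
    {"id": 2, "value": "c"},
]
dumped = dump_rows(rows, primary_key=["id"])
assert len(dumped) == 2  # one of the three input rows was silently dropped
```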

We need to explore our options here; one candidate is adopting Frictionless for validation.

The problem with adopting Frictionless is that, AFAIK, it can't be incrementally adopted: the validation is built into the Resource class, and I don't know just from reading the code where that leads (i.e. if/how it complicates our code when we use different libraries for managing Frictionless Data specs). Also, it keeps state in memory (the seen data for primary keys and foreign keys); based on other patterns in Dataflows, I'd guess we would want to store that data outside of the running Python process (e.g. using https://github.com/akariv/kvfile).
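On the out-of-process state point, a minimal sketch of the idea, using stdlib sqlite3 purely as a stand-in for an external key-value store like kvfile (the class name and API here are invented for illustration):

```python
import sqlite3

class SeenKeyStore:
    """Hypothetical sketch: track seen primary-key tuples in a SQLite
    database (on disk if given a path) instead of a set held in memory."""

    def __init__(self, path=":memory:"):
        self.db = sqlite3.connect(path)
        self.db.execute("CREATE TABLE IF NOT EXISTS seen (key TEXT PRIMARY KEY)")

    def add(self, key_values):
        """Record a key tuple; return False if it was already seen."""
        key = "\x1f".join(str(v) for v in key_values)  # unit-separator join
        try:
            with self.db:  # commit on success, roll back on error
                self.db.execute("INSERT INTO seen (key) VALUES (?)", (key,))
            return True
        except sqlite3.IntegrityError:
            return False

store = SeenKeyStore()
assert store.add(("a", 1)) is True
assert store.add(("a", 1)) is False  # duplicate primary key detected
```

Passing a real file path instead of ":memory:" moves the seen-key set out of process memory entirely, which is the property the kvfile suggestion is after.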

pwalsh commented 2 years ago

Currently known issues:

  1. Does not validate primary keys
  2. Does not validate foreign keys
  3. If field.format is None (which is an invalid value according to the spec), validation passes, but it then fails in dump_to_sql
  4. Does not validate field.constraints (e.g.: unique)
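For reference, a hedged sketch of what the missing checks in points (1) and (4) amount to (plain Python, in-memory, function and parameter names invented for illustration): a streaming validator for primary keys and the `unique` constraint:

```python
def validate_rows(rows, primary_key, unique_fields=()):
    """Hypothetical sketch of checks dataflows' validate() currently skips:
    null or duplicate primary keys, and the `unique` field constraint."""
    seen_pk = set()
    seen_unique = {field: set() for field in unique_fields}
    for i, row in enumerate(rows):
        pk = tuple(row[field] for field in primary_key)
        if any(v is None for v in pk):
            raise ValueError(f"row {i}: null value in primary key {pk}")
        if pk in seen_pk:
            raise ValueError(f"row {i}: duplicate primary key {pk}")
        seen_pk.add(pk)
        for field in unique_fields:
            if row[field] in seen_unique[field]:
                raise ValueError(f"row {i}: duplicate value in unique field {field!r}")
            seen_unique[field].add(row[field])
        yield row

ok = list(validate_rows([{"id": 1}, {"id": 2}], primary_key=["id"]))
assert [r["id"] for r in ok] == [1, 2]
```

Foreign keys (point 2) would additionally need access to the referenced resource's rows, so they are left out of this sketch.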
akariv commented 2 years ago

IIRC all of these validations are done in the underlying libraries, so we might need to fix that there. There's a long-standing issue about moving dataflows to use frictionless instead of tabulator/datapackage etc., so this might also be a good motivator for that.

pwalsh commented 2 years ago

Looks like the only data validation is done via tableschema.Field.cast_value:

https://github.com/datahq/dataflows/blob/400b96f3bbaff8092f847e1eaa04ac34db42e031/dataflows/base/schema_validator.py#L73

As that only checks field values, points (1) and (2) in https://github.com/datahq/dataflows/issues/171#issuecomment-922439773 are not checked at all. For point (3) I'm not sure what is going on; I will need to create a failing test. For point (4), cast_value has an unusual signature: if constraints is True (the default), it does not check constraints, so that is also an issue.

https://github.com/frictionlessdata/tableschema-py/blob/main/tableschema/field.py#L138

These are all easily addressed, but I agree it may be a good motivator to explore moving this area to frictionless.