datahq / dataflows

DataFlows is a simple, intuitive, lightweight framework for building data processing flows in Python.
https://dataflows.org
MIT License
193 stars 39 forks

[WIP] Added collect_errors processor #144

roll closed 3 years ago

roll commented 3 years ago

Hi @cschloer,

I've checked the options for goodtables integration, and it doesn't seem doable by the end of this stage. So I dove into the initial issue (https://github.com/BCODMO/laminar_web/issues/801) and tried to implement something simpler. For example, we can have a collect_errors processor that works like this:

def test_collect_errors():
    from dataflows import load, collect_errors, Flow
    res, dp, stats = Flow(
        load('data/cities.csv', name='cities'),
        # it accepts an "Extra Schema" that can be partial
        collect_errors({'fields': [{'name': 'city', 'constraints': {'maxLength': 4}}]}),
    ).results()
    assert dp.get_resource('cities').descriptor.get('errors') == [
        'Field "city" has constraint "maxLength" which is not satisfied for value "london"',
        'Field "city" has constraint "maxLength" which is not satisfied for value "paris"'
    ]

On the other hand, all the checks @adyork asked for can be set in the main schema, e.g. using constraints.pattern; validation will then fail in dataflows even without this processor.

adyork commented 3 years ago

Thanks for looking into this @roll. I'm not sure this gets us to the use case we laid out in https://github.com/BCODMO/laminar_web/issues/801, since it seems to check the values of fields but not the field names.

For example, your example above says:

collect_errors({'fields': [{'name': 'city', 'constraints': {'maxLength': 4}}]})
'Field "city" has constraint "maxLength" which is not satisfied for value "london"'

But we are looking for something that can check the field name itself ('city' in your example) against constraints, such as allowing only [_0-9A-Za-z] and not beginning with a number.

As for using constraints.pattern in the schema, I think that has the same problem of looking at the values, not the field names (e.g. https://specs.frictionlessdata.io/table-schema/#constraints). Is there somewhere else in the documentation I should be looking?

Let me know if I am missing something here.
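The field-name check described above can be sketched in plain Python. This is only an illustration of the rule as stated in this comment (only [_0-9A-Za-z], not beginning with a number); the helper name invalid_field_names is hypothetical and not part of dataflows.

```python
import re

# Naming rule taken from the comment above: only [_0-9A-Za-z],
# and the first character must not be a digit.
FIELD_NAME_RE = re.compile(r'[_A-Za-z][_0-9A-Za-z]*')


def invalid_field_names(schema):
    """Return the field names in a Table Schema descriptor that break the rule.

    Hypothetical helper for illustration; it inspects only the field names,
    never the row values.
    """
    return [
        field['name']
        for field in schema.get('fields', [])
        if not FIELD_NAME_RE.fullmatch(field['name'])
    ]


schema = {'fields': [{'name': 'city'}, {'name': '2nd_city'}, {'name': 'lat-lon'}]}
print(invalid_field_names(schema))  # prints ['2nd_city', 'lat-lon']
```

Unlike constraints.pattern, which Table Schema applies to cell values, this looks only at the descriptor, so it could run before any rows are read.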

roll commented 3 years ago

Oh I see now - what if we just create a processor that validates a field (metadata) against JSON Schema?
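A rough sketch of what such a check might look like, assuming the third-party jsonschema package is available. FIELD_SCHEMA and field_metadata_errors are hypothetical names for illustration, not an existing dataflows processor, and the pattern is the naming rule from the earlier comment.

```python
from jsonschema import Draft7Validator

# Hypothetical JSON Schema for a single field descriptor (an assumption,
# not something shipped with dataflows): the name must use only
# [_0-9A-Za-z] and must not begin with a digit.
FIELD_SCHEMA = {
    'type': 'object',
    'required': ['name'],
    'properties': {
        'name': {'type': 'string', 'pattern': '^[_A-Za-z][_0-9A-Za-z]*$'},
    },
}


def field_metadata_errors(fields, field_schema=FIELD_SCHEMA):
    """Collect validation messages for a list of field descriptors (dicts)."""
    validator = Draft7Validator(field_schema)
    messages = []
    for field in fields:
        # iter_errors yields every violation instead of raising on the first.
        for error in validator.iter_errors(field):
            messages.append(f"Field {field.get('name')!r}: {error.message}")
    return messages


for message in field_metadata_errors([{'name': 'city'}, {'name': '2nd_city'}]):
    print(message)
```

Because the rule lives in a schema rather than in code, it stays declarative: a pipeline author could supply a different field schema without touching the processor.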

roll commented 3 years ago

cc @cschloer, WDYT?

lwinfree commented 3 years ago

@adyork, 😄 hey! @roll and I were talking a bit about this today. Would Evgeny's solution (https://github.com/datahq/dataflows/pull/144#issuecomment-669311044) work for the use case?

akariv commented 3 years ago

@roll - why not use the onerror for set_type or validate? It seems to accomplish the same purpose, no?

roll commented 3 years ago

I think we need a declarative way, but this PR is not what we need anyway, so I'm closing. Thanks.