frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Incorrect errors when validating data with a missing required column #1611

Closed pierrecamilleri closed 7 months ago

pierrecamilleri commented 9 months ago

Overview

While migrating validata from v4 to v5, we encountered incorrect errors when a required column is missing from the data.

Here is some Python code to reproduce the issue:

import frictionless

source = [["B", "C"], ["b", "c"]]
schema = {
    "$schema": "https://frictionlessdata.io/schemas/table-schema.json",
    "fields": [
        {
            "name": "A",
            "title": "Field A",
            "type": "string",
            "constraints": {"required": True},
        },
        {"name": "B", "title": "Field B", "type": "string"},
        {"name": "C", "title": "Field C", "type": "string"},
    ],
}

if __name__ == "__main__":
    schema = frictionless.Schema.from_descriptor(schema)
    resource = frictionless.Resource(
        source, schema=schema, detector=frictionless.Detector(schema_sync=True)
    )
    report = frictionless.validate(resource)
    print(report)

Output:

{'valid': False,
 'stats': {'tasks': 1, 'errors': 3, 'warnings': 0, 'seconds': 0.005},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'memory',
            'type': 'table',
            'valid': False,
            'place': '<memory>',
            'labels': ['B', 'C'],
            'stats': {'errors': 3,
                      'warnings': 0,
                      'seconds': 0.005,
                      'fields': 3,
                      'rows': 1},
            'warnings': [],
            'errors': [{'type': 'missing-label',
                        'title': 'Missing Label',
                        'description': 'Based on the schema there should be a '
                                       "label that is missing in the data's "
                                       'header.',
                        'message': "There is a missing label in the header's "
                                   'field "A" at position "3"',
                        'tags': ['#table', '#header', '#label'],
                        'note': '',
                        'labels': ['B', 'C'],
                        'rowNumbers': [1],
                        'label': '',
                        'fieldName': 'A',
                        'fieldNumber': 3},
                       {'type': 'constraint-error',
                        'title': 'Constraint Error',
                        'description': 'A field value does not conform to a '
                                       'constraint.',
                        'message': 'The cell "None" in row at position "2" and '
                                   'field "A" at position "3" does not conform '
                                   'to a constraint: constraint "required" is '
                                   '"True"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'constraint "required" is "True"',
                        'cells': ['b', 'c'],
                        'rowNumber': 2,
                        'cell': 'None',
                        'fieldName': 'A',
                        'fieldNumber': 3},
                       {'type': 'missing-cell',
                        'title': 'Missing Cell',
                        'description': 'This row has less values compared to '
                                       'the header row (the first row in the '
                                       'data source). A key concept is that '
                                       'all the rows in tabular data must have '
                                       'the same number of columns.',
                        'message': 'Row at position "2" has a missing cell in '
                                   'field "A" at position "3"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': '',
                        'cells': ['b', 'c'],
                        'rowNumber': 2,
                        'cell': '',
                        'fieldName': 'A',
                        'fieldNumber': 3}]}]}

Observed behavior

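The report contains three errors, all on field "A": a missing-label error, a constraint-error (on the "required" constraint), and a missing-cell error.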

Expected behavior

I would expect to only get the first missing-label error.
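To make the expectation concrete, here is a minimal sketch of a post-processing filter that keeps the missing-label error and drops any other error reported on the same field. This is a hypothetical workaround, not part of frictionless: the function name `drop_cascading_errors` is made up, and the error dictionaries are trimmed-down copies of the ones in the report above.

```python
def drop_cascading_errors(errors):
    """Keep missing-label errors; drop other errors reported on the same fields.

    Hypothetical helper operating on plain error dicts (as found in a
    report task's 'errors' list). Not a frictionless API.
    """
    # Names of the fields whose label is missing from the header.
    missing = {e["fieldName"] for e in errors if e["type"] == "missing-label"}
    return [
        e
        for e in errors
        if e["type"] == "missing-label" or e.get("fieldName") not in missing
    ]


# Trimmed-down version of the three errors from the report above.
errors = [
    {"type": "missing-label", "fieldName": "A", "fieldNumber": 3},
    {"type": "constraint-error", "fieldName": "A", "fieldNumber": 3, "rowNumber": 2},
    {"type": "missing-cell", "fieldName": "A", "fieldNumber": 3, "rowNumber": 2},
]

filtered = drop_cascading_errors(errors)
print([e["type"] for e in filtered])  # ['missing-label']
```

Errors on other fields are left untouched, so genuine constraint violations in columns that are present would still be reported.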

Other details and experimentations

Frictionless version 5.16.0

The result is the same with command-line validation. I enabled "schema_sync" to reproduce our use case more closely, but it does not seem to be related to the actual issue.

Inspecting "row" in "validator/validator.py", line 151:

                        row = next(resource.row_stream)  # type: ignore

shows an artificially added "A" key:

{'B': 'b', 'C': 'c', 'A': None}
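
One way to picture where the extra key might come from, as a plain-Python illustration rather than the actual frictionless code: if a row is built by iterating over the schema field names instead of the data labels, any field without a matching label ends up with a None value. The variable names are illustrative, and the field order (missing field appended last) is an assumption inferred from "fieldNumber": 3 in the report.

```python
labels = ["B", "C"]  # header actually present in the data
cells = ["b", "c"]   # first data row

# Assumption: the missing field "A" sits after the existing labels,
# as suggested by 'fieldNumber': 3 in the report.
field_names = ["B", "C", "A"]

# Map each label to its cell, then build the row from the field names.
by_label = dict(zip(labels, cells))
row = {name: by_label.get(name) for name in field_names}

print(row)  # {'B': 'b', 'C': 'c', 'A': None}
```

A "required" check applied to every key of such a row would then fail on "A", which is consistent with the constraint-error seen above.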

pierrecamilleri commented 8 months ago

After investigating:

Naively removing missing columns from field_info at creation breaks approximately 60 tests. We'll try doing the same removal later in the pipeline, e.g. in the __process method, to see if we can make it work.

As a side note and as feedback from our exploration: validation loops over field_info, which is derived directly from resource.schema.fields, which in turn can be mutated during processing (at least when schema_sync is true). We found this design choice a bit unsettling. We would have expected the loop to run over the table labels instead, since columns may be missing, and we would have expected resource.schema.fields to contain the fields as they are declared in the schema.
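
The label-driven loop we had in mind can be sketched as follows. This is a simplified, hypothetical validator written for illustration only: `validate_label_driven` and its error dictionaries are not frictionless code, and only the "required" constraint is handled.

```python
def validate_label_driven(labels, rows, fields):
    """Header checks use the schema; cell checks only use existing labels."""
    errors = []
    by_name = {field["name"]: field for field in fields}

    # Header-level check: a schema field absent from the labels is reported once.
    for field in fields:
        if field["name"] not in labels:
            errors.append({"type": "missing-label", "fieldName": field["name"]})

    # Cell-level checks: iterate over the columns actually present in the data.
    # Data rows start at position 2 because the header is row 1.
    for row_number, cells in enumerate(rows, start=2):
        for label, cell in zip(labels, cells):
            field = by_name.get(label)
            if field is None:
                continue  # label not described by the schema
            required = field.get("constraints", {}).get("required", False)
            if required and cell in (None, ""):
                errors.append(
                    {
                        "type": "constraint-error",
                        "fieldName": label,
                        "rowNumber": row_number,
                    }
                )
    return errors


fields = [
    {"name": "A", "type": "string", "constraints": {"required": True}},
    {"name": "B", "type": "string"},
    {"name": "C", "type": "string"},
]
result = validate_label_driven(["B", "C"], [["b", "c"]], fields)
print(result)  # [{'type': 'missing-label', 'fieldName': 'A'}]
```

With this structure, a missing column produces a single header-level error and never reaches the cell-level checks, which matches the expected behavior described in the issue.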