frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

Apparent inconsistencies in validation of in-memory data from a pandas dataframe #1599

Open pschumm opened 9 months ago

pschumm commented 9 months ago

We have a lot of workflows that involve curating data in pandas and then as a final step validating those data against a target schema. In general this works well, though new folks often trip over a few apparent inconsistencies relative to corresponding operations that don't involve pandas. For example, consider the following schema (in mydata.schema.yaml):

fields:
  - name: a
    type: integer
  - name: b
    type: year

and the following data file (in mydata.csv):

a,b
1,1972
,2001
22,2023
,1985

These data are clearly valid according to the schema (using Frictionless 5.15.10):

from frictionless import Resource
resource = Resource('mydata.csv', schema='mydata.schema.yaml')
assert resource.validate().valid
print(resource.extract())

yielding:

{'mydata': [{'a': 1, 'b': 1972}, {'a': None, 'b': 2001}, {'a': 22, 'b': 2023}, {'a': None, 'b': 1985}]}

However, now consider reading those data into a pandas dataframe (using the default settings for read_csv()) prior to validation:

import pandas as pd
df = pd.read_csv('mydata.csv')
print(df)
print(df.dtypes)
resource = Resource(df, schema='mydata.schema.yaml')
print(resource.extract())
print(resource.validate())

yielding (again using Frictionless 5.15.10):

      a     b
0   1.0  1972
1   NaN  2001
2  22.0  2023
3   NaN  1985
a    float64
b      int64
dtype: object
{'memory': [{'a': 1, 'b': None}, {'a': None, 'b': None}, {'a': 22, 'b': None}, {'a': None, 'b': None}]}
{'valid': False,
 'stats': {'tasks': 1, 'errors': 6, 'warnings': 0, 'seconds': 0.001},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'memory',
            'type': 'table',
            'valid': False,
            'place': '<memory>',
            'labels': ['a', 'b'],
            'stats': {'errors': 6,
                      'warnings': 0,
                      'seconds': 0.001,
                      'fields': 2,
                      'rows': 4},
            'warnings': [],
            'errors': [{'type': 'type-error',
                        'title': 'Type Error',
                        'description': 'The value does not match the schema '
                                       'type and format for this field.',
                        'message': 'Type error in the cell "1972.0" in row "2" '
                                   'and field "b" at position "2": type is '
                                   '"year/default"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'type is "year/default"',
                        'cells': ['1.0', '1972.0'],
                        'rowNumber': 2,
                        'cell': '1972.0',
                        'fieldName': 'b',
                        'fieldNumber': 2},
                       {'type': 'type-error',
                        'title': 'Type Error',
                        'description': 'The value does not match the schema '
                                       'type and format for this field.',
                        'message': 'Type error in the cell "nan" in row "3" '
                                   'and field "a" at position "1": type is '
                                   '"integer/default"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'type is "integer/default"',
                        'cells': ['nan', '2001.0'],
                        'rowNumber': 3,
                        'cell': 'nan',
                        'fieldName': 'a',
                        'fieldNumber': 1},

<snip>

Since pandas does not use nullable integers by default, the first column (a) is read in as float64, while the second (b) is read in as int64 (as expected). However, as you can see, neither field validates. One can sidestep this with, say:

df = df.fillna('')
df['b'] = df.b.astype(str)
print(df)
print(df.dtypes)
resource = Resource(df, schema='mydata.schema.yaml')
print(resource.extract())
print(resource.validate().valid)

yielding:

      a     b
0   1.0  1972
1        2001
2  22.0  2023
3        1985
a    object
b    object
dtype: object
{'memory': [{'a': 1, 'b': 1972}, {'a': None, 'b': 2001}, {'a': 22, 'b': 2023}, {'a': None, 'b': 1985}]}
True

or, in the case of a, by specifying keep_default_na=False in the call to read_csv() or by setting missingValues (in the schema) to ['nan']. This demonstrates that data which are valid when evaluated directly via Frictionless may not be valid when going through pandas using default options. More specifically:

  1. When an integer field contains missing values (np.nan) and is therefore stored as a float in pandas, it appears that the values must either be converted to empty strings first, or the string "nan" must be added to the schema's missingValues property.
  2. It appears that integer is not an acceptable type for a year field (even without any missing values), and gets translated to float during validation (e.g., 1972.0 and 2001.0 in the error messages above).
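Regarding point 1, another way to avoid the float promotion altogether is pandas' nullable integer extension dtype. This is a sketch of the pandas side only; I have not verified how Frictionless handles pd.NA values:

```python
import io
import pandas as pd

csv = "a,b\n1,1972\n,2001\n22,2023\n,1985\n"
# Requesting the nullable "Int64" extension dtype keeps column a integral even
# with missing values, instead of promoting it to float64 as described above
df = pd.read_csv(io.StringIO(csv), dtype={"a": "Int64"})
print(df.dtypes)  # a: Int64, b: int64
```

Whether a resource built from such a dataframe then validates cleanly against the integer type is part of what this issue is asking about.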

Am I just missing something about the way validation of in-memory data (e.g., from a pandas dataframe) is intended to work? Is it correct to conclude that columns in a pandas dataframe should really be converted to object type before creating a Frictionless tabular resource, and if so, should that be done by default?
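For concreteness, the kind of conversion I have in mind looks something like the following (to_frictionless_friendly is a hypothetical helper of my own, not part of Frictionless; it tries to mimic what Frictionless would see when reading the CSV directly):

```python
import pandas as pd

def to_frictionless_friendly(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: render every cell as the string Frictionless
    would see in the CSV -- '' for missing values, and '1' rather than
    '1.0' for integers that pandas promoted to float64."""
    def render(value):
        if pd.isna(value):
            return ""
        if isinstance(value, float) and value.is_integer():
            return str(int(value))
        return str(value)
    # Series.map applies render to NaN cells as well, so every column
    # comes back as dtype object containing only strings
    return df.apply(lambda col: col.map(render))
```

Applied to the dataframe above, this turns column a into ['1', '', '22', ''] and column b into ['1972', '2001', '2023', '1985'], which is essentially the fillna/astype workaround generalized to all columns.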

Thanks in advance for taking the time to answer these questions.

pschumm commented 4 months ago

Here is another example of this general issue; namely, that validating a data file directly with Frictionless can yield a different result from validating a tabular data resource constructed from a pandas dataframe which was created by reading that same data file (using read_csv()) with default settings. Here is the data file (named mydata.csv):

a,b,c
foo,NA,2024-02
,2024-02-01,
bar,,NA

and here is the corresponding schema (mydata.schema.yaml):

fields:
  - name: a
    type: string
    constraints:
      enum: ["foo","bar"]
  - name: b
    type: date
  - name: c
    type: yearmonth
missingValues: ["NA"]

This data file is clearly invalid according to this schema; because missingValues is set to ["NA"], thereby excluding the default value of "", the enum constraint on a will fail, as will the date and yearmonth type checks. Witness:

from frictionless import Resource
resource = Resource('mydata.csv', schema='mydata.schema.yaml')
print(resource.extract())
print(resource.validate())

which yields:

{'mydata': [{'a': 'foo', 'b': None, 'c': yearmonth(year=2024, month=2)}, {'a': '', 'b': datetime.date(2024, 2, 1), 'c': None}, {'a': 'bar', 'b': None, 'c': None}]}
{'valid': False,
 'stats': {'tasks': 1, 'errors': 3, 'warnings': 0, 'seconds': 0.001},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'mydata',
            'type': 'table',
            'valid': False,
            'place': 'mydata.csv',
            'labels': ['a', 'b', 'c'],
            'stats': {'errors': 3,
                      'warnings': 0,
                      'seconds': 0.001,
                      'md5': '42d7314c8c2f546c1695f16729fac05d',
                      'sha256': 'd6512f4ac159e50de74c366cbfc810345450727e8f465c32329faedde1db996d',
                      'bytes': 42,
                      'fields': 3,
                      'rows': 3},
            'warnings': [],
            'errors': [{'type': 'constraint-error',
                        'title': 'Constraint Error',
                        'description': 'A field value does not conform to a '
                                       'constraint.',
                        'message': 'The cell "" in row at position "3" and '
                                   'field "a" at position "1" does not conform '
                                   'to a constraint: constraint "enum" is '
                                   '"[\'foo\', \'bar\']"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'constraint "enum" is "[\'foo\', \'bar\']"',
                        'cells': ['', '2024-02-01', ''],
                        'rowNumber': 3,
                        'cell': '',
                        'fieldName': 'a',
                        'fieldNumber': 1},
                       {'type': 'type-error',
                        'title': 'Type Error',
                        'description': 'The value does not match the schema '
                                       'type and format for this field.',
                        'message': 'Type error in the cell "" in row "3" and '
                                   'field "c" at position "3": type is '
                                   '"yearmonth/default"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'type is "yearmonth/default"',
                        'cells': ['', '2024-02-01', ''],
                        'rowNumber': 3,
                        'cell': '',
                        'fieldName': 'c',
                        'fieldNumber': 3},
                       {'type': 'type-error',
                        'title': 'Type Error',
                        'description': 'The value does not match the schema '
                                       'type and format for this field.',
                        'message': 'Type error in the cell "" in row "4" and '
                                   'field "b" at position "2": type is '
                                   '"date/default"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'type is "date/default"',
                        'cells': ['bar', '', 'NA'],
                        'rowNumber': 4,
                        'cell': '',
                        'fieldName': 'b',
                        'fieldNumber': 2}]}]}

as expected. Now, let's read that same data file into pandas using the default options:

import pandas as pd
df = pd.read_csv('mydata.csv')
print(df.dtypes)
print(df)

As you can see, all three columns are stored with dtype object, and both the string "NA" and the empty cells are represented by NaN:

a    object
b    object
c    object
dtype: object
     a           b        c
0  foo         NaN  2024-02
1  NaN  2024-02-01      NaN
2  bar         NaN      NaN

Now, creating a resource from this data frame and validating it:

resource = Resource(df, schema='mydata.schema.yaml')
print(resource.extract())
print(resource.validate())

yields a valid result:

{'memory': [{'a': 'foo', 'b': None, 'c': yearmonth(year=2024, month=2)}, {'a': None, 'b': datetime.date(2024, 2, 1), 'c': None}, {'a': 'bar', 'b': None, 'c': None}]}
{'valid': True,
 'stats': {'tasks': 1, 'errors': 0, 'warnings': 0, 'seconds': 0.001},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'memory',
            'type': 'table',
            'valid': True,
            'place': '<memory>',
            'labels': ['a', 'b', 'c'],
            'stats': {'errors': 0,
                      'warnings': 0,
                      'seconds': 0.001,
                      'fields': 3,
                      'rows': 3},
            'warnings': [],
            'errors': []}]}

What's important here is not the specifics (e.g., the fact that I used "NA" to represent missing values, which read_csv() automatically interprets as missing), but the fact that an unsuspecting pandas user might find this difference in behavior surprising, if not a bit frustrating (depending on their workflow).
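For what it's worth, one can make read_csv() mirror the schema's missingValues exactly by disabling pandas' built-in NA sentinels and re-adding only "NA" (whether Frictionless then reproduces the three errors from the direct CSV read is, of course, the question at hand):

```python
import io
import pandas as pd

csv = "a,b,c\nfoo,NA,2024-02\n,2024-02-01,\nbar,,NA\n"
# keep_default_na=False turns off pandas' default NA sentinels (including ''),
# and na_values=['NA'] then matches the schema's missingValues: 'NA' becomes
# NaN while genuinely empty cells survive as empty strings
df = pd.read_csv(io.StringIO(csv), keep_default_na=False, na_values=["NA"])
```

With these options, the dataframe distinguishes the empty cells (kept as '') from the "NA" cells (now NaN), just as the schema does.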

Now, in this example, I can perhaps see a rationale for evaluating the pandas dataframe as valid. Moreover, I'm not sure it is even possible (or desirable) to ensure that all possible data files yield the same result under these two procedures. However, I do think it would be good to summarize the differences concisely and then add a cautionary note to the documentation. I'm willing to draft something once @roll has decided which elements of the behavior illustrated above he might want to change/fix (since he tagged this as a bug).