pschumm opened 9 months ago
Here is another example of this general issue; namely, that validating a data file directly with Frictionless can yield a different result from validating a tabular data resource constructed from a pandas dataframe that was created by reading that same data file (using `read_csv()` with default settings). Here is the data file (named `mydata.csv`):
```
a,b,c
foo,NA,2024-02
,2024-02-01,
bar,,NA
```
and here is the corresponding schema (`mydata.schema.yaml`):
```yaml
fields:
  - name: a
    type: string
    constraints:
      enum: ["foo","bar"]
  - name: b
    type: date
  - name: c
    type: yearmonth
missingValues: ["NA"]
```
This data file is clearly invalid according to this schema: because `missingValues` is set to `["NA"]`, thereby excluding the default value of `""`, the `enum` constraint on `a` will fail, as will the type checks for `date` and `yearmonth`. Witness:
```python
from frictionless import Resource

resource = Resource('mydata.csv', schema='mydata.schema.yaml')
print(resource.extract())
print(resource.validate())
```
which yields:
```
{'mydata': [{'a': 'foo', 'b': None, 'c': yearmonth(year=2024, month=2)}, {'a': '', 'b': datetime.date(2024, 2, 1), 'c': None}, {'a': 'bar', 'b': None, 'c': None}]}
{'valid': False,
 'stats': {'tasks': 1, 'errors': 3, 'warnings': 0, 'seconds': 0.001},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'mydata',
            'type': 'table',
            'valid': False,
            'place': 'mydata.csv',
            'labels': ['a', 'b', 'c'],
            'stats': {'errors': 3,
                      'warnings': 0,
                      'seconds': 0.001,
                      'md5': '42d7314c8c2f546c1695f16729fac05d',
                      'sha256': 'd6512f4ac159e50de74c366cbfc810345450727e8f465c32329faedde1db996d',
                      'bytes': 42,
                      'fields': 3,
                      'rows': 3},
            'warnings': [],
            'errors': [{'type': 'constraint-error',
                        'title': 'Constraint Error',
                        'description': 'A field value does not conform to a constraint.',
                        'message': 'The cell "" in row at position "3" and field "a" at position "1" does not conform to a constraint: constraint "enum" is "[\'foo\', \'bar\']"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'constraint "enum" is "[\'foo\', \'bar\']"',
                        'cells': ['', '2024-02-01', ''],
                        'rowNumber': 3,
                        'cell': '',
                        'fieldName': 'a',
                        'fieldNumber': 1},
                       {'type': 'type-error',
                        'title': 'Type Error',
                        'description': 'The value does not match the schema type and format for this field.',
                        'message': 'Type error in the cell "" in row "3" and field "c" at position "3": type is "yearmonth/default"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'type is "yearmonth/default"',
                        'cells': ['', '2024-02-01', ''],
                        'rowNumber': 3,
                        'cell': '',
                        'fieldName': 'c',
                        'fieldNumber': 3},
                       {'type': 'type-error',
                        'title': 'Type Error',
                        'description': 'The value does not match the schema type and format for this field.',
                        'message': 'Type error in the cell "" in row "4" and field "b" at position "2": type is "date/default"',
                        'tags': ['#table', '#row', '#cell'],
                        'note': 'type is "date/default"',
                        'cells': ['bar', '', 'NA'],
                        'rowNumber': 4,
                        'cell': '',
                        'fieldName': 'b',
                        'fieldNumber': 2}]}]}
```
as expected. Now, let's read that same data file into pandas using the default options:
```python
import pandas as pd

df = pd.read_csv('mydata.csv')
print(df.dtypes)
print(df)
```
As you can see, all three columns are stored with dtype `object`, and both the string "NA" and the empty cells are represented by `NaN`:

```
a    object
b    object
c    object
dtype: object

     a           b        c
0  foo         NaN  2024-02
1  NaN  2024-02-01      NaN
2  bar         NaN      NaN
```
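That both sentinels collapse to `NaN` can be checked directly; the sketch below rebuilds the same frame from an inline string so it runs standalone:

```python
import io

import pandas as pd

# Same contents as mydata.csv, inlined so the snippet is self-contained
csv = 'a,b,c\nfoo,NA,2024-02\n,2024-02-01,\nbar,,NA\n'
df = pd.read_csv(io.StringIO(csv))

# Both the literal string "NA" and the empty cells have become NaN,
# so the file's original distinction between them is gone
print(df.isna().sum().to_dict())  # {'a': 1, 'b': 2, 'c': 2}
```

In other words, by the time the dataframe exists, there is no way to recover which cells held `"NA"` and which were empty.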
Now, creating a resource from this data frame and validating it:
```python
resource = Resource(df, schema='mydata.schema.yaml')
print(resource.extract())
print(resource.validate())
```
yields a valid result:
```
{'memory': [{'a': 'foo', 'b': None, 'c': yearmonth(year=2024, month=2)}, {'a': None, 'b': datetime.date(2024, 2, 1), 'c': None}, {'a': 'bar', 'b': None, 'c': None}]}
{'valid': True,
 'stats': {'tasks': 1, 'errors': 0, 'warnings': 0, 'seconds': 0.001},
 'warnings': [],
 'errors': [],
 'tasks': [{'name': 'memory',
            'type': 'table',
            'valid': True,
            'place': '<memory>',
            'labels': ['a', 'b', 'c'],
            'stats': {'errors': 0,
                      'warnings': 0,
                      'seconds': 0.001,
                      'fields': 3,
                      'rows': 3},
            'warnings': [],
            'errors': []}]}
```
What's important here are not the specifics (e.g., the fact that I used "NA" to represent missing values, which `read_csv()` automatically interprets as missing), but the fact that an unsuspecting pandas user might find this difference in behavior surprising, if not a bit frustrating (depending on their workflow).
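For users who want the dataframe-backed resource to see exactly the strings that file-based validation sees, one option (a sketch, not a recommendation) is to disable `read_csv()`'s default sentinel translation with `keep_default_na=False`:

```python
import io

import pandas as pd

csv = 'a,b,c\nfoo,NA,2024-02\n,2024-02-01,\nbar,,NA\n'  # same contents as mydata.csv

# With keep_default_na=False (and no na_values), no strings are treated as
# missing: "NA" stays the string "NA" and empty cells stay ""
df = pd.read_csv(io.StringIO(csv), keep_default_na=False)
print(df.loc[0, 'b'], repr(df.loc[1, 'a']))  # NA ''
```

Since the literal cell values are preserved, validating a resource built from this frame should report the same `missingValues`-related errors as validating `mydata.csv` directly.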
Now, in this example, I can perhaps see a rationale for evaluating the pandas dataframe as valid. Moreover, I'm not even sure if it is possible (or even desirable) to ensure that all possible data files yield the same result when following these two procedures. However, I do think that it would be good to summarize the differences in a concise way and then provide a cautionary note in the documentation. I'm willing to draft something once @roll has decided which elements of the behavior(s) illustrated above he might want to change/fix (since he tagged this as a bug).
We have a lot of workflows that involve curating data in pandas and then, as a final step, validating those data against a target schema. In general this works well, though new folks often trip over a few apparent inconsistencies relative to corresponding operations that don't involve pandas. For example, consider a schema (in `mydata.schema.yaml`) and a data file (in `mydata.csv`) that are clearly valid according to the schema when the file is validated directly (using Frictionless 5.15.10). However, now consider reading those data into a pandas dataframe (using the default settings for `read_csv()`) prior to validation (again using Frictionless 5.15.10).
Since pandas does not use nullable integers by default, the first column (`a`) gets read in as a float, while the second (`b`) gets read in as an int (as expected). However, as you can see, neither is then valid. One can sidestep this with, say, an explicit conversion of the columns,
or, in the case of `a`, by specifying `keep_default_na=False` in the call to `read_csv()`, or by setting `missingValues` (in the schema) to `['nan']`. This demonstrates that data which are valid when evaluated directly via Frictionless may not be valid when going through pandas with default options. More specifically:

- For missing values (`np.nan`) stored as floats in pandas, it appears that these must either be translated first to empty strings, or the string "nan" must be added to the schema's `missingValues` property.
- Because a column containing missing values is read in as a float, its integer values pick up a trailing decimal (hence the `1972.0` and `2001.0` in the error messages above).

Am I just missing something about the way validation of in-memory data (e.g., from a pandas dataframe) is intended to work? Is it correct to conclude that columns in a pandas dataframe should really be converted to `object` type before creating a Frictionless tabular resource, and if so, should that be done by default?

Thanks in advance for taking the time to answer these questions.
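The nullable-integer point can be sketched standalone. Since the data file from this example isn't reproduced above, the inline CSV below is a hypothetical stand-in (the column names and the 1972/2001 values are borrowed from the error messages):

```python
import io

import pandas as pd

csv = 'a,b\n1972,1\n,2\n2001,3\n'  # hypothetical stand-in, not the original file

default = pd.read_csv(io.StringIO(csv))
print(default['a'].dtype)   # float64 -- the missing value forces a float column

# Requesting pandas' nullable integer extension dtype keeps the column integer
nullable = pd.read_csv(io.StringIO(csv), dtype={'a': 'Int64'})
print(nullable['a'].dtype)  # Int64
print(nullable['a'].tolist())
```

Another idiom sometimes used before building a resource is casting columns to `object` and replacing the missing sentinel with `None`, which matches the all-`object` dataframe behavior shown in the first example above.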