frictionlessdata / frictionless-py

Data management framework for Python that provides functionality to describe, extract, validate, and transform tabular data
https://framework.frictionlessdata.io
MIT License

reading data from pandas returns null values #1678

Closed jgunstone closed 1 month ago

jgunstone commented 1 month ago

following the docs: https://framework.frictionlessdata.io/docs/formats/pandas.html

to reproduce:

mamba create -n dp-test python pandas -y
mamba activate dp-test 
pip install frictionless frictionless[pandas]
# >> python
# Python 3.12.5 | packaged by conda-forge | (main, Aug  8 2024, 18:36:51) [GCC 12.4.0] on linux
# Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> x = range(0, 5)
>>> y = [_**2 for _ in x]
>>> df = pd.DataFrame({"x": x, "y": y})
>>> from pprint import pprint
>>> pprint(df)
   x   y
0  0   0
1  1   1
2  2   4
3  3   9
4  4  16
>>> from frictionless import Resource
>>> r = Resource(df)
>>> r.read_rows()
[{'x': None, 'y': None}, {'x': None, 'y': None}, {'x': None, 'y': None}, {'x': None, 'y': None}, {'x': None, 'y': Non
pierrecamilleri commented 1 month ago

Thanks for the report. I can reproduce.

I noticed something odd: adding a string, boolean or decimal number column gives the correct output, but a dataframe made up only of integer columns (any number of them) triggers the reported bug.

jgunstone commented 1 month ago

yh - just to demonstrate your point, this works:

>>> import pandas as pd
>>> x = range(0, 5)
>>> y = [_**2 for _ in x]
>>> z = [_*1.2 for _ in x]
>>> df = pd.DataFrame({"x": x, "y": y, "z": z})

>>> from frictionless import Resource
>>> r = Resource(df)
>>> r.read_rows()

[{'x': 0, 'y': 0, 'z': Decimal('0.0')},
 {'x': 1, 'y': 1, 'z': Decimal('1.2')},
 {'x': 2, 'y': 4, 'z': Decimal('2.4')},
 {'x': 3, 'y': 9, 'z': Decimal('3.5999999999999996')},
 {'x': 4, 'y': 16, 'z': Decimal('4.8')}]
pierrecamilleri commented 1 month ago

Some exploration notes:

I have not looked into why adding another dtype to the dataframe solves the issue; it probably triggers a conversion somewhere.

EDIT: further observations

import pandas as pd

df_int = pd.Series([1, 2])
print(type(df_int[0]))
# <class 'numpy.int64'>

df_mixed = pd.Series([1, "a"])
print(df_mixed.dtypes)
# object
print(type(df_mixed[0]))
# <class 'int'>

df.iterrows() returns a pd.Series for each row. Mixed-type series always have dtype('O') (Python object type), and values are coerced to Python types. Homogeneous series keep their numpy types, which are not always instances of the corresponding Python types. This is the case in particular for np.int64 and np.True_/np.False_, whereas numpy strings and floats are accepted as instances of Python's str and float.
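
To make the observation above concrete, here is a minimal sketch of the isinstance checks involved, using only pandas and numpy. The helper to_native is hypothetical (it is not part of frictionless or pandas), and whether converting values this way is the right fix inside frictionless is an assumption, not something confirmed in this thread.

```python
import numpy as np
import pandas as pd

# An integer-only dataframe, as in the original report.
df = pd.DataFrame({"x": [0, 1], "y": [0, 1]})

# iterrows() yields (index, Series); here the Series keeps a numpy integer dtype.
_, row = next(df.iterrows())
value = row["x"]

print(isinstance(value, int))              # False: numpy integers do not subclass int
print(isinstance(np.True_, bool))          # False: numpy booleans do not subclass bool
print(isinstance(np.float64(1.2), float))  # True: np.float64 subclasses float
print(isinstance(np.str_("a"), str))       # True: np.str_ subclasses str


def to_native(v):
    """Hypothetical helper: unwrap numpy scalars into plain Python objects."""
    return v.item() if isinstance(v, np.generic) else v


native = to_native(value)
print(isinstance(native, int))             # True

# Casting to object dtype also boxes the values as Python ints.
boxed = df.astype(object).iloc[0]["x"]
print(type(boxed))
```

Until the underlying check is fixed, casting the dataframe with df.astype(object) before passing it to Resource might be worth trying as a workaround, though that has not been verified here.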