blaze / odo

Data Migration for the Blaze Project
http://odo.readthedocs.org/
BSD 3-Clause "New" or "Revised" License

Categorical support broken with recent pandas #597

Closed dhirschfeld closed 6 years ago

dhirschfeld commented 7 years ago

test_discover is broken for me:

import pandas as pd
import pandas.util.testing as tm
import numpy as np
import dask.dataframe as dd
from datashape import var, Record, int64, float64, Categorical
from datashape.util.testing import assert_dshape_equal

from odo import convert, discover

def test_discover():
    df = pd.DataFrame({'x': list('a'*5 + 'b'*5 + 'c'*5),
                       'y': np.arange(15, dtype=np.int64),
                       'z': list(map(float, range(15)))},
                      columns=['x', 'y', 'z'])
    df.x = df.x.astype('category')
    ddf = dd.from_pandas(df, npartitions=2)
    assert_dshape_equal(discover(ddf),
                        var * Record([('x', Categorical(['a', 'b', 'c'])),
                                      ('y', int64), ('z', float64)]))
    assert_dshape_equal(discover(ddf.x), var * Categorical(['a', 'b', 'c']))

Running it fails with:

Traceback (most recent call last):

  File "<ipython-input-2-75fca8249e6d>", line 7, in <module>
    assert_dshape_equal(discover(ddf),

  File "C:\Miniconda3\lib\site-packages\multipledispatch\dispatcher.py", line 164, in __call__
    return func(*args, **kwargs)

  File "C:\Miniconda3\lib\site-packages\odo\backends\dask.py", line 27, in discover_dask_dataframe
    return var * discover(df.head()).measure

  File "C:\Miniconda3\lib\site-packages\multipledispatch\dispatcher.py", line 164, in __call__
    return func(*args, **kwargs)

  File "C:\Miniconda3\lib\site-packages\odo\backends\pandas.py", line 39, in discover_dataframe
    for k in df.columns])

  File "C:\Miniconda3\lib\site-packages\odo\backends\pandas.py", line 39, in <listcomp>
    for k in df.columns])

  File "C:\Miniconda3\lib\site-packages\odo\backends\pandas.py", line 31, in dshape_from_pandas
    dshape = datashape.CType.from_numpy_dtype(col.dtype)

  File "C:\Miniconda3\lib\site-packages\datashape\coretypes.py", line 781, in from_numpy_dtype
    if np.issubdtype(dt, np.datetime64):

  File "C:\Miniconda3\lib\site-packages\numpy\core\numerictypes.py", line 755, in issubdtype
    return issubclass(dtype(arg1).type, arg2)

TypeError: data type not understood
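
The last two frames suggest the failure does not depend on dask: NumPy is asked to interpret a pandas categorical dtype, which it cannot do. A minimal sketch that should reproduce the same class of error with only pandas and NumPy follows; the exact error message may differ between NumPy versions.

```python
import numpy as np
import pandas as pd

# A categorical column, similar to df.x in the test above.
col = pd.Series(list('aabbcc')).astype('category')

try:
    # Mirrors the check shown in the traceback: NumPy first has to turn
    # the pandas categorical dtype into a NumPy dtype, which it cannot do.
    np.issubdtype(col.dtype, np.datetime64)
    raised = False
except TypeError as exc:
    raised = True
    print('TypeError:', exc)
```

If `raised` ends up `True`, the categorical column reached the NumPy-only code path, which is consistent with the explicit categorical check being skipped.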
dhirschfeld commented 7 years ago

In dshape_from_pandas there is an explicit test for categorical: https://github.com/blaze/odo/blob/ba84238eb8dbcac4784ae7ebf62988d7e163c283/odo/backends/pandas.py#L20-L33

But this is failing for me since the definition of categorical: https://github.com/blaze/odo/blob/ba84238eb8dbcac4784ae7ebf62988d7e163c283/odo/backends/pandas.py#L17

...gives a property object rather than an actual dtype:

In [5]: pd.Categorical.dtype
Out[5]: <property at 0x4843818>
In [6]: type(pd.Categorical.dtype)
Out[6]: property

This is with pandas 0.21.0 from conda-forge.
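
One possible way to avoid relying on the class-level `pd.Categorical.dtype` attribute is to test the column's own dtype with `isinstance`. The sketch below is not odo's actual fix. It assumes `pandas.api.types.CategoricalDtype` is available (pandas 0.21 or later), and `is_categorical` and `describe_column` are hypothetical helper names used only for illustration.

```python
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype


def is_categorical(col):
    """Return True if the pandas Series `col` has a categorical dtype."""
    # Check the dtype instance on the column itself instead of comparing
    # against pd.Categorical.dtype, which is a property on the class.
    return isinstance(col.dtype, CategoricalDtype)


def describe_column(col):
    """Hypothetical helper: branch on categorical before any NumPy check."""
    if is_categorical(col):
        return ('categorical', list(col.cat.categories))
    return ('other', str(col.dtype))


df = pd.DataFrame({'x': list('a' * 5 + 'b' * 5 + 'c' * 5),
                   'y': np.arange(15, dtype=np.int64)})
df['x'] = df['x'].astype('category')

x_info = describe_column(df['x'])
y_info = describe_column(df['y'])
```

Because the categorical branch runs first, the categorical column never reaches a NumPy-only conversion such as the one in the traceback.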