apple / turicreate

Turi Create simplifies the development of custom machine learning models.
BSD 3-Clause "New" or "Revised" License

Can not convert a pandas data frame to SFrame #891

Open TobyRoseman opened 6 years ago

TobyRoseman commented 6 years ago

If a pandas data frame has an object column that contains a NaN value, we cannot convert it to an SFrame, and we get an unhelpful error message.

Turicreate version: 4.3.2
Python version: 3.6.5 (this is likely a bug only in Python 3)

Simple repro code with stack trace:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: import turicreate as tc

In [4]: temp = pd.DataFrame({'a': ['test', np.NaN]})

In [5]: tc.SFrame(temp)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
turicreate/cython/cy_flexible_type.pyx in turicreate.cython.cy_flexible_type.infer_flex_type_of_sequence()

turicreate/cython/cy_flexible_type.pyx in turicreate.cython.cy_flexible_type.flex_type_from_array_typecode()

ValueError: Type 'O' does not appear to be a valid array type code.

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
~/anaconda3/lib/python3.6/site-packages/turicreate/data_structures/sframe.py in __init__(self, data, format, _proxy)
    771                     for c in data.columns.values:
--> 772                         self.add_column(SArray(data[c].values), str(c), inplace=True)
    773                 elif (_format == 'sframe_obj'):

~/anaconda3/lib/python3.6/site-packages/turicreate/data_structures/sarray.py in __init__(self, data, dtype, ignore_cast_failure, _proxy)
    354                         # we need to get a bit more fine grained than that
--> 355                         dtype = infer_type_of_sequence(data)
    356                     if len(data.shape) == 2:

turicreate/cython/cy_flexible_type.pyx in turicreate.cython.cy_flexible_type.infer_type_of_sequence()

turicreate/cython/cy_flexible_type.pyx in turicreate.cython.cy_flexible_type.infer_type_of_sequence()

turicreate/cython/cy_flexible_type.pyx in turicreate.cython.cy_flexible_type.infer_flex_type_of_sequence()

turicreate/cython/cy_flexible_type.pyx in turicreate.cython.cy_flexible_type._infer_common_type_of_listlike()

turicreate/cython/cy_flexible_type.pyx in turicreate.cython.cy_flexible_type.infer_common_type()

TypeError: sequence item 0: expected str instance, bytes found

During handling of the above exception, another exception occurred:

TypeError                                 Traceback (most recent call last)
<ipython-input-5-4164fbdc309e> in <module>()
----> 1 tc.SFrame(temp)

~/anaconda3/lib/python3.6/site-packages/turicreate/data_structures/sframe.py in __init__(self, data, format, _proxy)
    812                     pass
    813                 else:
--> 814                     raise ValueError('Unknown input type: ' + format)
    815 
    816     @staticmethod

~/anaconda3/lib/python3.6/site-packages/turicreate/cython/context.py in __exit__(self, exc_type, exc_value, traceback)
     47             if not self.show_cython_trace:
     48                 # To hide cython trace, we re-raise from here
---> 49                 raise exc_type(exc_value)
     50             else:
     51                 # To show the full trace, we do nothing and let exception propagate

TypeError: sequence item 0: expected str instance, bytes found

RAbraham commented 5 years ago

Same problem here. I had to use df = df.replace({pd.np.nan: None}) to work around this issue.
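A minimal sketch of the same idea, written without the `pd.np` alias (which newer pandas releases no longer provide). The helper name `nan_to_none_in_object_columns` is hypothetical and not part of pandas or Turi Create; it only touches object-dtype columns, on the assumption that those are the ones that trigger the error.

```python
import numpy as np
import pandas as pd


def nan_to_none_in_object_columns(df):
    """Return a copy of df with NaN replaced by None in object-dtype columns."""
    out = df.copy()
    for col in out.columns:
        if out[col].dtype != object:
            continue  # leave numeric and other typed columns untouched
        # NaN is the only float value that is not equal to itself.
        cleaned = [None if isinstance(v, float) and v != v else v for v in out[col]]
        out[col] = pd.Series(cleaned, index=out.index, dtype=object)
    return out


# Same data as the repro, with the object dtype made explicit.
temp = pd.DataFrame({"a": pd.Series(["test", np.nan], dtype=object)})
cleaned = nan_to_none_in_object_columns(temp)

# Assumption: the cleaned frame is then passed to Turi Create, e.g.
# tc.SFrame(cleaned)
```

The original frame is left unmodified, so the cleaned copy can be compared against it if the conversion still fails.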

TobyRoseman commented 4 years ago

This is still an issue in TuriCreate 6.4.

mdhanna commented 4 years ago

I had the same error but a different cause. My pandas dataframe was the result of a groupby and aggregating with pd.Series.mode to find the most frequent value. Even though running df.dtypes gave the expected output and there were no NaN values, somehow the dataframe from this operation was not accepted by SFrame. I just explicitly converted every column to its proper datatype, and now all is good.
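The explicit conversion described above might look roughly like the sketch below. The frame, the column names, and the helper `cast_columns` are all illustrative assumptions; the actual data frame and its dtypes are not shown in this thread.

```python
import pandas as pd


def cast_columns(df, schema):
    """Return a copy of df with each column in schema cast to the given dtype."""
    return df.astype(schema)


# Illustrative frame only: object-dtype columns that actually hold numbers.
agg = pd.DataFrame({
    "user": pd.Series([1, 2], dtype=object),
    "score": pd.Series([0.5, 1.5], dtype=object),
})

# Cast every column to the dtype it is expected to have before building the SFrame.
typed = cast_columns(agg, {"user": "int64", "score": "float64"})
```

Spelling out the schema in one place also makes it easy to check which column is responsible if the conversion still fails.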