PAIR-code / facets

Visualizations for machine learning datasets
https://pair-code.github.io/facets/
Apache License 2.0

ProtoFromDataFrames fails for dataframes with categorical columns #237

Open ysayeed opened 3 years ago

ysayeed commented 3 years ago

When attempting to create the proto for facets-overview, the operation fails with an AttributeError if any of the columns are categorical. I would expect it to parse the dataframe properly, treating the category dtype as a string and displaying it in the "Categorical Features" section the same way a string column is displayed.

Below is example code to produce this error and the traceback:

from facets_overview.generic_feature_statistics_generator import GenericFeatureStatisticsGenerator
import pandas as pd
df = pd.DataFrame({'col1': pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])})
proto = GenericFeatureStatisticsGenerator().ProtoFromDataFrames([{'name': 'test', 'table': df}])

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../facets_overview/base_generic_feature_statistics_generator.py", line 54, in ProtoFromDataFrames
    table_entries[col] = self.NdarrayToEntry(table[col])
  File ".../facets_overview/base_generic_feature_statistics_generator.py", line 119, in NdarrayToEntry
    data_type = self.DtypeToType(x.dtype)
  File ".../facets_overview/base_generic_feature_statistics_generator.py", line 66, in DtypeToType
    if dtype.char in np.typecodes['AllFloat']:
AttributeError: 'CategoricalDtype' object has no attribute 'char'

This is using facets-overview 1.0.0 and pandas 1.1.4.

jameswex commented 3 years ago

Yes, it looks like the facets overview code doesn't support the Categorical type. You can convert the column to a series of standard strings, and then the proto creation should work.
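A minimal sketch of that workaround, assuming the categories are themselves strings. The helper name categoricals_to_strings is hypothetical and not part of facets-overview; it only changes the column dtype to object, so that the dtype has the char attribute the traceback shows being read.

```python
import pandas as pd


def categoricals_to_strings(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with every category column stored as plain objects.

    Hypothetical helper for illustration; assumes the categories are strings.
    """
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.CategoricalDtype):
            # astype(object) materialises the category values as ordinary
            # Python objects, giving a NumPy object dtype with a `.char`.
            out[col] = out[col].astype(object)
    return out


df = pd.DataFrame({'col1': pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])})
plain = categoricals_to_strings(df)
```

The converted frame would then be passed to ProtoFromDataFrames in place of the original, as in the report above.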

In order for this code to work on Categorical series out of the box, https://github.com/PAIR-code/facets/blob/master/facets_overview/python/base_generic_feature_statistics_generator.py#L69 would need to be updated to check for the Categorical dtype and return self.fs_proto.STRING in that case, before the current checks that use dtype.char (since the Categorical type doesn't have the char member variable).
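The check described above could look roughly like the following. This is a standalone sketch, not the library's actual method: the function name dtype_to_type and the string labels 'FLOAT', 'INT', and 'STRING' are stand-ins for the fs_proto enum values, and the branches after the Categorical check are assumed from the one line visible in the traceback.

```python
import numpy as np
import pandas as pd


def dtype_to_type(dtype):
    """Map a column dtype to a coarse type label (stand-in for the proto enum)."""
    # Pandas extension dtypes such as CategoricalDtype have no `.char`,
    # so this check must come before any branch that reads `dtype.char`.
    if isinstance(dtype, pd.CategoricalDtype):
        return 'STRING'
    if dtype.char in np.typecodes['AllFloat']:
        return 'FLOAT'
    if dtype.char in np.typecodes['AllInteger'] or dtype == np.bool_:
        return 'INT'
    # Everything else (for example NumPy object dtype) is treated as a string.
    return 'STRING'


categorical_dtype = pd.Series(pd.Categorical(['a', 'b'])).dtype
label = dtype_to_type(categorical_dtype)
```

Putting the extension-dtype test first keeps the existing char-based branches untouched for ordinary NumPy dtypes.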

ysayeed commented 3 years ago

Thanks, that workaround solves things for me.

hermanashley commented 1 year ago

I am running into a similar error, but in my case it fails on string data.

  File "/ashley/.cache/pypoetry/virtualenvs/test-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 54, in ProtoFromDataFrames
    table_entries[col] = self.NdarrayToEntry(table[col])
  File "/ashley/.cache/pypoetry/virtualenvs/test-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 119, in NdarrayToEntry
    data_type = self.DtypeToType(x.dtype)
  File "/ashley/.cache/pypoetry/virtualenvs/test-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 66, in DtypeToType
    if dtype.char in np.typecodes['AllFloat']:
AttributeError: 'StringDtype' object has no attribute 'char'

This is using Python 3.8, pandas 1.4, and facets-overview 1.0.0.

Would appreciate some help!

jameswex commented 1 year ago

The facets code is quite old and doesn't contain support for the newer StringDtype for string values. If you instead use the standard "object" type for the strings, the code should work.
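One way to do that conversion is sketched below, under the assumption that the affected columns use pandas' StringDtype and contain no missing values. The helper name strings_to_object is hypothetical.

```python
import pandas as pd


def strings_to_object(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with StringDtype columns stored as object dtype.

    Hypothetical helper for illustration only.
    """
    out = df.copy()
    for col in out.columns:
        if isinstance(out[col].dtype, pd.StringDtype):
            # Object dtype is a NumPy dtype, so it exposes `.char`.
            out[col] = out[col].astype(object)
    return out


df = pd.DataFrame({'name': pd.array(['x', 'y'], dtype='string')})
converted = strings_to_object(df)
```

Columns that already use NumPy dtypes are left as they are.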

hermanashley commented 1 year ago

@jameswex Thank you! It turned out I had to convert the Int64Dtype columns as well. This possibly belongs in another thread, but I am seeing a new error after doing the type conversion:

    proto_str = GenericFeatureStatisticsGenerator().ProtoFromDataFrames(dfs).SerializeToString()
  File "/ashley/.cache/pypoetry/virtualenvs/scorecard-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 60, in ProtoFromDataFrames
    return self.GetDatasetsProto(
  File "/ashley/.cache/pypoetry/virtualenvs/scorecard-BIYvDDBt-py3.8/lib/python3.8/site-packages/facets_overview/base_generic_feature_statistics_generator.py", line 284, in GetDatasetsProto
    sample_count=np.asscalar(val[0]),
  File "/ashley/.cache/pypoetry/virtualenvs/scorecard-BIYvDDBt-py3.8/lib64/python3.8/site-packages/numpy/__init__.py", line 311, in __getattr__
    raise AttributeError("module {!r} has no attribute "
AttributeError: module 'numpy' has no attribute 'asscalar'

Any insight?

jameswex commented 1 year ago

I believe it has to do with your numpy version. See https://numpy.org/doc/1.21/reference/generated/numpy.asscalar.html

You can downgrade numpy or update the facets code to use the appropriate replacement method.
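For reference, np.asscalar was deprecated in NumPy 1.16 in favour of the item() method and removed in NumPy 1.23. The sketch below shows the direct replacement and, as a stopgap for code you cannot edit, a compatibility shim. The variable val is made up for the example, and the shim is a monkeypatch rather than a proper fix.

```python
import numpy as np

# Direct replacement: call .item() on the NumPy scalar or one-element array.
val = np.array([6])            # illustrative value, not taken from facets
sample_count = val[0].item()   # equivalent of the old np.asscalar(val[0])

# Stopgap shim: restore the removed name before the old code runs.
# This patches the numpy module globally, so treat it as a temporary hack.
if not hasattr(np, 'asscalar'):
    np.asscalar = lambda a: a.item()
```

Editing the facets source to call item() directly avoids touching the numpy module at all, which is the cleaner of the two options.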