alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License

“IndexError: Too many levels” when running Featuretools dfs after upgrade #252

Closed jrkinley-zz closed 6 years ago

jrkinley-zz commented 6 years ago

Featuretools' dfs() method fails to run on my entity set after upgrading from v0.1.21 to v0.2.x and v0.3.0.

The error is raised when the Pandas backend tries to calculate the aggregate features in _calculate_agg_features(). In particular:

--> 442 to_merge.reset_index(1, drop=True, inplace=True)
...
IndexError: Too many levels: Index has only 1 level, not 2

This is working fine in v0.1.x and the entity set hasn't changed after the upgrade. The entity set is composed of 7 entities and 6 relationships. Each entity (dataframe) is added via entity_from_dataframe.

jrkinley-zz commented 6 years ago

pip freeze

alabaster==0.7.7
anaconda-client==1.4.0
anaconda-navigator==1.1.0
argcomplete==1.0.0
astropy==1.1.2
Babel==2.2.0
backports-abc==0.4
backports.shutil-get-terminal-size==1.0.0
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.4.1
bitarray==0.8.1
blaze==0.9.1
bokeh==0.11.1
boto==2.39.0
boto3==1.7.40
botocore==1.10.40
Bottleneck==1.0.0
cdecimal==2.3
cdsw==1.0.0
cffi==1.5.2
chest==0.2.3
click==6.7
cloudpickle==0.5.3
clyent==1.2.1
colorama==0.3.7
conda==4.0.5
conda-build==1.20.0
conda-env==2.4.5
conda-manager==0.3.1
configobj==5.0.6
cryptography==1.3
cycler==0.10.0
Cython==0.25.2
cytoolz==0.7.5
dask==0.19.1
datashape==0.5.1
decorator==4.3.0
dill==0.2.4
distributed==1.23.1
docopt==0.6.2
docutils==0.12
dynd==0.7.3.dev1
enum34==1.1.6
et-xmlfile==1.0.1
fastcache==1.0.2
featuretools==0.1.21
Flask==0.12
Flask-Cors==2.1.2
funcsigs==0.4
functools32==3.2.3.post2
future==0.16.0
futures==3.2.0
fuzzywuzzy==0.16.0
gevent==1.1.0
greenlet==0.4.9
grin==1.2.1
h5py==2.5.0
hdfs==2.1.0
HeapDict==1.0.0
ibis==1.6.0
ibis-framework==0.13.0
idna==2.0
impala==0.2
impyla==0.14.1
ipaddress==1.0.14
ipykernel==4.3.1
ipython==5.1.0
ipython-genutils==0.2.0
ipywidgets==4.1.1
itsdangerous==0.24
jdcal==1.2
jedi==0.9.0
Jinja2==2.10
jmespath==0.9.3
jsonschema==2.4.0
jupyter==1.0.0
jupyter-client==4.2.2
jupyter-console==4.1.1
jupyter-core==4.1.0
kudu-python==1.2.0
llvmlite==0.9.0
locket==0.2.0
lxml==3.6.0
MarkupSafe==1.0
matplotlib==2.0.0
mistune==0.7.2
mpmath==0.19
msgpack==0.5.6
multipledispatch==0.4.8
nbconvert==4.1.0
nbformat==4.4.0
networkx==1.11
nltk==3.2
nose==1.3.7
notebook==4.1.0
numba==0.24.0
numexpr==2.5
numpy==1.14.5
odo==0.4.2
openpyxl==2.3.2
pandas==0.23.1
pandas-datareader==0.2.1
partd==0.3.2
path.py==0.0.0
pathlib2==2.3.2
patsy==0.4.0
pep8==1.7.0
pexpect==4.6.0
pickleshare==0.7.4
Pillow==3.1.1
plotly==2.5.1
ply==3.8
prompt-toolkit==1.0.15
psutil==4.1.0
ptyprocess==0.5.2
py==1.4.31
py4j==0.10.7
pyasn1==0.1.9
pycairo==1.10.0
pycosat==0.6.1
pycparser==2.14
pycrypto==2.6.1
pycurl==7.19.5.3
pyflakes==1.1.0
Pygments==2.2.0
Pympler==0.5
pyOpenSSL==0.15.1
pyparsing==2.2.0
pytest==2.8.5
python-dateutil==2.7.3
python-Levenshtein==0.12.0
pytz==2018.4
PyYAML==3.12
pyzmq==15.2.0
QtAwesome==0.3.2
qtconsole==4.2.0
QtPy==1.0
redis==2.10.3
regex==2018.2.21
requests==2.13.0
requests-file==1.4.3
rope==0.9.4
s3fs==0.1.5
s3transfer==0.1.13
sasl==0.2.1
scandir==1.7
scikit-image==0.12.3
scikit-learn==0.19.1
scipy==1.1.0
seaborn==0.8
simplegeneric==0.8.1
simplejson==3.10.0
singledispatch==3.4.0.3
six==1.11.0
snowballstemmer==1.2.1
sockjs-tornado==1.0.1
sortedcontainers==2.0.5
sphinx-rtd-theme==0.1.9
spyder==2.3.8
SQLAlchemy==1.0.12
statsmodels==0.6.1
subprocess32==3.5.2
sympy==1.0
tables==3.2.2
tblib==1.3.2
terminado==0.5
thrift==0.9.3
thrift-sasl==0.2.1
thriftpy==0.3.9
toolz==0.9.0
tornado==5.1
tqdm==4.23.4
traitlets==4.3.2
unicodecsv==0.14.1
wcwidth==0.1.7
Werkzeug==0.14.1
xlrd==0.9.4
XlsxWriter==0.8.4
xlwt==1.0.0
zict==0.1.3

jrkinley-zz commented 6 years ago

Unfortunately I can't share any code or data, but to give you an idea the entity set is composed of 7 entities and 6 relationships that share a common join key. The first 2 dataframes have a unique index that I specify when calling entity_from_dataframe. The other 5 dataframes don't have a unique index column so I specify both index and make_index when calling entity_from_dataframe. This works ok in v0.1.21.

I don't think I'm doing anything out of the ordinary when calling dfs. I specify a couple of seed features, pass it the cutoff dates, and make use of drop_contains and drop_exact.

If I'm interpreting the error correctly, it looks like _calculate_agg_features expects the underlying dataframes to have a multi-level index, given to_merge.reset_index(1, drop=True, inplace=True), and it's failing because the specific dataframe has only a single level (level 0).
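The reading of the error above can be checked in isolation with plain pandas. This is only a minimal sketch, not the featuretools code: the frames, the column name `value`, and the helper `drop_second_level` are all hypothetical, and the guard shown is just one possible way to avoid the failure, not the fix that was eventually merged.

```python
import pandas as pd


def drop_second_level(frame):
    """Drop index level 1 only if the index really has more than one level.

    Hypothetical helper for illustration; it is not part of featuretools.
    """
    if frame.index.nlevels > 1:
        return frame.reset_index(level=1, drop=True)
    return frame


# A frame with a two-level index, which reset_index(level=1) can handle.
multi = pd.DataFrame(
    {"value": [1, 2]},
    index=pd.MultiIndex.from_tuples([("a", 0), ("b", 1)], names=["id", "time"]),
)

# A frame with a single-level index, like the one the error message describes.
single = pd.DataFrame({"value": [1, 2]}, index=pd.Index(["a", "b"], name="id"))

# Asking for level 1 on a single-level index raises IndexError.
try:
    single.reset_index(level=1, drop=True)
    error_message = None
except IndexError as exc:
    error_message = str(exc)

print(error_message)
print(drop_second_level(multi).index.nlevels)
print(list(drop_second_level(single).index))
```

The guarded helper leaves the single-level frame untouched, which is why checking `index.nlevels` first sidesteps the exception.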

kmax12 commented 6 years ago

@jrkinley I cannot reproduce, but after looking at the code, I was able to refactor our implementation to not require the reset index in #250, which results in cleaner code and may resolve your problem.

Can you try to install that branch of featuretools and run your code? You can install that branch using pip with this command:

pip install -e git://github.com/featuretools/featuretools.git@clean-agg-merge#egg=featuretools

Let us know if it helps!

jrkinley-zz commented 6 years ago

@kmax12 Thanks for the patch. Unfortunately it results in another error:

...
calculate_feature_matrix.py in calc_results (316)
pandas_backend.py in calculate_all_features (196)
pandas_backend.py in _calculate_agg_features (486)
...
KeyError: u'TREND(<entity>.<variable>, <time_index>)'

featuretools_issues_252_keyerror.txt

jrkinley-zz commented 6 years ago

@kmax12, your change appears to have got past the point where the first IndexError was thrown. The new KeyError is being thrown when checking if any of the features in the dataframe are of boolean type. At this point the feature in question appears to be missing.

... frame[f.get_name()].dtype.name in ['object', 'bool']): ...
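To illustrate the failure mode described here: indexing a dataframe with a column name that is not present raises KeyError before the dtype check can run. The sketch below uses plain pandas with made-up feature names and data; it shows one way to list the missing columns first, and is not the actual featuretools implementation.

```python
import pandas as pd

# Hypothetical result frame: the TREND column was never computed.
frame = pd.DataFrame(
    {
        "COUNT(sessions)": [3, 1],
        "IS_WEEKEND(session_start)": [True, False],
    }
)

# Hypothetical names of the features the backend expects to find.
feature_names = [
    "COUNT(sessions)",
    "IS_WEEKEND(session_start)",
    "TREND(sessions.amount, session_start)",
]

# Find expected features that have no column in the frame.
missing = [name for name in feature_names if name not in frame.columns]

# Apply the dtype check only to columns that actually exist.
bool_like = [
    name
    for name in feature_names
    if name in frame.columns and frame[name].dtype.name in ("object", "bool")
]

# Direct indexing with a missing name is what raises the KeyError.
try:
    frame["TREND(sessions.amount, session_start)"]
    raised_key_error = False
except KeyError:
    raised_key_error = True

print(missing)
print(bool_like)
```

Printing `missing` before the dtype check makes it easier to tell whether only one feature is absent or several.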

kmax12 commented 6 years ago

@jrkinley thanks for looking into this. if you remove the trend primitive does the error still occur? can you tell if other features are missing?

alexelgier commented 6 years ago

I am having a similar problem. I was getting the error "IndexError: Too many levels: Index has only 1 level, not 2", and after installing this branch I am getting a KeyError on a TIME_SINCE_PREVIOUS feature. Removing TIME_SINCE_PREVIOUS from the primitives I'm using didn't help: I started getting a KeyError with TIME_SINCE_LAST, and after removing that one, a KeyError on TREND.

Any help would be appreciated, as it seems this isn't just happening to me.

kmax12 commented 6 years ago

@alexelgier thanks. we are able to reproduce and are looking into it now

kmax12 commented 6 years ago

@jrkinley @alexelgier can you try the branch handle-empty-baseframe and see if it solves your problem?

thanks again for helping us out!

pip install -e git://github.com/featuretools/featuretools.git@handle-empty-baseframe#egg=featuretools

alexelgier commented 6 years ago

I've installed the branch and am running the code currently, will let you know if it works =) Thanks so much for the quick response

alexelgier commented 6 years ago

We're no longer getting the IndexError or the KeyError, but now we're getting an AttributeError:

  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/computational_backends/calculate_feature_matrix.py", line 258, in calculate_feature_matrix
    pass_columns=pass_columns)
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/computational_backends/calculate_feature_matrix.py", line 520, in linear_calculate_chunks
    backend=backend)
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/computational_backends/calculate_feature_matrix.py", line 342, in calculate_chunk
    training_window=window)
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/computational_backends/utils.py", line 34, in wrapped
    r = method(*args, *kwargs)
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/computational_backends/calculate_feature_matrix.py", line 316, in calc_results
    profile=profile)
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/computational_backends/pandas_backend.py", line 196, in calculate_all_features
    result_frame = handler(group, input_frames)
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/computational_backends/pandas_backend.py", line 313, in _calculate_transform_features
    values = feature_func(variable_data)
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/primitives/transform_primitive.py", line 207, in pd_diff
    return grouped_df[bf_name].apply(lambda x: x.total_seconds())
  File "/home/mlgroup/NRM/venv/lib/python3.6/site-packages/pandas/core/series.py", line 3194, in apply
    mapped = lib.map_infer(values, f, convert=convert_dtype)
  File "pandas/_libs/src/inference.pyx", line 1472, in pandas._libs.lib.map_infer
  File "/home/mlgroup/NRM/venv/src/featuretools/featuretools/primitives/transform_primitive.py", line 207, in <lambda>
    return grouped_df[bf_name].apply(lambda x: x.total_seconds())
AttributeError: 'float' object has no attribute 'total_seconds'

kmax12 commented 6 years ago

@alexelgier it looks like you have an incorrect underlying datatype for the datetime column used by a TimeSincePrevious feature. Can you check that the time index in each entity is all datetimes and has no NaN values?
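A check along these lines can be run on each source dataframe before it is added to the entity set. This is a rough sketch in plain pandas: the helper `check_time_column`, the column names, and the sample data are hypothetical, and it does not inspect the EntitySet itself.

```python
import pandas as pd


def check_time_column(df, column):
    """Report the dtype of `column` and how many of its values are missing.

    Hypothetical helper for illustration; it is not part of featuretools.
    """
    series = df[column]
    return {
        "dtype": str(series.dtype),
        "is_datetime": pd.api.types.is_datetime64_any_dtype(series),
        "n_missing": int(series.isna().sum()),
    }


# Hypothetical data: one clean time column and one with a missing timestamp.
events = pd.DataFrame(
    {
        "clean_time": pd.to_datetime(["2018-01-01", "2018-01-02", "2018-01-03"]),
        "gappy_time": pd.to_datetime(["2018-01-01", None, "2018-01-03"]),
    }
)

clean_report = check_time_column(events, "clean_time")
gappy_report = check_time_column(events, "gappy_time")

# A plain Python float (for example NaN) has no total_seconds method, which
# matches the wording of the AttributeError in the traceback above.
float_has_total_seconds = hasattr(float("nan"), "total_seconds")

# The vectorized .dt accessor converts timedeltas to seconds and keeps
# missing entries as NaN instead of raising.
clean_seconds = events["clean_time"].diff().dt.total_seconds()

print(clean_report)
print(gappy_report)
```

If `is_datetime` is False or `n_missing` is greater than zero for a time index column, that column is a likely candidate for the problem.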

jrkinley-zz commented 6 years ago

@kmax12 the branch is working for me. Thanks for your help!

alexelgier commented 6 years ago

I've checked the EntitySet and the data seems ok. Is there any other reason I might be getting this error?

kmax12 commented 6 years ago

@alexelgier can you share your data or a reproducible example? you can email us at help@featuretools.com.

alexelgier commented 6 years ago

Sadly because of legal issues I cannot share the data I'm working on.

I've checked the EntitySet and all the time indexes in my entities are of type datetime_time_index and have no missing values.

Is there any other reason I might be getting this error? Perhaps you could further suggest how I could debug this.

kmax12 commented 6 years ago

@alexelgier the problem here appears to be with the TimeSincePrevious primitive. Can you open another issue for this discussion?

alexelgier commented 6 years ago

Will do. Thanks for the help!

pabloazurduy commented 5 years ago

Hi, is this branch still available?

Did not find branch or tag 'handle-empty-baseframe', assuming revision or ref.
error: pathspec 'handle-empty-baseframe' did not match any file(s) known to git.

kmax12 commented 5 years ago

@pabloazurduy it's been merged into master, so it should be in the latest release v0.5.1 of featuretools. if you're hitting an error still, please open a new issue