alteryx / featuretools

An open source python library for automated feature engineering
https://www.featuretools.com
BSD 3-Clause "New" or "Revised" License

Why does featuretools use first column as an index but not the pandas index field? #130

Closed alexandrnikitin closed 6 years ago

alexandrnikitin commented 6 years ago

Hi,

Pandas creates an implicit index if one isn't specified as a column. I want to use the pandas index in featuretools, but it can't be passed by name in the index argument. featuretools uses the first column by default, and that part is not clear to me. Why does featuretools use the first column as an index rather than the pandas index field? How can I make featuretools use the index field instead?

The code: https://github.com/Featuretools/featuretools/blob/906777bbafc18892a927dfdc5ac3f3b8d40de1b5/featuretools/entityset/entityset.py#L441-L459

Seth-Rothschild commented 6 years ago

@alexandrnikitin to integrate your data with Featuretools, simply call .reset_index() on your dataframe. This turns the index into a regular column in your dataframe. See the pandas documentation for DataFrame.reset_index for an example of how this works.
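A minimal sketch of that step, using only pandas. The column names and the `row_id` label are made up for illustration. When the index is unnamed, `reset_index()` names the new column `index`, so the sketch renames it to something more descriptive.

```python
import pandas as pd

# Hypothetical data; pandas assigns an implicit RangeIndex (0, 1, 2).
df = pd.DataFrame({"amount": [10.0, 25.5, 7.25], "store": ["a", "b", "a"]})

# Move the index into a regular column. An unnamed index becomes a
# column called "index", which is renamed here to "row_id" (an
# illustrative name, not something featuretools requires).
flat = df.reset_index().rename(columns={"index": "row_id"})

print(list(flat.columns))
```

The name of the new column (`row_id` here) is then what would be passed as the index argument when loading the dataframe into featuretools. The exact loading call depends on the featuretools version in use.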

We treat the index the same way as other columns because it can be used just like any other column when it comes to feature engineering. For example, we may want to apply the Count primitive to it.

Another, more technical reason is that pandas indices can be a little idiosyncratic (e.g., they support things like multiple levels) compared to the concept of a primary key in other tabular data systems such as databases. To make our implementation map more generally, we made the design decision to keep the index as a normal column.
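To illustrate the multi-level case with pandas alone: a `MultiIndex` has no single-column counterpart, but `reset_index()` flattens each level into an ordinary column. The data and the surrogate key name `store_visit_id` below are assumptions for the sake of the example, not part of featuretools.

```python
import pandas as pd

# Hypothetical frame with a two-level index (store, visit).
multi = pd.DataFrame(
    {"amount": [10.0, 25.5, 7.25]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("b", 1)], names=["store", "visit"]
    ),
)

# Each index level becomes a regular column.
flat_multi = multi.reset_index()

# One possible way to get a single-column key: add a surrogate id.
# This is an illustrative choice, not a featuretools requirement.
flat_multi["store_visit_id"] = range(len(flat_multi))

print(list(flat_multi.columns))
```

After flattening, the former index levels are plain columns, so they behave like any other column, which is closer to how a primary key works in a database table.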

I'm going to close this for now. If you have further questions on how to get this to work, feel free to post on Stack Overflow with the featuretools tag.