MLBazaar / MLPrimitives

Primitives for machine learning and data science.
https://mlbazaar.github.io/MLPrimitives
MIT License

pandas.DataFrame.resample crash when grouping by integer columns #211

Closed: csala closed this issue 4 years ago

csala commented 4 years ago

Description

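When the pandas.DataFrame.resample primitive groups by one or more integer columns and reset_index=True, it raises a ValueError. The integer columns are still present as regular columns in the resampled dataframe, so resetting the index tries to insert them a second time. Grouping by a string column with the same arguments does not trigger the error.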

What I Did

In [1]: import pandas as pd

In [2]: df = pd.DataFrame({
   ...:     'time': ['2010-01-01', '2010-01-02', '2010-01-03'],
   ...:     'str_id': ['a', 'b', 'c'],
   ...:     'int_id': [1, 2, 3],
   ...:     'value': [1, 2, 3]
   ...: })

In [3]: df['time'] = pd.to_datetime(df['time'])

In [4]: from mlblocks import MLBlock

In [5]: block = MLBlock('pandas.DataFrame.resample', rule='1D', on='time',
   ...:                  groupby=['str_id'], aggregation='mean', reset_index=True)

In [6]: block.produce(X=df)
Out[6]:
  str_id       time  int_id  value
0      a 2010-01-01       1      1
1      b 2010-01-02       2      2
2      c 2010-01-03       3      3

In [7]: block = MLBlock('pandas.DataFrame.resample', rule='1D', on='time',
   ...:                  groupby=['int_id'], aggregation='mean', reset_index=True)

In [8]: block.produce(X=df)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-8-2f66b517b674> in <module>
----> 1 block.produce(X=df)

...
   1147         if not allow_duplicates and item in self.items:
   1148             # Should this be a different kind of error??
-> 1149             raise ValueError('cannot insert {}, already exists'.format(item))
   1150
   1151         if not isinstance(loc, int):

ValueError: cannot insert int_id, already exists
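
The same error can be triggered without MLBlocks. The following is a minimal sketch in plain pandas, assuming the resampled frame looks the way the description says: int_id is both an index level and a regular column. The frame is built by hand rather than through the primitive, and the `resampled`, `duplicated` and `fixed` names are illustrative only. Dropping the clashing columns before resetting the index is one possible workaround, not necessarily the fix adopted in MLPrimitives.

```python
import pandas as pd

# Hand-built stand-in (an assumption) for the frame the primitive holds
# right before reset_index: 'int_id' is an index level AND a column.
index = pd.MultiIndex.from_arrays(
    [
        [1, 2, 3],
        pd.to_datetime(['2010-01-01', '2010-01-02', '2010-01-03']),
    ],
    names=['int_id', 'time'],
)
resampled = pd.DataFrame(
    {'int_id': [1.0, 2.0, 3.0], 'value': [1.0, 2.0, 3.0]},
    index=index,
)

# reset_index tries to insert 'int_id' as a new column and refuses,
# because a column with that name already exists.
try:
    resampled.reset_index()
    error = None
except ValueError as exc:
    error = str(exc)

print(error)

# Possible workaround: drop the columns that clash with index level
# names, then reset the index.
duplicated = [name for name in resampled.index.names if name in resampled.columns]
fixed = resampled.drop(columns=duplicated).reset_index()

print(fixed)
```

With the clashing column removed, reset_index restores int_id and time from the index levels, so the group keys are kept and only the aggregated copy of int_id is discarded.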