EntilZha / PyFunctional

Python library for creating data pipelines with chain functional programming
http://pyfunctional.pedro.ai
MIT License

seq() converts pandas DataFrame into Sequence #168

Open Arshaku opened 2 years ago

Arshaku commented 2 years ago
from functional import seq
from pandas import DataFrame

df = DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
s = seq([df])
el = s.first()
print(type(el))

This code prints "<class 'functional.pipeline.Sequence'>", but the expected output is "<class 'pandas.core.frame.DataFrame'>".

EntilZha commented 2 years ago

This is related to https://github.com/EntilZha/PyFunctional/issues/158, where the root issue is that, for convenience, I originally decided to wrap a little too aggressively. I think the fix would come in two steps: (1) add a configurable option to not wrap elements, with wrapping as the default, and (2) bump the version to 2.X and make not wrapping the default, to avoid breakage. I'd be open to a PR that does this.
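
For example, something along these lines (the no_wrap keyword below is hypothetical, just to show the shape the option could take):

from functional import seq
from pandas import DataFrame

df = DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})

# Step (1): a hypothetical opt-out flag; wrapping stays the default in 1.x
el = seq([df], no_wrap=True).first()
print(type(el))  # would print <class 'pandas.core.frame.DataFrame'>

# Step (2): a later 2.X release would flip the default to no_wrap=True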

reklanirs commented 2 years ago

Similar issue during reduce with the latest master.

Expected:

from functools import reduce

reduce(lambda x,y: x.add(y), [df,df])

Out[1]:
      A     B
0  24.0  14.0
1   8.0   4.0

Actual:

seq([df,df]).reduce(lambda x,y: x.add(y))

Out[2]:
[array([24., 14.]), array([8., 4.])]

to_pandas helps a bit, but the column names will be missing:

seq([df,df]).reduce(lambda x,y: x.add(y)).to_pandas()

Out[3]:
      0     1
0  24.0  14.0
1   8.0   4.0

I hope this can be fixed.

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Arshaku commented 2 years ago

Hi, any chance this issue will be fixed soon?

EntilZha commented 2 years ago

I don't have the bandwidth to contribute fixes myself right now, but I'd welcome/review pull requests that fix it roughly as outlined previously.

swiergot commented 1 year ago

@EntilZha Any reason why __getitem__() also wraps?

EntilZha commented 1 year ago

The reason I originally did it this way is the same, I wanted it to be easy to do something like:

In [1]: from functional import seq

In [2]: seq.range(10).grouped(3)[0].map(lambda x: x * 2)
Out[2]: [0, 2, 4]

In [3]: type(seq.range(10).grouped(3)[0].map(lambda x: x * 2))
Out[3]: functional.pipeline.Sequence

As I mentioned in my prior comments, in retrospect this has three issues: (1) there is no way to configure the behavior, namely to disable it, (2) even if it were configurable, it's probably incorrect to make wrapping the default in most cases; it should probably happen more sparingly, and (3) changing this is a breaking change, likely requiring a move to 2.x.
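
To make (3) concrete: if wrapping were simply switched off, chained code like the example above would break (hypothetical behavior, not what the library does today):

group = seq.range(10).grouped(3)[0]  # would now be a plain list such as [0, 1, 2]
group.map(lambda x: x * 2)           # would raise AttributeError, since list has no map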

I'd welcome/review PRs that would fix this, but don't have the time to do it myself right now. If you are interested, I can outline how I'd do this in a little more detail.

Thanks!

swiergot commented 1 year ago

@reklanirs

Actual:

seq([df,df]).reduce(lambda x,y: x.add(y))

Out[2]:
[array([24., 14.]), array([8., 4.])]

This is a different problem, caused by the fact that PyFunctional has special handling for DataFrame: for some reason it extracts the values from it.
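
You can see the same thing with a single frame (using the df from the first comment; the second output is inferred from the results above, I haven't traced the library code):

wrapped = seq([df]).first()
print(type(wrapped))      # <class 'functional.pipeline.Sequence'>, not a DataFrame
print(wrapped.to_list())  # the row values as numpy arrays; the column labels are gone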

to_pandas helps a bit, but the column names will be missing:

seq([df,df]).reduce(lambda x,y: x.add(y)).to_pandas()

Out[3]:
      0     1
0  24.0  14.0
1   8.0   4.0

How about this:

>>> seq(reduce(lambda x,y: x.add(y), [df,df])).to_pandas(df.columns)
   col1  col2
0     2     8
1     4    10
2     6    12
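
Or, staying entirely inside PyFunctional by passing the columns to the to_pandas call from the earlier snippet (untested, just combining the pieces above, so take the exact output with a grain of salt):

>>> seq([df, df]).reduce(lambda x, y: x.add(y)).to_pandas(df.columns)
   col1  col2
0     2     8
1     4    10
2     6    12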