EntilZha / PyFunctional

Python library for creating data pipelines with chain functional programming
http://pyfunctional.pedro.ai
MIT License
2.41k stars 132 forks source link

Unexpected Sequence wrapping from grouped #158

Closed evanw2 closed 3 years ago

evanw2 commented 3 years ago

When using 'grouped' on a list of lists, the inner lists are converted into Sequence, but only for the first one in the group. This breaks another library I was using that expects the elements of the batch being sent to be a list specifically.

Example:

from functional import seq
s = [[i,i+1] for i in range(10)]
for batch in seq(s).grouped(3):
    for item in batch:
        print(f'{item} -- {type(item)}')

Output:

[0, 1] -- <class 'functional.pipeline.Sequence'>
[1, 2] -- <class 'list'>
[2, 3] -- <class 'list'>
[3, 4] -- <class 'functional.pipeline.Sequence'>
[4, 5] -- <class 'list'>
[5, 6] -- <class 'list'>
[6, 7] -- <class 'functional.pipeline.Sequence'>
[7, 8] -- <class 'list'>
[8, 9] -- <class 'list'>
[9, 10] -- <class 'functional.pipeline.Sequence'>
EntilZha commented 3 years ago

At minimum, either all the groups should be wrapped or shouldn't be wrapped. Any thoughts on why that might be happening (I'll be busy for next few weeks so not much time to debug)?

The intention was to wrap sub-groups when I wrote this initially, but there should at least be an option to disable that behavior. I'd welcome a pull request that adds that (whether its a global option or a keyword arg to grouped

evanw2 commented 3 years ago

Agreed that at a minimum, it should be more consistent in the wrapping, either with all the group elements or none. To me it seems like better default behavior to not attempt to wrap the elements at all. It sounds like the intention was to wrap the batch itself, not the elements of the batch, which makes more sense to me than wrapping the batch elements.

I tried to trace it a bit. Appears that the critical section of code is here: https://github.com/EntilZha/PyFunctional/blob/master/functional/transformations.py#L579

    iterator = iter(sequence)
    try:
        while True:
            batch = islice(iterator, size)
            yield list(chain((wrap(next(batch)),), batch))

So, if I understand correctly, the batch is selected with islice. We get the first element of the batch with next, wrap it with the wrapping function, then rejoin that first element with the rest of the batch by chaining them together again. If the intention is to wrap the batch itself, it seems like pulling the first element of the iterator and chaining it together again is unnecessary, and perhaps it should just be the following instead:

            yield wrap(list(batch))

I was looking at other issues, and saw that maybe there was some other reason for pulling off the first element of the batch iterator here: https://github.com/EntilZha/PyFunctional/issues/120#issuecomment-353800792

I might try to make a PR for this later this week or weekend.

evanw2 commented 3 years ago

I reviewed the file history a bit to better understand the current implementation.

https://github.com/EntilZha/PyFunctional/commit/8ce33b1dec7c11d7f4c09d36f83fec9586d51908

In the above commit, the implementation was refactored to use iterators instead of list slices. It appears this was to avoid calling len on the sequence, since in some cases one might not want to traverse the full sequence in advance to determine the length. The previous implementation before this commit seems to make it clear though that the wrap function was only ever intended to wrap the batch itself, not any of the elements of the batch.

https://github.com/abybaddi009/PyFunctional/commit/c2ef2f028d812fd712f60590657b87e520771c9a

In issue #120, someone was surprised that the batches were returned as itertools.chain, so the batch was wrapped in a list.

From the above, I think it's clear that if anything, the wrap function should only be applied to the outer group, not the elements. Wrapping is sort of in conflict with the previous change of returning as a list though. My preference would be to keep the return type as a list, and to not wrap at all in the grouped method.

I have a branch here that might be a reasonable fix, let me know what you think: https://github.com/evanw2/PyFunctional/tree/pr-issue-158

I didn't have a good reason for swapping the nesting of the while and the try/except. Will swap that back.

EntilZha commented 3 years ago

Thanks for the detective work. Not wrapping them seems reasonable to me. Originally I had intended to wrap sub-groups in functions like these so that you could do seq.range(10).grouped(3).map(lambda x: x.map(lambda n: n + 1)), but I think in retrospect it would be better to force explicitness, so force the seq(x).map.... Otherwise there isn't a way to opt out and its unclear how deep down that path to go.

I think I'd be good with a PR that passes (possibly modified) tests, adds another test to make sure that the returned groups are of type list, and just confirm it doesn't cause the full list to be pre-computed ahead of time (this could just be showing terminal output for a map function with side effects, like printing).

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stephan-rayner commented 3 years ago

FYI: I am going to take a shot at this today. I originally started working on this before realising that @evanw2 had already suggested a similar change to mine.

I concur with @evanw2 the issue is right here

When call giving _wrap a list the function converts that list into a sequence. It seems as though in the implementation linked we are only wrapping the first element of the group and since that is a list it is being converted to a sequence. You can see this if you take the original example and change the grouping value.

 4 yields:
 [0, 1] -- <class 'functional.pipeline.Sequence'>
[1, 2] -- <class 'list'>
[2, 3] -- <class 'list'>
[3, 4] -- <class 'list'>
[4, 5] -- <class 'functional.pipeline.Sequence'>
[5, 6] -- <class 'list'>
[6, 7] -- <class 'list'>
[7, 8] -- <class 'list'>
[8, 9] -- <class 'functional.pipeline.Sequence'>
[9, 10] -- <class 'list'>

2 yields:
[0, 1] -- <class 'functional.pipeline.Sequence'>
[1, 2] -- <class 'list'>
[2, 3] -- <class 'functional.pipeline.Sequence'>
[3, 4] -- <class 'list'>
[4, 5] -- <class 'functional.pipeline.Sequence'>
[5, 6] -- <class 'list'>
[6, 7] -- <class 'functional.pipeline.Sequence'>
[7, 8] -- <class 'list'>
[8, 9] -- <class 'functional.pipeline.Sequence'>
[9, 10] -- <class 'list'>

This doesn't occur for sets Interesting note I believe this may have been the same behaviour

I put a few test cases together they will be part of the PR.

PASS: [None for i in range(10)],
PASS: [i for i in range(10)],
PASS: [{i} for i in range(10)],
PASS: [{i, i+1} for i in range(10)],
FAIL: [[i, i+1] for i in range(10)]
FAIL: [[i] for i in range(10)],

A pass here is when all elements retain their type after the grouped method is applied to the sequence.

Should have that PR up soon :slightly_smiling_face: .

EntilZha commented 3 years ago

Thanks for all the great work, merged the related PR #160 so closing the issue