EntilZha / PyFunctional

Python library for creating data pipelines with chain functional programming
http://pyfunctional.pedro.ai
MIT License
2.41k stars 132 forks source link

for_each doesn't return the original sequence but returns None #153

Closed bmsan closed 3 years ago

bmsan commented 3 years ago

I would have expected for_each(func) to apply the func function but to also return the unmodified sequence.

Eg I would expect the following code, not to fail(as it happens in my case):

    l = (seq([1, 2, 3, 4])
    .for_each(print)
    .map(lambda x: x + 1))
   print(l)

But to produce:

1
2
3
4
[2, 3, 4, 5]

My fix: was to modify in functional/pipeline.py for_each to return self

pip freeze | grep pyfunctional
pyfunctional==1.4.2
python --version
Python 3.7.6

pyfunctional was installed today using pip install pyfunctional

I will shortly create a pull request with this functionality & test associated with it

bmsan commented 3 years ago

Created a pull request for updated for_each function and unit tests: https://github.com/EntilZha/PyFunctional/pull/154

EntilZha commented 3 years ago

In all the cases that I've seen, foreach usually doesn't return anything including itself. This is the case for both of pyfunctional's main sources of inspiration: Apache Spark (https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.foreach) and Scala collections (https://www.scala-lang.org/api/current/scala/collection/immutable/List.html#foreach%5BU%5D(f:A=%3EU):Unit). Is there a good reason to depart from this convention?

The main reason I think this is how these APIs are designed is that if you would like to apply a function and return the original collection you could do so with map like collection.map(lambda x: print(x);f(x)). However, if for_each were changed then there would not be a way to apply the value without returning anything.

bmsan commented 3 years ago

Hi @EntilZha

Thank you for your quick & thorough response. I appreciate it!

I was looking for an observer like pattern, but to be able to continue the sequence processing. That can be obtained by your map example as well, even though it will be a little bit more verbose.

Keeping the API similar to Apache Spark & Scala can be a powerful reason not change the for_each functionality.

Playing the devil's advocate, returning self instead of None doesn't produce extra computations but gives the user the ability to continue the pipeline processing if needed. So you gain flexibility, without breaking any other operation that the user could potentially use.

My main reason for filling this issue was that I wasn't treating this function as a transformation to the sequence, but as an action that doesn't have side effects over the original sequence.

Please keep in mind I am not extremely familiar with functional programming so I would understand if you'd chose not to diverge from the Apache Spark/Scala API. Cheers!

EntilZha commented 3 years ago

An alternative to consider is leaving for_each as is, but adding a new function with the semantics you're proposing. I'd rather leave for_each as is, but if this pattern is common then I could get behind having it as an additional function/api. Does this type of function have a common name in other libraries? If we were to add it, it would be great to have a name that users from other libraries expecting that API would recognize.

bmsan commented 3 years ago

After looking around at multiple languages that somewhat support functional programming, all of them return the equivalent of python None in their foreach loop. So maybe my idea would cbe onfusing for someone who is coming from a completely different language with functional support.

Also I wasn't able to find an equivalent function which returns something. The closest thing I found(but this is a completely different direction) is in Kotlin where you have apply() that applies a function to the sequence itself and returns the sequence.

seq.apply{forEach{println("$it")}}.map{it+1}

But this opens a totally different can of worms, because that apply basically creates a 'fork' of the sequence so it doesn't seem trivial to implement.

On top of my mind there could be something like "observe()/inspect()" which would do for_each + return self. But if I think at the big picture adding a new function that doesn't have a correspondent to any other language should be done only if you consider this could be useful to a wider audience than me & my current usecase :)

All in all maybe I should stick to your idea of collection.map(lambda x: print(x);f(x)) since I cannot find an equivalent in another language.

EntilZha commented 3 years ago

Perhaps a middle ground would be to use the extend decorator to give it a try for a while and if it seems worthwhile/you find other instances of it being used, then can consider adding permanently? The feature is a bit undocumented outside of the docstring and unit tests from https://github.com/EntilZha/PyFunctional/pull/144

Docs: https://github.com/EntilZha/PyFunctional/blob/291a4917304263386b720a98af29ddacaa3f5680/functional/pipeline.py#L1814

bmsan commented 3 years ago

The extend functionality seems awesome and it definitely resolves my usecase. Thanks @EntilZha ! I really appreciate your input and help here.

Kache commented 3 years ago

I would opt to have for_each return the original sequence. This allows it to play well with OO elements as well as continue the chain.

For example:

(
    seq(1, 2, 3)
        .for_each(lambda n: logging.info(n))  # just throw this into the middle of any chain
        .map(lambda n: 2 * n) 
)

(
    seq(collection_of_objects)
        .for_each(lambda foo: foo.mutate_or_something())  # mutate_or_something might return None, we don't care
        .map(lambda foo: foo.different_obj)  # keep going
)

This is extremely common in Ruby, as it's the normal behavior of Enumerable#each.