Open tomjaguarpaw opened 2 years ago
[N.B. conduit doesn't have quadratic performance. I made a mistake when setting up the benchmarks: https://github.com/tomjaguarpaw/streaming-benchmark/commit/2a965a790c23c759c70ac580b25c59fd22ae922c]
For what it's worth, this is mentioned in the tutorial: https://hackage.haskell.org/package/pipes-4.3.16/docs/Pipes-Tutorial.html#g:10
Here's an old ticket about it: https://github.com/Gabriel439/Haskell-Pipes-Library/issues/100
Thanks for pointing that out. My comments:
The tell-tale sign of the issue is the use of "List done wrong" (i.e. a monad action returning a list).
I don't agree with this (the same claim is made in the tutorial). There's no monadic action returning a list in my example.
conduit does not do this optimization either, but is currently the mostly highly used streaming library, so I don't think this flaw is as serious as you make it out to be.
This claim is no longer true. Michael Snoyman corrected me about it this morning.
I do not agree with the argument that just because an uninformed user can potentially write code with poor time complexity that it is a serious bug. If I did, then I would also conclude that we should abandon all linked lists for difference lists because uninformed Haskell beginners can (and do) write linked list code that gives quadratic time complexit
Personally I do think that we should abandon linked lists but not for difference lists, for Seq
, Vector
/Array
or for a streaming library as appropriate.
I don't think that the option of a Bind
constructor plus carefully written run
has been considered. Perhaps that would be considered less complex than Codensity
.
I don't have plans to fix this myself, but I'd accept a pull request if it didn't lead to a significant performance regression for other benchmarks
I don't have plans to fix this myself, but I'd accept a pull request if it didn't lead to a significant performance regression for other benchmarks
Which sort of change would you accept? Previously you said you would prefer not to CPS the internal representation. The alternative is to add a Bind
constructor plus carefully written run
(and perhaps carefully written other things).
Bind
+ carefully written run
, but I'd have to see the full change to be sure, since I'm not sure what other changes are entailed
Thanks Gabriella. To clarify, I don't have plans to fix this either but I think it's helpful to have the requirements spelled out like this in case anyone else wants to take it on.
The problem
Consider walking over a tree using a streaming library to do something at every leaf:
On left-skewed trees, i.e. trees that will end up generating left-associated binds, for example those generated by
leftSkewed
the performance of your streaming library is quadratic. In fact performance is quadratic under
(Neither streamly nor conduit have quadratic performance)
The plot below demonstrates the claim. The code to generate the plot is available at https://github.com/tomjaguarpaw/streaming-benchmark. The compiler was GHC 8.10.7.
The solution
Firstly, before talking about a solution, do you actually consider this a bug? I do, and I think the risk of hitting unexpected quadratic performance makes this library unsuitable for production use as it is. On the other hand maybe you have a good reason that I shouldn't think that. If so I would be interested to hear it.
If you do think this is a bug then let's think about what can be done. I know of two straightforward options:
Codensity
(streamly essentially uses handwrittenCodensity
.)In my benchmarks explicit bind comes out slightly ahead. (The examples are implemented to match
streaming
but the same techniques ought to apply to your library.)Perhaps there is another option more suitable for your library.
I welcome your thoughts.
(Related to https://github.com/Gabriel439/Haskell-Pipes-Library/issues/131)