uniqueHelp: accumulator based implementation, call stack size safe

elm-community / list-extra

Convenience functions for working with List.

http://package.elm-lang.org/packages/elm-community/list-extra/latest

MIT License

135 stars 59 forks source link

uniqueHelp: accumulator based implementation, call stack size safe #77

Closed andys8 closed 7 years ago

andys8 commented 7 years ago

Modified the uniqueHelp implementation to use an accumulator. This prevents the function to crash on very large lists. I added a test to make sure list order is preserved correctly.

Fixes #61

michaeljones commented 7 years ago

Hey, great to see this, I'm just trying to understand though... it is still recursive but it works because it is tail recursive and the elm compiler handles that in a manner that doesn't grow the stack?

andys8 commented 7 years ago

@michaeljones To be honest I'm not sure if tail call optimization works due to js or elm. https://lisplog.org/elm_converts_tail_calls_to_loops.html http://2ality.com/2015/06/tail-call-optimization.html

Update

Evan explains it's part of the elm compiler: https://evancz.gitbooks.io/functional-programming-in-elm/recursion/tail-call-elimination.html

Chadtech commented 7 years ago

Before I merge this, am I reading the benchmarks properly, that this is performing 30% better than the existing version? I ask because @andys8 and @pzp1997 each did bench marks and said it performed about as well, which seems to be underselling this change. Furthermore, in the issue related to this PR, the discussion revolved around the trade-off between speed and stack safety. But as I see it, this change is faster and safer, so there is no trade-off. I just want to make sure I understand whats at question. Am I on the same page?

EDIT: I see that the trade off discussion was in a different issue from the one revolving around uniqueBy, but with the same discussants on similar problems.

andys8 commented 7 years ago

@Chadtech This would be the benchmark comparison for this PR (tco, accumulator based) and the current implementation: https://ellie-app.com/3P2Jy8sVYhTa1/0

Performance would get be influenced depending on the test somewhere between +6.92% and -10.08%. But the main difference is this:

Chadtech commented 7 years ago

Ah okay, I see. So its about as performant over all, but 10% worse on small lists and 7% better on big lists.

If you dont mind I'll let this PR stay open for a while longer before merging it. It wont show up in the package until the next major release anyway. I wish I knew how normal it is to use uniqueBy on >100,000 item lists. If anyone could share any detailed use cases that would be much appreciated.

andys8 commented 7 years ago

I can try to make up samples. Like this one: https://ellie-app.com/3P3D8HPhGnna1/0

The usecase with 100.000 Ints is made up. With records (users, posts, comments for example) the stack size will exceed in a range of 1.000 - 10.000 items. I could also think of sensor data, geo locations, log messages and more.

Chadtech commented 7 years ago

Yeah sorry, when I say "use case", I mean a broader context of what you are doing. Could you share the code in your project? What is the project where you are getting this error? Without mentioning data types or computer errors, what are you trying to do?

In my mind the following are relevant considerations: 0 What if @andys8 's use case isnt normal, and the common case requires performance on lists of low length? Then avoiding the stack overflow is a less good option.

1 What if in @andys8 's project, it turns out that there is a different and arguably better way to do whatever it is he is trying to do? Then picking a side in this trade off is not necessary.

Both of these considerations can be explored by looking at how people actually try and use these functions. The simple fact that you can or did run into a certain kind of technical error isnt informative to how list-extra should be.

andys8 commented 7 years ago

Hey @Chadtech, I'm sorry, I can't provide detailed information because it's a customer project. We're handling geographical traces (lists of positions).

The points you mentioned are quite interesting. But is this PR a trade-off? The performance is round about the same and won't crash ;) But I've no problem if you decide to not merge the changes.