fsprojects / FSharpx.Extras

Functional programming and other utilities from the original "fsharpx" project
https://fsprojects.github.io/FSharpx.Extras/
The Unlicense
683 stars 146 forks source link

List.length performance problem etc #208

Closed GregRos closed 11 years ago

GregRos commented 11 years ago

I've noticed that there are lots of cases in the code where List.length is called in data structures as though it were a trivial function.

The thing is, FSharpList does not store its length. Every call to List.length or List.Length (that is, the property) is O(n). When it appears in methods such as Equals the performance hit can be severe.

Because the length property of a list needs to be referenced in so many cases, I think we should find a way to create a list data structure that plays nicely with the stock List, but also as Length and other properties.

Besides this, I think it would be best if every data structure had an O(1) length property. This lack forces other developers to store the size of the data structure in external objects. It's also an intrinsic quality of a persistent data structure.

mausch commented 11 years ago

IIRC we did have a related discussion about this (can't seem to find it), where we decided to have non-O(1) operations as methods instead of properties (in particular, Length), to avoid confusion.

Storing the length in every data structure would be a space/speed tradeoff, and I'm not sure it's worth it. For example, in my experience most of the time you process lists recursively or using higher-order functions and you simply don't care about the list length. See also http://stackoverflow.com/questions/8197655/why-doesnt-the-scala-list-have-a-size-field . In F# you'd also have to give up the standard integrated list syntax, which is a pretty big deal for usability.

I also disagree about this being an intrinsic quality of persistent data structures, as you can't have O(1) length for lazy data structures as far as I know.

But I certainly agree that we should look into optimizing the data structures that call List.length, either by rethinking their operations to avoid calling it, or by using a different underlying data structure (e.g. an array).

jackfoxy commented 11 years ago

Agree with @mausch on the space/speed tradeoff. There are plenty of examples in experimental of doing it both ways. I was pretty meticulous about comparing measured performance in choosing the final configuration for the linear structures I finally implemented in Core.Collections (at least for the cases I measured, but there are many different cases I did not measure). I think I achieved O(1) or O(log n) for length without space trade-off (or excessive space trade-off) on most of them (but not LazyList, of course). But I'm happy to be shown where I erred.

Optimizing-out the use of List.length when list is integral to the internal structure will result in a different (maybe better) internal structure, or will impact performance somewhere else. No free lunches.

One great thing about OSS is when your use case stresses the code at a point the original author did not optimize for, you can optimize it for your use case.

GregRos commented 11 years ago

Let me put it like this. The operation of finding a list's length is used often enough in data structures that there needs to be a standard implementation of a list that stores its length for use as a building block. This could also be achieved by some clever use of the FSharpList perhaps.

There is no real space/time tradeoff here, and it's not a free lunch. The length of a list is an intrinsic quality of the data structure in the sense that the data structure is immutable. Its length is fixed from conception. By failing to cache it when it is necessary you are simply throwing your lunch away, or discarding information you have already acquired.

Storing the length of a list involves storing 4 bytes or so of additional data. Even for a list with a million elements, there will be absolutely no space impact even in overly strict scenarios, especially compared with other data associated with each node (e.g. references to objects that are significantly larger than 4 bytes and require maintenance by GC)

In contrast, if List.length is called with any frequency, especially in a data structure, this would degrade its performance to O(n) or higher, no matter what else it is doing. It would simply take up 99% of running time. This is exactly what I saw when I profiled some of my own data structure code that used List.length.

In languages where it is safely possible to memoize and associate data with objects in a transparent fashion, this really isn't a problem. In .NET however, the only fast and reliable way to associate data with objects involves per-instance fields, in some form or another.

EHotwagner commented 11 years ago

But wouldn't this make all operations that generate new lists (filter,fold...) slower, or would have to be reimplemented, even tough you most likely won't need the length. Personally i dont use List.length very often.

On Tue, Feb 12, 2013 at 9:09 AM, GregRos notifications@github.com wrote:

Let me put it like this. The operation of finding a list's length is used often enough in data structures that there needs to be a standard implementation of a list that stores its length for use as a building block.

There is no real space/time tradeoff here, and it's not a free lunch. The length of a list is an intrinsic quality of the data structure in the sense that the data structure is immutable. Its length is fixed from conception. By failing to cache it when it is necessary you are simply throwing your lunch away, or discarding information you have already acquired.

Storing the length of a list involves storing 4 bytes or so of additional data. Even for a list with a million elements, there will be absolutely no space impact even in overly strict scenarios, especially compared with other data associated with each node (e.g. references to objects that are significantly larger than 4 bytes and require maintenance by GC)

In contrast, if List.length is called with any frequency, especially in a data structure, this would degrade its performance to O(n) or higher, no matter what else it is doing. It would simply take up 99% of running time. This is exactly what I saw when I profiled some of my own data structure code that used List.length.

In languages where it is safely possible to memoize and associate data with objects in a transparent fashion, this really isn't a problem. In .NET however, the only fast and reliable way to associate data with objects involves per-instance fields, in some form or another.

Re-implementing list for use as a building block will have the mild inconvenience that it is no longer possible to use list comprehensions and the like. However, you can still make the syntax look fairly clean.

— Reply to this email directly or view it on GitHubhttps://github.com/fsharp/fsharpx/issues/208#issuecomment-13422008.

GregRos commented 11 years ago

I'm specifically talking about cases where the function is used frequently, and mainly in data structures where it can can have quadratic impact on performance. And also in cases when it's used in methods like Equals that are supposed to be trivial.

Here is a rough sketch of an idea I've had that might also be a nifty, general purpose feature. It really would require implementing all the list processing functions from scratch, but it might have enough use cases to merit it.

Basically it's a pattern for a data structure that can be used to cache arbitrary information, whether lazily or not, in a declerative way. For example, it can be used to cache the length of a list, its greatest element, metadata, etc. It can be easily extended and tailored to different use cases and all of these can share almost all functionality. https://gist.github.com/GregRos/4761246

It is similar in spirit to how some functional languages use monads and implicit memoization for the same purpose (e.g. if anyone is familiar, these appear in Okasaki's Purely Functional Data Structures).

jdh30 commented 11 years ago

List of the kind used in F# have been around for almost 40 years now. You’ll notice that Standard ML, OCaml and Haskell do not store the length in every node of the list. Idiomatically these lists are used for tiny collections, often empty ones. Things like function arguments in metaprograms. Not arbitrarily-long collections. And List.length is very rarely used in those languages.

F# has copied this (and lots of other stuff). If you have different requirements then I’d suggest using a different data structure. However, what you are doing is quite new in the grand scheme of things so I’d suggest thinking about a lot more than just O(1) length. For example, is the assumption that you’ll be doing a lot of maps, filters and folds? Maybe you want O(1) construction from a ResizeArray (e.g. to support fast filter)? Maybe you want the underlying representation to always be an array decorated with a bitvector conveying the presence or absence of each element. Maybe you want to store a “list” of value types as unboxed as possible? And so on…

FWIW, you can retrofit metadata onto an existing object in .NET using a weak hash table but I would not recommend that in this case.

Cheers,

Jon.

From: GregRos [mailto:notifications@github.com] Sent: 12 February 2013 08:10 To: fsharp/fsharpx Subject: Re: [fsharpx] List.length performance problem etc (#208)

Let me put it like this. The operation of finding a list's length is used often enough in data structures that there needs to be a standard implementation of a list that stores its length for use as a building block.

There is no real space/time tradeoff here, and it's not a free lunch. The length of a list is an intrinsic quality of the data structure in the sense that the data structure is immutable. Its length is fixed from conception. By failing to cache it when it is necessary you are simply throwing your lunch away, or discarding information you have already acquired.

Storing the length of a list involves storing 4 bytes or so of additional data. Even for a list with a million elements, there will be absolutely no space impact even in overly strict scenarios, especially compared with other data associated with each node (e.g. references to objects that are significantly larger than 4 bytes and require maintenance by GC)

In contrast, if List.length is called with any frequency, especially in a data structure, this would degrade its performance to O(n) or higher, no matter what else it is doing. It would simply take up 99% of running time. This is exactly what I saw when I profiled some of my own data structure code that used List.length.

In languages where it is safely possible to memoize and associate data with objects in a transparent fashion, this really isn't a problem. In .NET however, the only fast and reliable way to associate data with objects involves per-instance fields, in some form or another.

Re-implementing list for use as a building block will have the mild inconvenience that it is no longer possible to use list comprehensions and the like. However, you can still make the syntax look fairly clean.

— Reply to this email directly or view it on GitHub https://github.com/fsharp/fsharpx/issues/208#issuecomment-13422008 .

Image removed by sender.

EHotwagner commented 11 years ago

It might be also interesting to see this problem in context with zippers since then you already have a wrapping datastructure for additional information.

On Tue, Feb 12, 2013 at 12:39 PM, Jon Harrop notifications@github.comwrote:

List of the kind used in F# have been around for almost 40 years now. You’ll notice that Standard ML, OCaml and Haskell do not store the length in every node of the list. Idiomatically these lists are used for tiny collections, often empty ones. Things like function arguments in metaprograms. Not arbitrarily-long collections. And List.length is very rarely used in those languages.

F# has copied this (and lots of other stuff). If you have different requirements then I’d suggest using a different data structure. However, what you are doing is quite new in the grand scheme of things so I’d suggest thinking about a lot more than just O(1) length. For example, is the assumption that you’ll be doing a lot of maps, filters and folds? Maybe you want O(1) construction from a ResizeArray (e.g. to support fast filter)? Maybe you want the underlying representation to always be an array decorated with a bitvector conveying the presence or absence of each element. Maybe you want to store a “list” of value types as unboxed as possible? And so on…

FWIW, you can retrofit metadata onto an existing object in .NET using a weak hash table but I would not recommend that in this case.

Cheers,

Jon.

From: GregRos [mailto:notifications@github.com] Sent: 12 February 2013 08:10 To: fsharp/fsharpx Subject: Re: [fsharpx] List.length performance problem etc (#208)

Let me put it like this. The operation of finding a list's length is used often enough in data structures that there needs to be a standard implementation of a list that stores its length for use as a building block.

There is no real space/time tradeoff here, and it's not a free lunch. The length of a list is an intrinsic quality of the data structure in the sense that the data structure is immutable. Its length is fixed from conception. By failing to cache it when it is necessary you are simply throwing your lunch away, or discarding information you have already acquired.

Storing the length of a list involves storing 4 bytes or so of additional data. Even for a list with a million elements, there will be absolutely no space impact even in overly strict scenarios, especially compared with other data associated with each node (e.g. references to objects that are significantly larger than 4 bytes and require maintenance by GC)

In contrast, if List.length is called with any frequency, especially in a data structure, this would degrade its performance to O(n) or higher, no matter what else it is doing. It would simply take up 99% of running time. This is exactly what I saw when I profiled some of my own data structure code that used List.length.

In languages where it is safely possible to memoize and associate data with objects in a transparent fashion, this really isn't a problem. In .NET however, the only fast and reliable way to associate data with objects involves per-instance fields, in some form or another.

Re-implementing list for use as a building block will have the mild inconvenience that it is no longer possible to use list comprehensions and the like. However, you can still make the syntax look fairly clean.

— Reply to this email directly or view it on GitHub < https://github.com/fsharp/fsharpx/issues/208#issuecomment-13422008> .

Image removed by sender.

— Reply to this email directly or view it on GitHubhttps://github.com/fsharp/fsharpx/issues/208#issuecomment-13428767.

jdh30 commented 11 years ago

Ø IIRC we did have a related discussion about this (can't seem to find it), where we decided to have non-O(1) operations as methods instead of properties (in particular, Length), to avoid confusion.

But O(ln n) is practically the same as O(1)?

Cheers,

Jon.

From: Mauricio Scheffer [mailto:notifications@github.com] Sent: 12 February 2013 03:45 To: fsharp/fsharpx Subject: Re: [fsharpx] List.length performance problem etc (#208)

IIRC we did have a related discussion about this (can't seem to find it), where we decided to have non-O(1) operations as methods instead of properties (in particular, Length), to avoid confusion.

Storing the length in every data structure would be a space/speed tradeoff, and I'm not sure it's worth it. For example, in my experience most of the time you process lists recursively or using higher-order functions and you simply don't care about the list length. See also http://stackoverflow.com/questions/8197655/why-doesnt-the-scala-list-have-a-size-field . In F# you'd also have to give up the standard integrated list syntax, which is a pretty big deal for usability.

I also disagree about this being an intrinsic quality of persistent data structures, as you can't have O(1) length for lazy data structures as far as I know.

But I certainly agree that we should look into optimizing the data structures that call List.length, either by rethinking their operations to avoid calling it, or by using a different underlying data structure (e.g. an array).

— Reply to this email directly or view it on GitHub https://github.com/fsharp/fsharpx/issues/208#issuecomment-13416898 .

Image removed by sender.

GregRos commented 11 years ago

However, what you are doing is quite new in the grand scheme of things so I’d suggest thinking about a lot more than just O(1) length. For example, is the assumption that you’ll be doing a lot of maps, filters and folds? Maybe you want O(1) construction from a ResizeArray (e.g. to support fast filter)? Maybe you want the underlying representation to always be an array decorated with a bitvector conveying the presence or absence of each element. Maybe you want to store a “list” of value types as unboxed as possible? And so on…

I'm not working on assumptions though. It is a fact that many data structures that are currently in the repository need to know the length of a linked list in order to function (no pun intended). It is a structural requirement. In fact, it is difficult to find a data structure that uses lists (which many do) and does not have some problem related to finding their length.

Again, I'm talking about data structures currently part of FSharpx. I've actually compiled a list of offenders if you want to know the specifics.

Some of them report incorrect complexity under the false assumption that List.length is O(1) (this is the assumption of .NET in general; I also thought this and was rather surprised. I had to dig through a lot of hidden code to find the problem...). Others need to store the length of the list as part of their own state which is a structural problem, encapsulation problem, and simply makes the data structure a lot more complex than it ought to be. Since lists aren't even used as anything but building blocks, and since many data structures cache lengths anyway, there is never an issue of space complexity, however tiny that might be. The lengths are stored anyway, it's just messier.

Other functional languages have constructs that allow them to annotate objects or memoize operations in a way that is tantamount to caching. For example, finger trees can be priority queues, random access deques, or heaps depending on which additional structural information is stored in each node. Without this information they wouldn't be half as useful.

In fact, that was the inspiration for my solution above. It doesn't force you to cache length if you don't want to, and it allows you to cache other things easily. I'll try to develop it a bit more to demonstrate what I mean.

As for lazy data structures, the length operation can be improved substnatially beyond O(n) best case.

Again, I'm not saying it is the law that all users of F# must know the length of a linked list by heart, on pain of death. I'm not even suggesting that users be exposed to these lists. I'm just talking about their use as building blocks (or bootstraps) for larger data structures.

jackfoxy commented 11 years ago

@GregRos I suggest you pick a structure from Core.Collections you think this will have an impact on and create a structure in Collections.Experimental incorporating your ideas. Then we can run some comparisons.

GregRos commented 11 years ago

Here is a usage of inline functions, type inference, and static member constraints to create a module of generalized list processing functions that work on arbitrary list-like structures (basically, types that support some basic static and instance methods). I provide several differently annotated, custom list data structures to give a general idea of how it works. One is annotated with length, another with the sum of its integer elements, and a third one not annotated at all.

https://gist.github.com/GregRos/4962120

The cool thing is that, unlike when using polymorphism, this pattern allows one to call static methods and return strongly typed instances of the object you're using. The generalized List'.map function doesn't have a return type of AbstractList or something; it returns the same object you put in it.

Functions like List'.ofSeq create instances of objects without knowing their type. They work using type inference. One can write, for example,

let xs = List'.ofSeq {0 .. 100000} : MeasuredList<_>

The type coercion tells the compiler that xs is of the specified type.

Static member constraints aren't that often used that I've seen, but they can be extremely flexible. They also don't hurt performance (almost) at all, even when compared to F#'s list, which is more heavily optimized than I gave it credit for.

I'll give a demonstration of some data structures that can benefit from these differently annotated lists. Of course, the same problem can always be solved by somehow caching this information externally. It just becomes a lot cleaner when the list does it for you.

I'll also give a demonstration of the difference between using a custom enumerator object and using list and sequence comprehensions/iterators.

Unfortunately, this pattern cannot be completely transparent without cost, in terms of performance. To be completely transparent, the union cases of the list (which expose the annotation, as well as the list itself) need to be private. This would mean that an active pattern would need to be defined to use pattern matching capabilities.

Active patterns involve creation of hidden objects and perform a lot slower.

I think it would be interesting to see how this sort of pattern might be used with other data structures.

jackfoxy commented 11 years ago

@GregRos You should post your work to this discussion group as well https://groups.google.com/forum/?hl=en&fromgroups#!forum/pragmatic-functional-programming-research

Unfortunately my time constraints prevent me from taking a closer look at your work right now, but I encourage you to try implementing some experimental list-like structures and let's see how they compare to the counterparts in collections. I've been using https://github.com/jackfoxy/DS_Benchmark for benchmarking structures.

ghost commented 11 years ago

This is an interesting discussion, but it's probably better to close it as an active FSharpx issue - it's not actually a bug, is it?

ghost commented 11 years ago

Closing, reopen if necessary.