fsprojects / FSharpx.Extras

Functional programming and other utilities from the original "fsharpx" project
https://fsprojects.github.io/FSharpx.Extras/
The Unlicense
682 stars 146 forks source link

HashMap collection #200

Closed ovatsus closed 9 years ago

ovatsus commented 11 years ago

The builtin Map collection is implemented as a tree and that requires that the items are not only equatable but also comparable, and sometimes that's a problem, forcing us to use the mutable Dictionary<,> instead. A HashMap with exactly the same api of Map but without those requirements would be nice

jackfoxy commented 11 years ago

You could probably wrap a hashtable in something close to "exactly the same api", but making a hashtable immutable will have to overcome some technical challenges. Someone better acquainted with .NET internals may think of other problems, but the big one that comes to my mind is the hidden hashtable is a single object to the GC, and once it is > 85k it gets put on the large object heap http://msdn.microsoft.com/en-us/magazine/cc534993.aspx Constantly creating and GCing objects on the LOH, as immutable structures do, will most likely adversely affect your application.

Have you done a literature search to see if there are any bright ideas on implementing immutable hash tables?

ovatsus commented 11 years ago

Why would the hashtable be considered a single object to the GC? I'm no expert in data structures, but I was thinking that an hashtable implemented with an immutable vector instead of an array to hold the buckets would do the trick. Anyone knows how scala.collection.immutable.HashMap is implemented?

jackfoxy commented 11 years ago

Well, then you have thought this through. I was thinking of just a "off the shelf" hashtable, but if you construct your own from the Fsharpx.Collections.Vector, for instance, this might work out well.

BTW, I just added "append" to Vector last night, which might come in handy for your project, but I have not uploaded it yet. I could not implement as fast a version as I hoped for because I ran into this restriction, and did not research the reasons for it any further:

member internal this.EnsureEditable() =
    if !root.Thread = Thread.CurrentThread then () else
     if !root.Thread <> null then
        failwith "Transient used by non-owner thread"
     failwith "Transient used after persistent! call"

but now that it may be useful for both of us, perhaps @forki can shed some light on this restriction. I was trying to construct a new Transient from the first existing Persistent so I would only have to conj from the second Persistent to the Transient. Instead I have to conj from both Vectors to a Transient like this:

let append (vectorA : 'T Vector) (vectorB : 'T Vector) = 
let mutable ret = TransientVector()
    for i in 0..(vectorA.Length - 1) do
        ret <- ret.conj vectorA.[i]
    for i in 0..(vectorB.Length - 1) do
        ret <- ret.conj vectorB.[i]
    ret.persistent() 
forki commented 11 years ago

This code comes directly from Clojure.Core. IIRC the idea is a long as we are on the same thread and nobody else has read our object we can mutate it without any problems.

forki commented 11 years ago

See https://github.com/clojure/clojure/blob/master/src/jvm/clojure/lang/PersistentVector.java#L458

jackfoxy commented 11 years ago

That makes sense. My original idea was bad, but I can try something else for speeding up the first vector

jackfoxy commented 11 years ago

@ovatsus I like your idea of staying close to the Map api. I think map-like collections that are promoted to Core.Collections should adhere to the MS.FSharp.Collections.Map function names and api where it makes sense. When I can find the time I want to review Experimental.IntMap for name compliance. IntMap has a few other issues I want to clean-up before promoting it to Core.Collections. It has unnecessary function bindings and unsafe functions where the documentation says "you better know what you are doing". That's fine for the Experimental project, but I think not for Core.

GregRos commented 11 years ago

Why would the hashtable be considered a single object to the GC? I'm no expert in data structures, but I was thinking that an hashtable implemented with an immutable vector instead of an array to hold the buckets would do the trick. Anyone knows how scala.collection.immutable.HashMap is implemented?

It's implemented as a hash array mapped trie. Basically, imagine the FSharpx...Vector implementation except that instead of the index, we use the hash of the object (this is similar to Haskell's implementation, except that Haskell uses proper tree nodes rather than arrays).

I'm not entirely sure how a HAMT would perform against the IntMap implementation, which (correct me if I'm wrong) is a patricia trie. I think it would be quite a bit faster, and close to Dictionary<_,_> performance for certain things, if written well. But I could be mistaken, since it also has drawbacks.

I implemented a HAMT with the added bonus of path compression but I never finished it. I did finish a HAMT-based vector, though. If anyone is interested I can upload the code, or even try to implement it again. I also wrote some rather neat general hashing and bitwise manipulation utilities that might be useful, though they haven't been tested properly.

I was also toying with the idea of using a lazy array-based implementation for the trie nodes instead of regular arrays. The idea was that instead of copying arrays every time an update occurs, operations would be added to a queue. When the specific vector was queried in a way that forced the evaluation of the queued operations, it would perform the copy, and switch its internal state in a completely safe way (a specific way that doesn't use blocking but is innately safe).

I also have an entire library of bitwise operations lying around, so if that's useful I can upload that too.

jackfoxy commented 11 years ago

@GregRos Submit pull requests for any structures you think might be interesting to Collections.Experimental. Include test cases for Collections.Experimental.Tests. If you think any belong in Core.Collections https://github.com/fsharp/fsharpx/blob/master/src/FSharpx.Core/Collections/readme.md then by all means.

jdh30 commented 11 years ago

Ø I'm not entirely sure how a HAMT would perform against the IntMap implementation, which (correct me if I'm wrong) is a patricia trie. I think it would be quite a bit faster, and close to Dictionary<,> performance for certain things, if written well. But I could be mistaken, since it also has drawbacks.

HAMT can be significantly faster than IntMap in some cases but, in general, nothing comes close to Dictionary and HashSet.

Cheers,

Jon.

From: GregRos [mailto:notifications@github.com] Sent: 12 February 2013 01:40 To: fsharp/fsharpx Subject: Re: [fsharpx] HashMap collection (#200)

Why would the hashtable be considered a single object to the GC? I'm no expert in data structures, but I was thinking that an hashtable implemented with an immutable vector instead of an array to hold the buckets would do the trick. Anyone knows how scala.collection.immutable.HashMap is implemented?

It's implemented as a hash array mapped trie. Basically, imagine the FSharpx...Vector implementation except that instead of the index, we use the hash of the object (this is similar to Haskell's implementation, except that Haskell uses proper tree nodes rather than arrays).

I'm not entirely sure how a HAMT would perform against the IntMap implementation, which (correct me if I'm wrong) is a patricia trie. I think it would be quite a bit faster, and close to Dictionary<,> performance for certain things, if written well. But I could be mistaken, since it also has drawbacks.

I implemented a HAMT with the added bonus of path compression but I never finished it. If anyone is interested I can upload the code, or even try to implement it again. I also wrote some rather neat general hashing and bitwise manipulation utilities that might be useful, though they haven't been tested properly.

I was also toying with the idea of using a lazy array-based implementation for the trie nodes instead of regular arrays. The idea was that instead of copying arrays every time an update occurs, operations would be added to a queue. When the specific vector was queried in a way that forced the evaluation of the queued operations, it would perform the copy, and switch its internal state in a completely safe way (a specific way that doesn't use blocking but is innately safe).

I also have an entire library of bitwise operations lying around, so if that's useful I can upload that too.

— Reply to this email directly or view it on GitHub https://github.com/fsharp/fsharpx/issues/200#issuecomment-13414099 .

Image removed by sender.

fsgit commented 9 years ago

Closing as the FSharpx collections have moved to https://github.com/fsprojects/FSharpx.Collections/. Please reopen there if necessary,