bsless / clj-fast

Unpredictably faster Clojure
Eclipse Public License 2.0
242 stars 1 forks source link

better memoize example #2

Closed joinr closed 4 years ago

joinr commented 4 years ago

This is an example (for the 2 arg version) of how discrete memoize can kick the crap out of clojure.core/memoize (which uses varargs - slow hashing - plus a hashmap - slow lookup).

If we use mutable containers in the background, and threadsafe ones, we get the same semantics but way better performance. If we eliminate inefficient hashing and use discrete args (many functions are like this), we get even better lookup than the naive clojure.core version.

I've used variants like this (I think I saw it demoe'd by Zach Tellman once) for 1-arg and 2-arg, but there's like a macro for defining arbitrary arities.

;;4x faster than clojure.core/memoize...
;;we can do better with a macro, but I haven't sussed it out.
;;This is a much as we probably need for now though, meh.
(defn memo-2 [f]
  (let [xs (java.util.concurrent.ConcurrentHashMap.)]
    (fn [x y]
      (if-let [^java.util.concurrent.ConcurrentHashMap ys (.get xs x)]
        (if-let [res (.get ys y)]
          res
          (let [res (f x y)]
            (do (.putIfAbsent ys y res)
                res)))
        (let [res     (f x y)
              ys    (doto (java.util.concurrent.ConcurrentHashMap.)
                      (.putIfAbsent y res))
              _     (.putIfAbsent xs x ys)]
          res)))))

I explored using tuples from clj-tuple for this purpose, to have a generalized variant with a simpler implementation. You get a drastic speedup over the stock clojure.core/memoize, since tuple hashing is typically pretty good. This (old) implementation uses a HashMap, where we'd probably prefer to use a concurrent map for threadsafety I guess (maybe it doesn't matter if the map gets clobbered by multiple writers though). The macro spork.util.general/memo-fn is a replacement for the (memo (fn [....])) idiom that's more efficient:

user>` (require 'spork.util.general)
nil
user> (ns spork.util.general)
nil
spork.util.general> (memo-fn [x y z] (+ x y z))
#function[spork.util.general/eval15554/memoized--14673--auto----15555]
spork.util.general> (def blah (memo-fn [x y z] (+ x y z)))
#'spork.util.general/blah
spork.util.general> (def blee (memoize (fn [x y z] (+ x y z))))
#'spork.util.general/blee
spork.util.general> (time (dotimes [i 1000000] (blah 1 2 3)))
"Elapsed time: 27.303857 msecs"
nil
spork.util.general> (time (dotimes [i 1000000] (blee 1 2 3)))
"Elapsed time: 207.389697 msecs"
nil
bsless commented 4 years ago

Thank you, tangling with memoize was a challenge on the table. I managed to hack together some implementation about 5 times faster than core.memoize but it suffers greatly from equality semantics, so if your arguments are scalars it's about 5 times faster than core.memoize, but if you have one collection it's 20% slower. Thoughts on how I can bypass that?

bsless commented 4 years ago

Success for a ConcurrentHashMap implementation as well! see here for atom version: https://github.com/bsless/clj-fast/blob/memoize/src/clj_fast/core.clj#L260 chm version: https://github.com/bsless/clj-fast/blob/memoize/src/clj_fast/core.clj#L351 (they look extremely similar as I just had to implement get-in and put-in for chm)

joinr commented 4 years ago

I would ditch the instance check in the chm get stuff (hurts perf, and you already know it's a map. I think you can organize the code to know when a chm will exist as a branch (will always have n args, so n - 1 chms). For single arity functions, is just use definline instead of going the metadata inline route (less to maintain).

Collection equality semantics are orthogonal to the map (unless you want to provide an api with custom hash fn or key, and use that). Clojure uses structural equality, which is composed of hashing collections recursively (once) and caching the result. Idiomatic clojure creates many small collections via short lived objects, so if you key off the coll, you end up recomputing hashes possibly many times based on inputs.

I haven't looked thoroughly, but seems you have a general pattern in the macro for arbitrary arg memoization. Good job

joinr commented 4 years ago

Ugh, but wrong button and accidentally closed.

bsless commented 4 years ago

I was debating the instance check with myself. Theoretically, while it does have some overhead, it's not terrible. Then the only question is if I want to protect against misuse, but when I think about the use case, it's impossible, as the chm will always be modified in the same depth, so there's no need for that. It is as a general purpose api, but that's not the problem space :smile: Thanks a lot for the feedback, ideas, recommendations and all :)

joinr commented 4 years ago

Sure, thanks for engaging in the exercise. One thing that pops to mind (not present in core memoize) would be something like core.cache, where a full chm is one example. It may be that evicting values (like lru cache) and maintaining constant space is useful. This would be an additional layer (like say user defined hash key) that core.memoize ignores.

There also may be used for primitive args, and primitive backed maps or other schemes.

joinr commented 4 years ago

I ran into the unhappy surprise that concurrenthashmap can't store nils as entries. hashmap can....hmmm.

joinr commented 4 years ago

For clojure functions (pure ones...) we're probably okay using java.util.HashMap even we have an occasional race condition - since the end result will be the same. I "think".

bsless commented 4 years ago

I don't think a race condition will be acceptable if we have side effects which we don't want to occur more than once... but that problem also occurs with a concurrenthashmap. And regarding the nils, I'm considering switching to a hashmap and putting it behind an atom. Either that or put sentinel objects instead of nils, which I also had to do when memoizing nullary functions.

joinr commented 4 years ago

Race condition I was thinking of would be the "write" variety, that is, the same value is written to the same spot in the map 2x. Not a big deal in my use case, but if the function you're memoizing has side effects, then yea, not a good thing. If you're depending on the function memoization to act as something like defonce then this makes sense; however, most functions we're memoizing don't follow this behavior (or probably shouldn't), but hey we're consenting adults.

I think wrapping the CHM in a version that uses a sentinel value for nil is probably the way to go. You'd need a little translation layer around it, but otherwise, should be feasible.

Hashmap behind an atom still incurs lookup cost (array hash map not so bad, but the trie variant over 10 or so objects is slower by about 4x I think).

bsless commented 4 years ago

Finally cleaned up the nil cases with sentinel values. Turns out the fastest option was nested CHM with sentinel value for every argument type but keyword: https://github.com/bsless/clj-fast/blob/master/doc/results.md#by-type-of-arguments

joinr commented 4 years ago

Interesting. The performance for memoiz-n is strange IMO. I haven't look at the implementation, but no idea why the map? series would be such an outlier. I haven't dug into the benchmarks though.

bsless commented 4 years ago

Because memoize-n uses a nested Clojure hash map, and hash/lookup of hashmap keys is terrible.