bsless / clj-fast

Unpredictably faster Clojure
Eclipse Public License 2.0

Faster fast-merge #1

Open joinr opened 4 years ago

joinr commented 4 years ago

I was independently working on something similar, albeit with a focus on improving clojure.core/merge performance. Watching Tommi's talk reminded me of the slowness (and I end up using merge in a lot of legacy code on hot paths...). Just optimizing the persistent-map-based variant got me into the same ballpark you referenced (~30%). Going with transients and a useful heuristic bears better fruit, though (~50%)...

I did some testing to eliminate sources of slowdown:

;; kv-reduce the left map into the (transient) right map, assoc-ing only
;; keys not already present, so entries from maps later in the merge win.
(defn rmerge! [^clojure.lang.IKVReduce l r]
  (.kvreduce l
             (fn [^clojure.lang.ITransientAssociative acc k v]
               (if-not (acc k)
                 (.assoc acc k v)
                 acc)) r))

;;~50% faster.
(defn fast-merge
  ([] {})
  ([m] m)
  ([m1 m2]          (rmerge m1 m2))
  ([m1 m2 m3]       (->> (transient m3) (rmerge! m2) (rmerge! m1) persistent!))
  ([m1 m2 m3 m4]    (->> (transient m4) (rmerge! m3) (rmerge! m2) (rmerge! m1) persistent!))
  ([m1 m2 m3 m4 m5] (->> (transient m5) (rmerge! m4) (rmerge! m3) (rmerge! m2) (rmerge! m1) persistent!))
  ([m1 m2 m3 m4 m5 & ms]
   ;; reduce right-to-left so that mappings from later maps win
   (let [rs (reverse ms)]
     (->> (reduce (fn [acc m] (rmerge! m acc)) (transient (first rs)) (rest rs))
          (rmerge! m5)
          (rmerge! m4)
          (rmerge! m3)
          (rmerge! m2)
          (rmerge! m1)
          persistent!))))
bsless commented 4 years ago

Nice! This is really interesting! Are you interested in adding it via a PR? If not, do I have your permission to add it?

joinr commented 4 years ago

Feel free to add away; just reference me in a comment if you do, please. Maybe it helps others find a common repository of optimized functions. You should probably benchmark as well, to make sure my results are as consistent as claimed.

I'm compiling some stuff specifically for low-level perf (really just regaining performance that's left on the floor in clojure.core, primarily due to overly polymorphic code and naive implementations) while optimizing a couple of different projects (and incorporating old lessons), like the various performance optimizations against Rust here, and little libs like structural.

Another area that sucks, fyi, is the default clojure.core/memoize. It can be sped up significantly with a concurrent mutable hashmap, and if you have a discrete number of args, you can get it much faster with a specialized function that uses multiple hashmaps instead of the stock & args variadic version.
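For illustration only, a minimal sketch of both ideas as I read them (not code from either of our libraries; memoize-chm and memoize-1 are names invented here):

(import 'java.util.concurrent.ConcurrentHashMap)
(import 'java.util.function.Function)

(defn memoize-chm
  "Like clojure.core/memoize, but backed by a ConcurrentHashMap.
  Caveat: ConcurrentHashMap rejects nil keys and nil values."
  [f]
  (let [cache   (ConcurrentHashMap.)
        compute (reify Function
                  (apply [_ args] (apply f args)))]
    (fn [& args]
      (.computeIfAbsent cache args compute))))

(defn memoize-1
  "Fixed-arity specialization: no varargs seq is ever allocated."
  [f]
  (let [cache   (ConcurrentHashMap.)
        compute (reify Function
                  (apply [_ x] (f x)))]
    (fn [x]
      (.computeIfAbsent cache x compute))))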

bsless commented 4 years ago

Will do, expect it in the coming week :)

And now I realize where I recognized your name from. I was really surprised to see structural, like some form of convergent evolution in front of our eyes. Perhaps we should cooperate further on performance.

Finally, regarding memoize: yes, and likewise every place where apply is used.

I'll also have a look at the other repo you linked, seems interesting. Thanks.

bsless commented 4 years ago

Brief update: looks like your suggestion is faster under certain conditions, if not all. I'm reworking the benchmarks ns to get higher-quality results and analysis, then it'll get into the next release.

bsless commented 4 years ago

Added to latest version :)

joinr commented 4 years ago

I just noticed in recent testing that the original code I submitted didn't invoke transient and rmerge! correctly for the 2-arity version. No problem though, because your implementation does! Good job :)

joinr commented 4 years ago

Just ran into the myopic realization that I assumed the input supports IKVReduce, with no fallback...

(defn rmerge! [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l (fn [^clojure.lang.ITransientAssociative acc k v]
                                           (if-not (acc k)
                                             (.assoc acc k v)
                                             acc)) r)
    (clojure.core.protocols/kv-reduce l (fn [^clojure.lang.ITransientAssociative acc k v]
                                          (if-not (acc k)
                                            (.assoc acc k v)
                                            acc)) r)))

This version does a (meager) instance check to see if we can use the fast path, and falls back to the protocol invocation (e.g. for user-defined types) otherwise.

bsless commented 4 years ago

Good catch. I reopened the issue and will close it again once I add it. Thank you :)

joinr commented 4 years ago

Very curious to see tmerge apparently getting beaten by fast-merge, since they're very close implementation-wise (excepting the transient usage and the reversed order of input-map processing). I expected tmerge to dominate (as it did in my tests with an effectively similar implementation of fast-map-merge). Interesting.

joinr commented 4 years ago

From a cursory examination of your benchmarks, your tests appear to generate random maps via keyword? as a sample generator (unless I'm mistaken). If so, you would be guaranteeing unique keys across the merged maps. I believe this basically eliminates the performance advantage you'd see in rmerge! (and consequently tmerge, to an extent), since those are able to skip assoc-ing pre-existing keys by going in reverse order. Instead, there's a bit of overhead introduced by the reversing and the transient/persistent calls, particularly for smaller maps, where the stock left-to-right merge has neither. Perhaps another map sample generator would be informative, one where the merged maps share a subset of keys (perhaps a variable amount).

Could be that a better (more specific) name for the reverse-optimized merge operation would be collapse. Or perhaps reverse-merge, heh.

joinr commented 4 years ago

Caught another possible issue with your map-generation code; dunno if it matters necessarily (likely only for the int? case).

When you generate random maps via randmap, the distribution over the integer key range will apparently double-sample keys, so your map size isn't constant. Not sure if this is a big problem in practice, or if you intend to have varying map sizes (it probably won't occur with the string and keyword cases, for example).

clj-fast.bench> (frequencies (map count (repeatedly 1000 #(randmap int? 1000))))
{789 12,
 765 17,
 798 1,
 779 28,
 774 38,
 769 26,
 799 2,
 756 11,
 806 1,
 770 42,
 777 33,
 788 9,
 773 45,
 752 3,
 797 4,
 763 20,
 805 1,
 743 1,
 776 36,
 755 15,
 758 18,
 759 23,
 749 6,
 781 18,
 796 7,
 785 20,
 782 21,
 792 6,
 761 23,
 794 4,
 790 4,
 784 15,
 800 1,
 739 1,
 768 40,
 778 27,
 775 46,
 760 19,
 803 1,
 783 23,
 772 46,
 757 13,
 786 8,
 787 14,
 771 32,
 751 7,
 802 2,
 793 4,
 746 5,
 753 9,
 766 26,
 795 4,
 762 27,
 747 3,
 767 43,
 736 1,
 748 4,
 750 4,
 780 34,
 754 11,
 791 3,
 744 3,
 764 29}
joinr commented 4 years ago

The existing implementation of tmerge in clj-fast was inefficient. Removing superfluous instance checks and a let binding gets its performance down to fast-map-merge equivalence.

joinr commented 4 years ago

Added a test to the benchmarks with parameter-driven, proportional map-key sharing (e.g. 10% up to 100% of the key width shared across the random maps). Baseline profiling with 50% shared keys shows tmerge dominating, as expected. Will submit a PR for the analysis in a bit.

joinr commented 4 years ago

Lol, now getting counterintuitive results in cases where pruning existing keys should obviously dominate the tradeoff of assoc-ing everything naively. Dissecting and profiling.

bsless commented 4 years ago

I also started reworking the map-generation code. I don't really care about the maps being random except when testing merge; there your tmerge dominates when the maps are similar. Old logic (the !-suffixed functions) vs. new "dumb" logic:

(defonce -s "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
(defonce -strs (for [a -s b -s c -s d -s e -s] (str a b c d e)))
(defonce -kws (map keyword -strs))
(defonce -syms (map symbol -strs))
(defonce r (range))

(def gens
  {:keyword? (fn [n] (take n -kws))
   :map? (fn [n] (drop 1 (gen/sample (s/gen map?) (inc n))))
   :string? (fn [n] (take n -strs))
   :int? (fn [n] (take n r))
   :integer? (fn [n] (take n r))
   :symbol? (fn [n] (take n -syms))})

(defn genn!
  [n spec]
  (drop 1 (gen/sample (s/gen spec) (inc n))))

(defn genn
  [n p]
  ((gens p) n))

(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (into {} (genn! n (s/tuple p p)))))

(defn randmap
  #_([n]
   (randmap keyword? n))
  ([p n]
   (let [coll (genn n p)]
     (zipmap coll coll))))

(defonce mrandmap (memoize randmap))

(declare mrand-nested-map)
(defn rand-nested-map
  [p width depth]
  (if (= 1 depth)
    (mrandmap p width)
    (zipmap (genn width p)
            (repeat width (mrand-nested-map p width (dec depth))))))

(defonce mrand-nested-map (memoize rand-nested-map))

(def preds identity)
(def preds!
  {:int? int?
   :keyword? keyword?
   :string? string?
   :map? map?
   :symbol? symbol?})
bsless commented 4 years ago

Another option: instead of generating a bounded number, generate an unbounded sequence, then take n from a dedupe of the collection.

joinr commented 4 years ago

@bsless Nice. For reference, here's what I did for the shared map case...

(defn shared-random-maps [proportion n p width]
  (let [shared-width      (long (* proportion width))
        variable-width    (- width shared-width)
        shared-map        (randmap p shared-width)
        random-shared-map (fn []
                            (zipmap (keys shared-map)
                                    (genn shared-width p)))]
    (repeatedly n #(merge (randmap p variable-width) (random-shared-map)))))

(defn bench-merge-shared
  [proportion max-log-size nmaps]
  (vec
   (for [e (range 1 (inc max-log-size))
         n (range 1 (inc nmaps))
         pk *types*
         :let [width (int (Math/pow 10 e))
               p (preds pk)
               ms (shared-random-maps proportion n p width)]
         k [:merge :inline-merge :inline-fast-map-merge :inline-tmerge]
         :let [f (merge-fns k)
               _ (println 'BENCH k 'WIDTH 10 'e e '* n 'TYPE pk)
               res (apply f n ms)
               mn (->ns (mean res))
               ratio (int (/ mn n))]]
     {:bench k
      :mean mn
      :width e
      :proportion-shared proportion
      :type pk
      :ratio ratio
      :keys n
      :heap @max-memory
      :gc @gcs})))

I also refactored rmerge! a tad (the instance check does appear to help, so it's back in):

(defn fast-assoc! [^clojure.lang.ITransientAssociative acc k v]
  (if-not (acc k)
    (.assoc acc k v)
    acc))

(defn rmerge!
  "Returns a transient map that consists of the second of the maps assoc-ed
  onto the first. If a key occurs in more than one map, the mapping from
  the latter (left-to-right) will be the mapping in the result."
  [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l fast-assoc! r)
    (p/kv-reduce l fast-assoc! r)))

I backed off when I started checking test cases, to investigate specific ones that were counterintuitive. There shouldn't really be a situation (in my mind) where the maps have any similarity and tmerge doesn't wipe the floor with fast-map-merge, since you're cutting out multiple assocs and not having to do all those copies. I noticed some instances where this did not hold (even when tmerge was dominant, it wasn't dominant by as much as it should be):

clj-fast.bench> (let [[m1 m2 m3 m4] (take 4 (repeat (zipmap (range 100) (range 100))))]
                  (criterium.core/quick-bench (inline/tmerge m1 m2 m3 m4)))
Evaluation count : 44772 in 6 samples of 7462 calls.
             Execution time mean : 13.524963 µs
    Execution time std-deviation : 260.329780 ns
   Execution time lower quantile : 13.239612 µs ( 2.5%)
   Execution time upper quantile : 13.762569 µs (97.5%)
                   Overhead used : 1.824294 ns
nil
clj-fast.bench> (let [[m1 m2 m3 m4] (take 4 (repeat (zipmap (range 100) (range 100))))]
                  (criterium.core/quick-bench (inline/fast-map-merge m1 m2 m3 m4)))
Evaluation count : 36882 in 6 samples of 6147 calls.
             Execution time mean : 15.978887 µs
    Execution time std-deviation : 277.302647 ns
   Execution time lower quantile : 15.615046 µs ( 2.5%)
   Execution time upper quantile : 16.290658 µs (97.5%)
                   Overhead used : 1.824294 ns
nil

In this case, tmerge should be reducing over the hundred entries and doing 0 assocs, yet we're only 15% faster in total. It looks like the cost saving from looking up a single item to determine whether it's already present, and so avoiding the assoc, is ~27%. Perhaps the transient/persistent barrier is costing us (maybe going mutable and coercing to persistent would be faster, dunno).
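As a strawman for that "go mutable" idea, something along these lines (an untested sketch, not part of clj-fast; mutable-merge is a name invented here):

(defn mutable-merge
  "Merge two maps through a mutable java.util.HashMap, then pour the
  result back into a persistent map at the end."
  [m1 m2]
  (let [hm (java.util.HashMap. ^java.util.Map m2)]    ; m2's entries win
    (reduce-kv (fn [^java.util.Map acc k v]
                 (.putIfAbsent acc k v)               ; add m1's keys only if absent
                 acc)
               hm m1)
    (into {} hm)))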

bsless commented 4 years ago

There are probably a ton of subtleties we need to be aware of. A few come to mind: the size threshold at which array-maps become hash-maps, and contains? vs. plain function invocation for checking key presence.

joinr commented 4 years ago

Yes, those are all good observations. There's definitely a threshold below which array-maps dominate (very fast copy and lookup).

Random observation: keyword-keyed maps look to be about 85% faster on single-key lookup than numeric ones (maybe due to numeric equality not being identity-based). Fascinating.

joinr commented 4 years ago

The contains? vs. function invocation point is apt. Another option might be entryAt.

joinr commented 4 years ago

If you're doing contains? now, you also have to repeat the effort to look up the value, so 2x lookups. I think entryAt would mitigate that cost.
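Something like this, I'd guess (a sketch assuming Clojure 1.9+, where transients implement clojure.lang.ITransientAssociative2; fast-entry-assoc! is a name invented here):

;; One entryAt call answers both "is the key present?" and "what is it?",
;; instead of paying for contains? followed by a second lookup. It also
;; treats stored nil/false values correctly, unlike (acc k).
(defn fast-entry-assoc!
  [^clojure.lang.ITransientAssociative2 acc k v]
  (if (.entryAt acc k)   ; nil iff k is absent: a single traversal
    acc
    (.assoc acc k v)))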

joinr commented 4 years ago

I'm also getting apparently stochastic JITting during the benchmarks. The first time I run them I get counterintuitive results; the second time, the dominance flips toward the expected. Very weird, since I'd figure the sampling from criterium drives a decent warmup period.

joinr commented 4 years ago

There are also some obvious bailout cases, like single-arity merge, where it doesn't make sense to use a transient (or reduce) at all.

joinr commented 4 years ago

Fascinating: looks like entryAt is pretty performant too, despite allocating.

bsless commented 4 years ago

How exactly are you benching? At this point I mistrust anything that wasn't run as a direct-linked uberjar with a full bench (not quick-bench; the JIT isn't accurate enough otherwise). Maybe we can get the best of both worlds by mimicking contains? ourselves, leveraging the not-found arity: define a ns-global sentinel and write a closure over it which does an identity-based check of whether an object equals it. Now you're doing one access.
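A minimal sketch of that sentinel trick (sentinel-assoc! is a name invented here):

(def ^:private sentinel (Object.))

;; One lookup via valAt's not-found arity; an identical? check against the
;; ns-global sentinel tells us whether the key was absent. Also handles
;; stored nil/false values correctly, unlike calling (acc k).
(defn sentinel-assoc!
  [^clojure.lang.ITransientAssociative acc k v]
  (if (identical? sentinel (.valAt acc k sentinel))
    (.assoc acc k v)   ; key was absent: keep the right-hand mapping
    acc))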

bsless commented 4 years ago

Just for reference, when I run the entire test suite I just leave it overnight. Takes a good while.

joinr commented 4 years ago

I've been trimming down to just the comparative runs for fast-map-merge and tmerge to get a better feedback loop. I also committed the sin of having the browser open (big no-no), so fine-grained measurements can be problematic.

joinr commented 4 years ago

Looks like there's a ~13% penalty for going through find to get that polymorphism.
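For context, the find-based variant being measured was presumably along these lines (a reconstruction; find works on transients since Clojure 1.9):

;; Polymorphic presence check through clojure.core/find: returns the map
;; entry or nil, but pays for the extra indirection.
(defn find-assoc! [^clojure.lang.ITransientAssociative acc k v]
  (if (find acc k)
    acc
    (.assoc acc k v)))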

joinr commented 4 years ago

> How exactly are you benching? At this point I mistrust anything that wasn't run as a direct-linked uberjar with a full bench (not quick-bench; the JIT isn't accurate enough otherwise). Maybe we can get the best of both worlds by mimicking contains? ourselves, leveraging the not-found arity: define a ns-global sentinel and write a closure over it which does an identity-based check of whether an object equals it. Now you're doing one access.

Sorry, I missed this message. I'm going two different ways. First, I'm using the bench function API from the REPL, just for the merge bench (and the new shared-map one I pasted). I lifted the merge types into a dynvar so I can control which ones are run (I only care about the two right now). I'm also only doing the 10e1 case, but all 4 map counts.

Then, I'm doing similarly focused measurements using criterium on fixed maps instead of random ones, to get a better sense of what's going on. I have the browser open, which from past experience is definitely a perf hog. So this is all "runtime" benchmarking, subject to known problems like the JIT finding a local optimum, among other things.

bsless commented 4 years ago

Do you know of a way to profile the JIT itself on HotSpot? I know some proprietary implementations come with tools that allow dumping the generated assembly and analyzing the compiled code paths.

joinr commented 4 years ago

HotSpot is magic as far as I'm concerned. I went down that rabbit hole during the ICFPC code-optimization stuff and learned how much I didn't know about inlining, profiling, and all manner of other things. I wonder if the PGO stuff from Graal has any insight; you're supposed to be able to capture a dump of the profiling and feed it to native-image to get fast native code, although that's enterprise-only.

bsless commented 4 years ago

Ever played with this tool? I have no experience with it, but it was the first result on your ubiquitous search engine: https://github.com/AdoptOpenJDK/jitwatch
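If it helps, my understanding is that JITWatch consumes HotSpot's compilation log, which can be enabled from a Leiningen project roughly like this (a sketch; PrintAssembly additionally requires the hsdis disassembler library on the JVM's library path):

:jvm-opts ["-XX:+UnlockDiagnosticVMOptions"
           "-XX:+TraceClassLoading"
           "-XX:+LogCompilation"   ; writes hotspot_pid<N>.log for JITWatch
           "-XX:+PrintInlining"]   ; prints inlining decisions to stdout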

joinr commented 4 years ago

I have not, but it looks useful.

bsless commented 4 years ago

One thing that already arises from the documentation, and gives me some cause for concern about my entire project, is that all the loop unrolling I'm doing will make methods larger and thus harder to inline. Wondering if I'm shooting myself in the foot with this.

joinr commented 4 years ago

I observed that with some stuff where I was going macro-heavy. Sometimes it's better to keep the code in a function and let the JIT inline it. Unfortunately, when this matters is not obvious.

joinr commented 4 years ago

As they say, small methods (functions, for us) are aggressively inlined, so your mileage may vary. OTOH, I don't have an idea of the performance tradeoff between manually inlining and deferring to the JIT. Perhaps there are cases where manual inlining is the better strategy.
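The relevant HotSpot thresholds can at least be inspected; the values below are typical defaults, not guarantees:

java -XX:+PrintFlagsFinal -version | grep InlineSize
  intx FreqInlineSize = 325    (max bytecode size inlined for hot calls)
  intx MaxInlineSize  = 35     (max bytecode size inlined for cold calls)

A heavily unrolled or macroexpanded method body can silently blow past these.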

bsless commented 4 years ago

It doesn't get simpler.

Changed the generator to:

(defn genn!
  [n spec]
  (sequence
   (comp
    (drop 1)
    (distinct)
    (take n))
   (clojure.test.check.generators/sample-seq (s/gen spec))))

(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (into {} (genn! n (s/tuple p p)))))

Now I'm getting that tmerge is slower for small maps but dominates for large maps.

joinr commented 4 years ago

Coming up with a consistent performance model is surprisingly difficult. I got sidetracked with some work stuff and with helping on perf in another library, but that's settled down. Going to look at this again with fresh eyes.

bsless commented 4 years ago

Now I changed the input generator for merge to

(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (let [s (genn! n p)]
     (zipmap s s))))

and tmerge is slower for that. Can we draw any conclusions from that?

joinr commented 4 years ago

It would be nice to have a couple of reproducible cases to examine. I don't think into vs. zipmap should affect the results.

bsless commented 4 years ago

It's not into vs. zipmap; it's with (s/tuple p p) vs. without. Since distinct was seeing tuples, distinct tuples with equal keys could get through it.
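For example, both tuples here survive distinct, but they collapse to a single key in the map:

(into {} (distinct [[:a 1] [:a 2]])) ;=> {:a 2}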

bsless commented 4 years ago

Pushed a branch with the changes to have a more convenient reference: https://github.com/bsless/clj-fast/blob/bench-map-generation-rework/doc/results.md

bsless commented 3 years ago

I think with the latest release I managed to squeeze even more performance out of merge. Two one-element maps now merge in ~25 ns.

bsless commented 2 years ago

Another way to merge, one which is aware of the type conversions:

(defn merge-meh
  [m1 m2]
  ;; pour both maps into a fresh transient hash map, so no array-map ->
  ;; hash-map promotion happens mid-merge
  (persistent!
   (reduce-kv
    assoc!
    (reduce-kv
     assoc!
     (transient clojure.lang.PersistentHashMap/EMPTY)
     m1)
    m2)))

(defn merge-ok
  [m1 m2]
  ;; assoc the second map straight into a transient of the first
  (persistent!
   (reduce-kv
    assoc!
    (transient m1)
    m2)))

(defn merge2
  [m1 m2]
  (let [c1 (count m1)]
    (if (> c1 8)
      ;; m1 is already past the array-map threshold, i.e. a hash map
      (merge-ok m1 m2)
      (if (< 8 (unchecked-add c1 (count m2)))
        ;; the result will outgrow an array map: build the hash map
        ;; directly rather than promoting mid-merge
        (merge-meh m1 m2)
        ;; small in, small out: stay on the array-map path
        (merge-ok m1 m2)))))
bortexz commented 2 years ago

Turns out that when merging small maps (array-maps) that need to become a hash-map, the transition is quite costly in terms of performance:

(c/quick-bench (merge {:a 1 :b 6} 
                      {:a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 8 :o 9 :l 10 :v 11}))

Evaluation count : 614460 in 6 samples of 102410 calls.
             Execution time mean : 974.772023 ns
    Execution time std-deviation : 2.971197 ns
   Execution time lower quantile : 970.592882 ns ( 2.5%)
   Execution time upper quantile : 977.600861 ns (97.5%)
                   Overhead used : 2.144573 ns

Whereas if the left map used as the source is already a PersistentHashMap, merging into it is quite a bit faster:

(c/quick-bench (merge {:a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 8 :o 9 :l 10 :v 11} 
                      {:a 1 :b 6}))

Evaluation count : 4681524 in 6 samples of 780254 calls.
             Execution time mean : 138.013267 ns
    Execution time std-deviation : 15.762978 ns
   Execution time lower quantile : 126.502904 ns ( 2.5%)
   Execution time upper quantile : 159.704613 ns (97.5%)
                   Overhead used : 2.144573 ns

Taking some of the examples from this issue, I've written a merge that only accepts 2 maps. It uses the bigger map as the "source" and iterates over the smaller one, so in the case where you have a PersistentHashMap and a PersistentArrayMap it will use the hash map as the source. There's still no way to avoid the slow-ish moment when two array maps must produce a hash map.


(defn- rassoc! [^clojure.lang.ITransientAssociative acc k v]
  (if-not (acc k)
    (.assoc acc k v)
    acc))

(defn- lassoc! [^clojure.lang.ITransientAssociative acc k v]
  (.assoc acc k v))

(defn- rmerge!
  [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l rassoc! r)
    (clojure.core.protocols/kv-reduce l rassoc! r)))

(defn- lmerge!
  [l r]
  (if (instance? clojure.lang.IKVReduce r)
    (.kvreduce ^clojure.lang.IKVReduce r lassoc! l)
    (clojure.core.protocols/kv-reduce r lassoc! l)))

(defn merge!
  "Merges two maps, slightly faster than clojure.core/merge.
   Always uses the bigger map as the 'source', traversing the smaller one.
   Especially efficient when the second map is a hash-map and the first is
   an array-map: the first map isn't 'merged' into the second, avoiding the
   transition from array-map to hash-map.
   m2 always overrides keys in m1, same as clojure.core/merge."
  [m1 m2]
  (with-meta
    (if (< (count m1) (count m2))
      (persistent! (rmerge! m1 (transient m2)))
      (persistent! (lmerge! (transient m1) m2)))
    (meta m1)))

edit: forgot the merge! fn, and better formatting
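For reference, the cutoff in play, which is an implementation detail of Clojure's map construction rather than a guaranteed contract (array maps hold up to 8 entries):

(type (zipmap (range 8) (range 8))) ;=> clojure.lang.PersistentArrayMap
(type (zipmap (range 9) (range 9))) ;=> clojure.lang.PersistentHashMap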

joinr commented 2 years ago

@bsless I read that reduce-kv is getting some love, btw, to become more general in Clojure 1.11. Might be the best option. It could also alter some of the type-specific benchmarks.

@bortex-io Very nice. Good observation about the type coercion costing us.

bsless commented 2 years ago

@joinr we found it out while teasing the implementation apart on Slack, but I never see you there :) The only weakness I dislike wrt rassoc! is that it incurs a lookup cost. Ideally, Clojure would add a method, baked directly into the map implementations, to "update" a key with an arity-2 function; you could then implement assoc with it to save on code duplication. We should also refine interesting test cases for map merges: 4x4, 4x5, 8x1, 1x8, etc.