Open joinr opened 4 years ago
Nice! This is really interesting! Are you interested in adding it via a PR? If not, do I have your permission to add it?
Feel free to add away, just reference me in the comment if you do please. Maybe it helps others find a common repository of optimized functions. You should maybe benchmark as well to make sure my results are consistent as claimed.
I'm compiling some stuff specifically for low level perf (really just regaining performance that's left on the floor in clojure.core, primarily due to overly polymorphic code and naive implementations) while optimizing a couple of different projects (and incorporating old lessons), like the various performance optimization against rust here and little libs like structural.
Another area that sucks, fyi, is the default clojure.core/memoize. Can be sped up significantly with a concurrent mutable hashmap, and if you have a discrete number of args, you can get it way faster with a specialized function that uses multiple hashmaps vs. the stock & args variadic version.
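For reference, a minimal sketch of the idea (not clj-fast's actual implementation): a known-arity memoize backed by a java.util.concurrent.ConcurrentHashMap instead of the atom-over-persistent-map that clojure.core/memoize uses. The name memoize-2 and the two-argument restriction are mine, for illustration.

```clojure
;; Hypothetical sketch: fixed-arity memoization over a ConcurrentHashMap,
;; skipping the `& args` variadic path entirely.
(defn memoize-2
  [f]
  (let [cache (java.util.concurrent.ConcurrentHashMap.)]
    (fn [a b]
      (let [k [a b]]
        (if-some [v (.get cache k)]
          v
          (let [v (f a b)]
            ;; nil results are recomputed every call (CHM can't store null);
            ;; fine for a sketch — wrap values in a sentinel if that matters.
            (when (some? v) (.putIfAbsent cache k v))
            v))))))
```

The discrete-arity trick mentioned above would go further still, e.g. one map per argument position, avoiding the vector-key allocation.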
Will do, expect it in the coming week :)
And now I realize where I recognized your name from. I was really surprised to see structural, like some form of convergent evolution in front of our eyes. Perhaps we should cooperate further regarding performance.
Finally, regarding memoize: yes, and every place where apply is used.
I'll also have a look at the other repo you linked, seems interesting. Thanks.
Brief update: looks like your suggestion is faster under certain conditions, if not all. I'm reworking the benchmarks ns to get higher-quality results and analysis, then it'll get into the next release.
Added to latest version :)
I just noticed in recent testing that the original code I submitted didn't invoke transient and rmerge! correctly for the 2-arity version. No problem though, because your implementation does! Good job :)
Just ran into the myopic realization that I assumed the input supports IKVReduce, without a fallback:
(defn rmerge! [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l
               (fn [^clojure.lang.ITransientAssociative acc k v]
                 (if-not (acc k)
                   (.assoc acc k v)
                   acc))
               r)
    (clojure.core.protocols/kv-reduce l
                                      (fn [^clojure.lang.ITransientAssociative acc k v]
                                        (if-not (acc k)
                                          (.assoc acc k v)
                                          acc))
                                      r)))
This version will do a (meager) instance check to see if we can use the fast path, and fall back to the protocol invocation (e.g. for user defined types) otherwise.
Good catch. I reopened the issue and will close it again once I add it. Thank you :)
Very curious to see tmerge apparently getting beaten by fast-merge, since they're very close implementation-wise (excepting the transient usage and the reversed order of input for map processing). I expected tmerge to dominate (as it did in my tests with an effectively similar implementation of fast-map-merge). Interesting.
From a cursory examination of your benchmarks, your tests are generating random maps via keyword? as a sample generator, unless I'm mistaken. If so, you would be guaranteeing unique keys across the merged maps. I believe this basically eliminates the performance advantage you'd see in rmerge!, and consequently tmerge, to an extent, since they are able to skip assoc-ing pre-existing keys by going in reverse order. Instead, there's a bit of overhead introduced by the reversing and the transient/persistent calls, particularly for smaller maps, where the stock left-to-right merge wouldn't have these. Perhaps another map sample generator would be informative, one where the merged maps share a subset of the same keys (perhaps a variable amount).
Could be that a better (more specific) name for the reverse-optimized merge operation would be collapse. Or perhaps reverse-merge, heh.
Caught another possible issue with your map generation code; dunno if it matters necessarily (it likely only matters for the int? case). When you generate random maps via randmap, the distribution over the integer key range will apparently sample duplicate keys, so your map size isn't constant. Not sure if this is a big problem in practice, or if you intend to have varying map sizes (this probably won't occur with the string and keyword cases, for example).
clj-fast.bench> (frequencies (map count (repeatedly 1000 #(randmap int? 1000) )))
{789 12,
765 17,
798 1,
779 28,
774 38,
769 26,
799 2,
756 11,
806 1,
770 42,
777 33,
788 9,
773 45,
752 3,
797 4,
763 20,
805 1,
743 1,
776 36,
755 15,
758 18,
759 23,
749 6,
781 18,
796 7,
785 20,
782 21,
792 6,
761 23,
794 4,
790 4,
784 15,
800 1,
739 1,
768 40,
778 27,
775 46,
760 19,
803 1,
783 23,
772 46,
757 13,
786 8,
787 14,
771 32,
751 7,
802 2,
793 4,
746 5,
753 9,
766 26,
795 4,
762 27,
747 3,
767 43,
736 1,
748 4,
750 4,
780 34,
754 11,
791 3,
744 3,
764 29}
The existing implementation of tmerge in clj-fast was inefficient. Removed superfluous instance checks and a let binding, which gets the performance down to fast-map-merge equivalence.
Added a test to the benchmarks with parameter-driven proportional key sharing between the maps (e.g. 10%, up to 100% of the key width shared across the random maps). Baseline profiling with 50% shared keys shows tmerge dominating as expected. Will submit a PR for analysis in a bit.
Lol, now getting counterintuitive results, where pruning existing keys should obviously dominate relative to the tradeoff of naively assoc-ing everything. Dissecting and profiling.
I also started reworking the map generation code. I don't really care about the maps being random except when testing merge, where your tmerge dominates when they're similar. Old logic with !, vs. new "dumb" logic:
(defonce -s "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
(defonce -strs (for [a -s b -s c -s d -s e -s] (str a b c d e)))
(defonce -kws (map keyword -strs))
(defonce -syms (map symbol -strs))
(defonce r (range))

(def gens
  {:keyword? (fn [n] (take n -kws))
   :map?     (fn [n] (drop 1 (gen/sample (s/gen map?) (inc n))))
   :string?  (fn [n] (take n -strs))
   :int?     (fn [n] (take n r))
   :integer? (fn [n] (take n r))
   :symbol?  (fn [n] (take n -syms))})

(defn genn!
  [n spec]
  (drop 1 (gen/sample (s/gen spec) (inc n))))

(defn genn
  [n p]
  ((gens p) n))

(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (into {} (genn! n (s/tuple p p)))))

(defn randmap
  #_([n]
     (randmap keyword? n))
  ([p n]
   (let [coll (genn n p)]
     (zipmap coll coll))))

(defonce mrandmap (memoize randmap))

(declare mrand-nested-map)

(defn rand-nested-map
  [p width depth]
  (if (= 1 depth)
    (mrandmap p width)
    (zipmap (genn width p)
            (repeat width (mrand-nested-map p width (dec depth))))))

(defonce mrand-nested-map (memoize rand-nested-map))

(def preds identity)

(def preds!
  {:int?     int?
   :keyword? keyword?
   :string?  string?
   :map?     map?
   :symbol?  symbol?})
Another option: instead of generating a bounded number, generate an unbounded sequence, then take n from dedupe of the collection.
@bsless Nice. For reference, here's what I did for the shared map case...
(defn shared-random-maps [proportion n p width]
  (let [shared-width (long (* proportion width))
        variable-width (- width shared-width)
        shared-map (randmap p shared-width)
        random-shared-map (fn []
                            (zipmap (keys shared-map)
                                    (genn shared-width p)))]
    (repeatedly n #(merge (randmap p variable-width) (random-shared-map)))))
(defn bench-merge-shared
  [proportion max-log-size nmaps]
  (vec
   (for [e (range 1 (inc max-log-size))
         n (range 1 (inc nmaps))
         pk *types*
         :let [width (int (Math/pow 10 e))
               p (preds pk)
               ms (shared-random-maps proportion n p width)]
         k [:merge :inline-merge :inline-fast-map-merge :inline-tmerge]
         :let [f (merge-fns k)
               _ (println 'BENCH k 'WIDTH 10 'e e '* n 'TYPE pk)
               res (apply f n ms)
               mn (->ns (mean res))
               ratio (int (/ mn n))]]
     {:bench k
      :mean mn
      :width e
      :proportion-shared proportion
      :type pk
      :ratio ratio
      :keys n
      :heap @max-memory
      :gc @gcs})))
I also refactored rmerge! a tad (the instance check does appear to help, so it's back in...):
(defn fast-assoc! [^clojure.lang.ITransientAssociative acc k v]
  (if-not (acc k)
    (.assoc acc k v)
    acc))

(defn rmerge!
  "Returns a transient map that consists of the first of the maps assoc-ed
  onto the second. If a key occurs in more than one map, the mapping from
  the latter (left-to-right) will be the mapping in the result."
  [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l fast-assoc! r)
    (p/kv-reduce l fast-assoc! r)))
I backed off when I started checking test cases to investigate specific counterintuitive ones. There shouldn't really be a situation (in my mind) where the maps have any similarity and tmerge doesn't wipe the floor with fast-map-merge, since you're cutting out multiple assocs and not having to do all those copies. I noticed some instances where this didn't hold (even when dominant, not dominant by as much as it should be):
clj-fast.bench> (let [[m1 m2 m3 m4] (take 4 (repeat (zipmap (range 100) (range 100))))]
(criterium.core/quick-bench (inline/tmerge m1 m2 m3 m4)))
Evaluation count : 44772 in 6 samples of 7462 calls.
Execution time mean : 13.524963 µs
Execution time std-deviation : 260.329780 ns
Execution time lower quantile : 13.239612 µs ( 2.5%)
Execution time upper quantile : 13.762569 µs (97.5%)
Overhead used : 1.824294 ns
nil
clj-fast.bench> (let [[m1 m2 m3 m4] (take 4 (repeat (zipmap (range 100) (range 100))))]
(criterium.core/quick-bench (inline/fast-map-merge m1 m2 m3 m4)))
Evaluation count : 36882 in 6 samples of 6147 calls.
Execution time mean : 15.978887 µs
Execution time std-deviation : 277.302647 ns
Execution time lower quantile : 15.615046 µs ( 2.5%)
Execution time upper quantile : 16.290658 µs (97.5%)
Overhead used : 1.824294 ns
nil
In this case, tmerge should be reducing over the hundred entries and doing 0 assocs. We're only 15% faster in total. It looks like the cost saving from looking up a single item to determine whether it's already present, and thereby avoiding the assoc, is ~27%. Perhaps the transient/persistent barrier is costing us (maybe going mutable and coercing to persistent would be faster, dunno).
There are probably a ton of subtleties we need to be aware of. A few come to mind:
- get is not uniform across types.
- get vs. find vs. contains?: using (acc k) as a predicate can fail if the value is false or nil (it shouldn't be, but we have no guarantee). Idiomatically we should use find, but it constructs a MapEntry pair, which leaves contains?, which is probably what we should use. It calls valAt and compares to a sentinel object.

Yes, those are all good observations. There's definitely a threshold where array maps dominate (very fast copy and lookup).
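To make the false/nil pitfall concrete (a toy example, not taken from the benchmarks):

```clojure
;; Truthiness-based presence checks misreport keys mapped to nil/false:
(let [m {:a nil :b false :c 1}]
  (m :a)            ;=> nil      — looks "absent"
  (m :b)            ;=> false    — looks "absent"
  (find m :a)       ;=> [:a nil] — correct, but allocates a MapEntry
  (contains? m :a)) ;=> true     — correct, no visible allocation
```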
Random observation: looks like keyword-keyed maps are about 85% faster on single-key lookup than numeric ones (maybe due to non-identity-based equality). Fascinating.
The contains? vs. function invocation point is apt. Another option is maybe entryAt. Since if you're doing contains? now, you also have to repeat the effort to look up the value, so 2x lookups. I think entryAt would mitigate that cost.
I'm also getting apparently stochastic JITing during the benchmarks. The first time I run them I get counterintuitive results; the second time, the dominance flips toward the expected. Very weird, since I figured the sampling from criterium drives a decent warmup period.
There are also some obvious bailout cases, like single-arity merge, where it doesn't make sense to transient at all (or reduce).
Fascinating, looks like entryAt is pretty performant too, despite allocating.
How exactly are you benching? At this point I mistrust anything that wasn't run with direct-linked uberjar and full bench (not quick, JIT not accurate enough). Maybe we can get the best of both worlds by mimicking contains ourselves by leveraging the not-found arity. Define a ns-global sentinel, write a closure over it which does identity based check if an object equals to it. Now you're doing one access.
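A sketch of what that might look like (assoc-if-absent! is a hypothetical name; valAt's not-found arity plays the role of the closed-over sentinel):

```clojure
;; Sentinel-based presence check: one lookup via the not-found arity,
;; then an identity comparison — nil/false-safe, no MapEntry allocation.
(def ^:private sentinel (Object.))

(defn assoc-if-absent!
  [^clojure.lang.ITransientAssociative acc k v]
  ;; ITransientAssociative extends ILookup, so valAt is available here
  (if (identical? sentinel (.valAt acc k sentinel))
    (.assoc acc k v)
    acc))
```

This could drop in where fast-assoc! is used in rmerge! above, trading the truthiness check for a single safe lookup.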
Just for reference, when I run the entire test suite I just leave it overnight. Takes a good while.
I've been trimming down to just the comparative runs for fast-map-merge and tmerge to get a better feedback loop. I also committed the sin of having the browser open (big no-no), so fine-grained measurements can be problematic.
Looks like there's a ~13% penalty for going through find to get that polymorphism.
> How exactly are you benching? At this point I mistrust anything that wasn't run with direct-linked uberjar and full bench (not quick, JIT not accurate enough). Maybe we can get the best of both worlds by mimicking contains ourselves by leveraging the not-found arity. Define a ns-global sentinel, write a closure over it which does identity based check if an object equals to it. Now you're doing one access.
Sorry, I missed this message. I'm going 2 different ways. First is using the bench function API from the REPL, just for the merge bench (and the new shared map one I pasted). I lifted the merge types into a dynvar so I can control which ones are run (I only care about the two right now). I'm also only doing the 10e1 case, but all 4 map calls.
Then, I'm doing similar focused measures using criterium on fixed maps instead of the random, to get a better sense of what's going on. I have the browser open, which from past experience is definitely a perf hog. So this is all "runtime" benchmarking, and subject to known problems like JIT finding a local optimum and other stuff.
Do you know of a way to profile the JIT itself on hotspot? I know some proprietary implementations come with tools which allow dumping the generated assembly and analyzing the compiled code paths.
Hotspot is magic as far as I'm concerned. I went down that rabbit hole during the ICFPC code-optimization stuff and learned how much I didn't know about inlining, profiling, and all manner of stuff. I wonder if the PGO stuff from Graal has any insight; you're supposed to be able to capture a dump of the profiling and use it with native-image to get fast native code, although it's enterprise-only.
Ever played with this tool? I have no experience with it but it was the first result on your ubiquitous search engine https://github.com/AdoptOpenJDK/jitwatch
I have not, but it looks useful.
One thing which already arises from the documentation, and gives me some cause for concern about my entire project, is that all the loop unrolling I'm doing will make methods larger and thus harder to inline. Wondering if I'm shooting myself in the foot with this.
I observed that with some stuff where I was going macro-heavy. Sometimes it's better to keep things in a function and let the JIT inline. Unfortunately, when this matters is not obvious.
As they say, small methods (functions, for us) are aggressively inlined, so your mileage may vary. OTOH, I don't have an idea of the performance tradeoff of inlining manually vs. leaving it to the JIT. Perhaps there are cases where manual inlining is the better strategy.
It doesn't get simpler.
Changed the generator to:
(defn genn!
  [n spec]
  (sequence
   (comp
    (drop 1)
    (distinct)
    (take n))
   (clojure.test.check.generators/sample-seq (s/gen spec))))

(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (into {} (genn! n (s/tuple p p)))))
Now I'm getting that tmerge is slower for small maps but dominates for large maps.
Coming up with a consistent performance model is surprisingly difficult. I got sidetracked with some work stuff and helping with perf in another library, but that's settled down. Going to look at this again with fresh eyes.
Now I changed the input generator for merge to:
(defn randmap!
  ([n]
   (randmap! keyword? n))
  ([p n]
   (let [s (genn! n p)]
     (zipmap s s))))
and tmerge is slower for that.
Can we draw any conclusions from that?
It would be nice to have a couple of reproducible cases to examine. I don't think into vs. zipmap should affect the results.
It's not into vs. zipmap, it's with (s/tuple p p) vs. without. Since distinct was seeing tuples, there was a chance that distinct tuples with equal keys went through it.
Pushed a branch with the changes to have a more convenient reference: https://github.com/bsless/clj-fast/blob/bench-map-generation-rework/doc/results.md
I think with the latest release I managed to squeeze even more performance out of merge. Two maps of one element now merge in ~25ns.
Another way to merge, one which is aware of the type conversions:
(defn merge-meh
  [m1 m2]
  (persistent!
   (reduce-kv
    assoc!
    (reduce-kv
     assoc!
     (transient clojure.lang.PersistentHashMap/EMPTY)
     m1)
    m2)))

(defn merge-ok
  [m1 m2]
  (persistent!
   (reduce-kv
    assoc!
    (transient m1)
    m2)))

(defn merge2
  [m1 m2]
  (let [c1 (count m1)]
    (if (> c1 8)
      (merge-ok m1 m2)
      (if (< 8 (unchecked-add c1 (count m2)))
        (merge-meh m1 m2)
        (merge-ok m1 m2)))))
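If I'm reading merge2's heuristic right, these inputs exercise each branch (sizes chosen by me for illustration):

```clojure
;; left map already past the 8-entry array-map limit: transient it directly
(merge2 (zipmap (range 9) (range 9)) {:a 1})             ; merge-ok

;; both small, result stays under 8 entries: transient the array-map as-is
(merge2 {:a 1} {:b 2})                                   ; merge-ok

;; each side small but the union crosses 8 entries: build into an empty
;; transient hash-map up front to dodge the mid-merge promotion
(merge2 {:a 1 :b 2 :c 3 :d 4 :e 5}
        {:f 6 :g 7 :h 8 :i 9 :j 10})                     ; merge-meh
```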
Turns out that when merging small maps (array maps) that need to become a hash map, the transition is quite costly in terms of performance:
(c/quick-bench (merge {:a 1 :b 6}
                      {:a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 8 :o 9 :l 10 :v 11}))
Evaluation count : 614460 in 6 samples of 102410 calls.
Execution time mean : 974.772023 ns
Execution time std-deviation : 2.971197 ns
Execution time lower quantile : 970.592882 ns ( 2.5%)
Execution time upper quantile : 977.600861 ns (97.5%)
Overhead used : 2.144573 ns
Whereas, if the left map used as the source is already a PersistentHashMap, merging into it goes quite a bit faster:
(c/quick-bench (merge {:a 1 :b 2 :c 3 :d 4 :e 5 :f 6 :g 8 :o 9 :l 10 :v 11}
                      {:a 1 :b 6}))
Evaluation count : 4681524 in 6 samples of 780254 calls.
Execution time mean : 138.013267 ns
Execution time std-deviation : 15.762978 ns
Execution time lower quantile : 126.502904 ns ( 2.5%)
Execution time upper quantile : 159.704613 ns (97.5%)
Overhead used : 2.144573 ns
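The transition being discussed is easy to see at the REPL: maps of up to 8 entries are PersistentArrayMaps, and growing past that promotes them to PersistentHashMaps, so a merge that crosses the boundary pays for the conversion.

```clojure
(type {:a 1 :b 2})                   ;=> clojure.lang.PersistentArrayMap
(type (zipmap (range 9) (range 9)))  ;=> clojure.lang.PersistentHashMap
```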
Taking some of the examples from this issue, I've written a merge that only accepts 2 maps; it uses the bigger map as the "source" and iterates over the smaller one, so in the case where you have a PersistentHashMap and a PersistentArrayMap it will use the hash map as the source. Still no way to avoid the slow-ish moment when 2 array maps create a hash map.
(defn- rassoc! [^clojure.lang.ITransientAssociative acc k v]
  (if-not (acc k)
    (.assoc acc k v)
    acc))

(defn- lassoc! [^clojure.lang.ITransientAssociative acc k v]
  (.assoc acc k v))

(defn- rmerge!
  [l r]
  (if (instance? clojure.lang.IKVReduce l)
    (.kvreduce ^clojure.lang.IKVReduce l rassoc! r)
    (clojure.core.protocols/kv-reduce l rassoc! r)))

(defn- lmerge!
  [l r]
  (if (instance? clojure.lang.IKVReduce r)
    (.kvreduce ^clojure.lang.IKVReduce r lassoc! l)
    (clojure.core.protocols/kv-reduce r lassoc! l)))

(defn merge!
  "Merges two maps, slightly faster than clojure.core/merge.
  It will always use the bigger map as the 'source', leaving the smaller map
  as the traversed one. It is especially efficient if the second map is
  a hash-map and the first is an array-map, as the first map won't be 'merged'
  into the second, avoiding the transition from array-map to hash-map.
  m2 will always override keys in m1, same as clojure.core/merge."
  [m1 m2]
  (with-meta
    (if (< (count m1) (count m2))
      (persistent! (rmerge! m1 (transient m2)))
      (persistent! (lmerge! (transient m1) m2)))
    (meta m1)))
edit: forgot the merge! fn, and better formatting
@bsless I read that reduce-kv is getting some love, btw, to be more general in Clojure 1.11. Might be the best option. It could also alter some of the type-specific benchmarks.
@bortex-io Very nice. Good observation about the type coercion costing us.
@joinr we found it out while teasing the implementation out on Slack, but I never see you there :) The only weakness I dislike wrt rassoc! is that it incurs a lookup cost. Ideally, Clojure would add a method to "update" a key with an arity-2 function, baked directly into the map implementation. You could then implement assoc with it to save on code duplication. We should also refine interesting test cases for map merges: 4x4, 4x5, 8x1, 1x8, etc.
I was independently working on something similar, albeit focused on improving clojure.core/merge performance. Watching Tommi's talk reminded me of the slowness (and I end up using it in a lot of legacy code on hot paths...). I tried just optimizing the persistent-map-based variant, which got me into the same ballpark you referenced (~30%). Going with transients and a useful heuristic bears better fruit though (~50%)...
I did some testing to eliminate sources of slowdown: