make sure inlining is wrapped in a function that can be jit'd

joinr commented 3 years ago

It may not make a difference, but here is something I missed when doing some recent toy profiling at the REPL, comparing traversals of arrays and vectors:

When testing areduce over a primitive array against a call to reduce over a vector, the raw areduce expression ended up being confusingly slower than (or close to) the boxed HAMT vector reduction. This seemed very odd, since prior experience indicated primitive array traversal was blazingly fast. The timings for the raw expression come first; I then ensured that the macroexpansion from areduce happened inside a wrapper function, like traverse-arr

user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))  
"Elapsed time: 30.8967 msecs"
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 16.4168 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 16.7463 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 15.8058 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 18.5438 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 19.4693 msecs"
nil
user=> (time (dotimes [i 1] (areduce ys idx acc 0 (aget ^longs ys idx))))
"Elapsed time: 17.4582 msecs"

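(defn traverse-arr [^longs arr]
  ;; reduce over the arr parameter rather than the global ys;
  ;; the nil init keeps acc boxed, as in the decompiled output further down
  (areduce arr idx acc nil (aget ^longs arr idx)))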

and got my expected performance.

user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 30.0336 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 17.178 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 3.6118 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 10.2396 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 15.9891 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 4.4253 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 3.7816 msecs"
nil
user=> (time (dotimes [i 1] (traverse-arr ys)))
"Elapsed time: 3.9419 msecs"
nil

It seems the JIT is kicking in on the tiny function, but not the areduce call, which is a macro expansion into a loop/recur form. Very interesting. Might be worth a look to make sure the JIT isn't being restricted in the inlined forms as well (I haven't looked hard).
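A single `(dotimes [i 1] ...)` per timing leaves a lot of room for noise, so one way to probe the JIT hypothesis is to run many calls per round and compare rounds. The sketch below is only an illustration: `time-rounds` is a hypothetical helper (not part of clj-fast or clojure.core), and the contents of `ys` are assumed, since the thread does not show how it was built.

```clojure
;; Assumed test data: the thread does not show how ys was created.
(def ys (long-array (range 1000000)))

;; Same shape as the wrapper in the thread, but with a primitive 0 init
;; so the accumulator stays an unboxed long.
(defn traverse-arr [^longs arr]
  (areduce arr idx acc 0 (aget arr idx)))

;; Hypothetical helper: calls f `iters` times per round and returns
;; a vector with the elapsed milliseconds of each round.
(defn time-rounds
  [f rounds iters]
  (mapv (fn [_]
          (let [start (System/nanoTime)]
            (dotimes [_ iters] (f))
            (/ (- (System/nanoTime) start) 1e6)))
        (range rounds)))

;; Five rounds of 100 traversals each; later rounds are the ones
;; most likely to reflect JIT-compiled code.
(time-rounds #(traverse-arr ys) 5 100)
```

If the later rounds settle well below the first one, that is at least consistent with the JIT warming up on the small function. The numbers will vary by JVM and machine, so treat them as a rough indication only.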

bsless commented 3 years ago

For some reason it looks like the expressions aren't compiled to the same bytecode. Decompiled areduce:

public final class bench$fn__19404 extends AFunction
{
    public static final Var const__0;

    public static Object invokeStatic() {
        final Object a__6487__auto__19406 = bench$fn__19404.const__0.getRawRoot();
        final Object l__6488__auto__19407 = Reflector.invokeStaticMethod(RT.classForName("clojure.lang.RT"), "alength", new Object[] { a__6487__auto__19406 });
        long idx = 0L;
        long acc = 0L;
        while (Numbers.lt(idx, l__6488__auto__19407)) {
            final long n = RT.intCast(idx) + 1;
            acc = ((long[])bench$fn__19404.const__0.getRawRoot())[RT.intCast(idx)];
            idx = n;
        }
        return Numbers.num(acc);
    }

    @Override
    public Object invoke() {
        return invokeStatic();
    }

    static {
        const__0 = RT.var("clj-fast.bench", "numarr");
    }
}

Decompiled defn:

public final class bench$traverse_arr extends AFunction
{
    public static Object invokeStatic(final Object arr) {
        final Object a__6487__auto__19396 = arr;
        final int l__6488__auto__19397 = ((long[])a__6487__auto__19396).length;
        long idx = 0L;
        Object acc = null;
        while (idx < l__6488__auto__19397) {
            final long n = RT.intCast(idx) + 1;
            acc = Numbers.num(RT.aget((long[])arr, RT.intCast(idx)));
            idx = n;
        }
        return acc;
    }

    @Override
    public Object invoke(final Object arr) {
        return invokeStatic(arr);
    }
}

I think there's a slight risk in testing at the REPL.

joinr commented 3 years ago

I didn't think to decompile. Definitely different implementations: the REPL version goes through clojure.lang.Numbers (Numbers.lt instead of a primitive <), and even has a really roundabout lookup for the long array (holding the var in const__0 and calling getRawRoot on it inside the loop). Very curious. I wonder why this doesn't happen in the defn...
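The decompiled REPL form shows `alength` being called through `Reflector`, which suggests the `^longs` hint on the `aget` call alone does not reach the `alength` call that areduce emits. A minimal sketch of one way to check and avoid that, assuming a small `ys` array (its real contents are not shown in the thread), is to turn on reflection warnings and bind the array to a hinted local before calling areduce:

```clojure
;; Ask the compiler to report reflective calls.
(set! *warn-on-reflection* true)

;; Assumed test data: the thread does not show how ys was created.
(def ys (long-array (range 1000)))

;; As in the thread: the hint sits only on the aget call, so the
;; alength call generated by areduce sees an untyped value.
(areduce ys idx acc 0 (aget ^longs ys idx))

;; Hinted local: the array is read from the var once, and both
;; alength and aget can see that it is a long array.
(let [^longs arr ys]
  (areduce arr idx acc 0 (aget arr idx)))
```

Both forms return the last element of the array. The second should compile to something closer to the defn version, though that is an expectation to confirm by decompiling rather than a measured result.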

bsless commented 3 years ago

Probably the compiler handles the defn differently than it does a form evaluated dynamically at the REPL.

bsless commented 3 years ago

Given what I've found here I'm closing this issue. Feel free to reopen it whenever you think is appropriate or with new findings.

bsless / clj-fast

make sure inlining is wrapped in a function that can be jit'd #16