Can PForUtil be further auto-vectorized? [LUCENE-9918]

apache / lucene

Apache Lucene open-source search software

https://lucene.apache.org/

Apache License 2.0

2.67k stars 1.04k forks source link

Can PForUtil be further auto-vectorized? [LUCENE-9918] #10957

Open asfimport opened 3 years ago

asfimport commented 3 years ago

While working on #10889, we discovered the loop in PForUtil::prefixSumOf is not getting auto-vectorized by the HotSpot compiler. We tried a few different tweaks to see if we could change this, but came up empty. There are some additional suggestions in the related PR that could still be experimented with, and it may be worth doing so to see if further improvements could be squeezed out.

Migrated from LUCENE-9918 by Greg Miller (@gsmiller), updated Aug 20 2021

asfimport commented 3 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

I've setup a microbenchmark project over here to help explore this more easily if anyone is interested. I'll probably mess around with this a bit, but don't let that stop you from working on it if interested :)

asfimport commented 3 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

I think my (fairly naive) question here is mainly why the "multiplication loop" in the below code isn't able to get vectorized. Both the array copy and the "addition loop" are getting vectorized, but not the "multiplication loop." (I've put the decompiled assembly that I believe is relevant in the README in the above-referenced benchmark project).

     protected void prefixSumOf(long[] longs, long base, long val) {
        System.arraycopy(IDENTITY_PLUS_ONE, 0, longs, 0, ForUtil.BLOCK_SIZE);
        for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
            longs[i] *= val;
        }
        for (int i = 0; i < ForUtil.BLOCK_SIZE; ++i) {
            longs[i] += base;
        }
    }

asfimport commented 3 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

I'll also mention that @rmuir has some thoughts over here on some other ideas to try if anyone is interested in poking around more.

asfimport commented 3 years ago

Gautam Worah (@gautamworah96) (migrated from JIRA)

Hey @gsmiller , I noticed that in the micro benchmark code in your lucene-pfor-benchmark [repo |#L15], the main loop runs 10 times I think?

Some sources suggest that usually the JIT compiler compiles and optimizes statements as and when it sees that a particular operation is repeated multiple times. So it first optimizes them a little and them some more iff it sees them again. So maybe we just need to repeat the experiment with say 100k iterations (and check if it is auto vectorized)?

asfimport commented 3 years ago

Greg Miller (@gsmiller) (migrated from JIRA)

@gautamworah96 could be worth trying. I wasn't aware of the concept of compiling/optimizing multiple times, but I'm no expert in the space. My understanding was that the decision to compile would happen once, and the code would get compiled. I was able to verify that the method under test is getting compiled by HotSpot, but you're suggesting that maybe with more iterations it would get compiled into something different? Wouldn't be too hard to try if you'd like to pull down that branch I was working against.

For reference, it's setup to run 10 warmup iterations before running the test for 10 iterations: https://github.com/gsmiller/lucene-pfor-benchmark/blob/main/src/main/java/gsmiller/DecodeBenchmark.java