Closed jagill closed 1 year ago
How are you running your benchmarks? In general, the JVM should hoist those checks outside of the loop once the call to the method gets inlined.
I'm running the benchmarks with JMH, using code that is basically identical to BenchmarkJtsGeometrySerde. It's calling code that is a modification of JtsGeometrySerde; I'm happy to provide more details of the modifications if that's helpful.
Please do. I'd like to look at how the JVM is optimizing that code (inlining decisions, generated assembly, etc). The implementation of JtsGeometrySerde is doing a bunch of things that might be getting in the way.
You can see the immediate function call here. It's pretty simple, but perhaps something in the wider context is causing problems?
Thanks. Let me look into this.
Interesting, if I change the loop to:

double[] coordinateDoubles = new double[2 * count];
for (int i = 0; i < 2 * count; i++) {
    coordinateDoubles[i] = input.readDouble();
}

the bounds checks go away (presumably hoisted).
Edit: Although the benchmarks don't change. And using the new Slice functions gets the same benchmark improvements.
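To make the hoisting condition concrete, here's a minimal, runnable sketch of that loop shape using a plain java.nio.ByteBuffer as a stand-in for SliceInput (class and method names here are illustrative, not airlift's API). Because the loop bound 2 * count is loop-invariant and equals the array length, the JIT can prove every index is in range and hoist the array bounds check out of the loop:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class HoistDemo {
    // Stand-in for SliceInput.readDouble(): relative reads from a ByteBuffer.
    public static double[] readCoordinates(ByteBuffer input, int count) {
        double[] coordinateDoubles = new double[2 * count];
        // Loop bound 2 * count equals coordinateDoubles.length, so the JIT
        // can prove the store index is always in range and hoist the check.
        for (int i = 0; i < 2 * count; i++) {
            coordinateDoubles[i] = input.getDouble();
        }
        return coordinateDoubles;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(4 * Double.BYTES);
        for (double d : new double[] {1.0, 2.0, 3.0, 4.0}) {
            buffer.putDouble(d);
        }
        buffer.flip();
        double[] result = readCoordinates(buffer, 2);
        System.out.println(Arrays.toString(result));
    }
}
```

Note that this only addresses the array-store check; the buffer's own per-read check is a separate matter, which matches the observation that the benchmarks didn't change.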
I did some investigation. The code is being inlined properly, but the checks are too complicated for the VM to make inferences from, so it performs extra checks in the middle of the loop (when comparing a version that uses getDouble vs getDoubleUnchecked). The version that uses SliceInput is better at eliding some of the checks, but it keeps track of extra counters that have a larger impact than the extra checks.
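For illustration, here's a sketch of the structural difference between a per-element range check and a single up-front batch check, using a JDK ByteBuffer rather than airlift's Slice (all names and checks here are illustrative stand-ins, not the actual Slice implementation):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BatchCheckDemo {
    private final ByteBuffer data;

    BatchCheckDemo(byte[] bytes) {
        this.data = ByteBuffer.wrap(bytes);
    }

    // Per-element path: one range check on every call, roughly the shape
    // of a checked getDouble().
    double getDouble(int index) {
        if (index < 0 || index + Double.BYTES > data.capacity()) {
            throw new IndexOutOfBoundsException();
        }
        return data.getDouble(index);
    }

    // Batch path: a single up-front range check covering the whole read,
    // then plain element reads. This is the shape that lets the bulk
    // methods amortize the check cost.
    void getDoubles(int index, double[] destination, int offset, int length) {
        if (index < 0 || index + (long) length * Double.BYTES > data.capacity()) {
            throw new IndexOutOfBoundsException();
        }
        for (int i = 0; i < length; i++) {
            destination[offset + i] = data.getDouble(index + i * Double.BYTES);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(3 * Double.BYTES);
        buffer.putDouble(1.5).putDouble(2.5).putDouble(3.5);
        BatchCheckDemo demo = new BatchCheckDemo(buffer.array());
        double[] out = new double[3];
        demo.getDoubles(0, out, 0, 3);
        System.out.println(Arrays.toString(out));
        System.out.println(demo.getDouble(8));
    }
}
```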
CC: @pettyjamesm
Nice find and improvement, wouldn’t be surprised if other methods could benefit from a similar treatment
CC: @yingsu00
Ok, I've added the more flexible versions, and made the previous versions convenience wrappers over those. @martint, is this a direction you are interested in? And are you interested in extracting writeDoubles to SliceOutput and implementing it for each subclass?
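As a sketch of the wrapper arrangement described above (the backing store and signatures are illustrative, not the exact Slice API), the previous allocating method can simply delegate to the flexible one:

```java
import java.util.Arrays;

public class WrapperDemo {
    private final double[] data; // illustrative backing store

    WrapperDemo(double[] data) {
        this.data = data;
    }

    // Flexible version: the caller supplies the destination array and offsets,
    // so no allocation happens per call.
    void getDoubles(int index, double[] destination, int destinationIndex, int length) {
        System.arraycopy(data, index, destination, destinationIndex, length);
    }

    // Previous-style version, now a convenience wrapper over the flexible one.
    double[] getDoubles(int index, int length) {
        double[] result = new double[length];
        getDoubles(index, result, 0, length);
        return result;
    }

    public static void main(String[] args) {
        WrapperDemo demo = new WrapperDemo(new double[] {1.0, 2.0, 3.0, 4.0});
        System.out.println(Arrays.toString(demo.getDoubles(1, 2)));
    }
}
```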
Excellent. Yes, that's a reasonable approach. Regarding writeDoubles, yes, let's add that for symmetry. Are you planning to implement similar ones for other types?
@martint I've pulled addDoubles up to SliceOutput. I only implemented the double[] methods: I felt that until there's a use case for the other types, the other implementations would just be unused code. The pattern in this PR can be easily adapted to other primitive arrays, should the need arise.
@martint Anything else that needs to be addressed with this PR, or is it good to merge?
Is this still an ongoing concern? If so, please reopen.
BTW, the workaround I use for this is the following, which maps to a single unsafe memory copy (note the copy must honor the position and length arguments, not copy the whole source):

public void readDoubles(Slice source, int position, double[] destination, int destinationIndex, int length)
{
    Slice wrapper = Slices.wrappedDoubleArray(destination, destinationIndex, length);
    wrapper.setBytes(0, source, position, length * Double.BYTES);
}
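For comparison, a pure-JDK analogue of this workaround can be written with ByteBuffer's bulk view transfer, which likewise reduces to one memory copy per call. This is only a sketch of the same idea; the airlift version above goes through Slices.wrappedDoubleArray instead:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BulkCopyDemo {
    // JDK analogue of the workaround: one bulk transfer instead of
    // `length` individual reads. (Requires Java 9+ for the covariant
    // ByteBuffer.position(int) return type.)
    static void readDoubles(ByteBuffer source, int position, double[] destination, int destinationIndex, int length) {
        source.duplicate()
                .position(position)
                .asDoubleBuffer()
                .get(destination, destinationIndex, length);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(4 * Double.BYTES);
        buffer.putDouble(0.5).putDouble(1.5).putDouble(2.5).putDouble(3.5);
        double[] out = new double[2];
        // Read two doubles starting after the first one.
        readDoubles(buffer, Double.BYTES, out, 0, 2);
        System.out.println(Arrays.toString(out));
    }
}
```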
If multiple doubles are read from or written to a slice, bounds checking and allocations take up a significant fraction of CPU time. When reading or writing as a batch, the allocations and bounds checks happen only once.
Geometry serde often involves thousands (or millions) of doubles as coordinates. Using these methods can reduce the serde CPU cost by 50%-80% for complex geometries.
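As a sketch of the batch-write idea (using a JDK ByteBuffer as a stand-in for SliceOutput; method names are illustrative), the per-element and bulk paths look like this:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class WriteBatchDemo {
    // Per-element writes: one range check per double.
    static void writeEach(ByteBuffer output, double[] coordinates) {
        for (double c : coordinates) {
            output.putDouble(c);
        }
    }

    // Batch write: one bulk transfer through a double view, mirroring the
    // writeDoubles() idea; advance the byte buffer past what the view wrote.
    static void writeBatch(ByteBuffer output, double[] coordinates) {
        output.asDoubleBuffer().put(coordinates);
        output.position(output.position() + coordinates.length * Double.BYTES);
    }

    public static void main(String[] args) {
        double[] coordinates = {1.0, 2.0, 3.0};
        ByteBuffer a = ByteBuffer.allocate(coordinates.length * Double.BYTES);
        ByteBuffer b = ByteBuffer.allocate(coordinates.length * Double.BYTES);
        writeEach(a, coordinates);
        writeBatch(b, coordinates);
        // Both paths must produce identical bytes and end positions.
        System.out.println(Arrays.equals(a.array(), b.array()));
        System.out.println(b.position());
    }
}
```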
In this PR, I've implemented Slice.setDoubles()/getDoubles() and SliceInput.readDoubles() (including the classes that implement SliceInput). I've only implemented DynamicSliceOutput.writeDoubles(), because I wanted to get feedback on this PR before going further.

Here's a flame graph just using readDouble repeated: [flame graph]

And here's one with the new code: [flame graph]
Additionally, the new code increases throughput benchmarks by about 2x.