Closed jagill closed 1 year ago
How are you running your benchmarks? In general, the JVM should hoist those checks outside of the loop once the call to the method gets inlined.
I'm running the benchmarks with JMH, using code that is basically identical to BenchmarkJtsGeometrySerde. It's calling code that is a modification of JtsGeometrySerde; I'm happy to provide more details of the modifications if that's helpful.
Please do. I'd like to look at how the JVM is optimizing that code (inlining decisions, generated assembly, etc). The implementation of JtsGeometrySerde is doing a bunch of things that might be getting in the way.
You can see the immediate function call here. It's pretty simple, but perhaps something in the wider context is causing problems?
Thanks. Let me look into this.
Interesting, if I change the loop to:

double[] coordinateDoubles = new double[2 * count];
for (int i = 0; i < 2 * count; i++) {
    coordinateDoubles[i] = input.readDouble();
}

the bounds checks go away (presumably hoisted).
Edit: Although the benchmarks don't change. And using the new Slice functions gets the same benchmark improvements.
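To make the hoisting condition concrete, here's a minimal, runnable sketch of that loop shape using a plain java.nio.ByteBuffer as a stand-in for SliceInput (class and method names here are illustrative, not airlift's API). Because the loop bound 2 * count is loop-invariant and equals the array length, the JIT can prove every index is in range and hoist the array bounds check out of the loop:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class HoistDemo {
    // Stand-in for SliceInput.readDouble(): relative reads from a ByteBuffer.
    public static double[] readCoordinates(ByteBuffer input, int count) {
        double[] coordinateDoubles = new double[2 * count];
        // Loop bound 2 * count equals coordinateDoubles.length, so the JIT
        // can prove the store index is always in range and hoist the check.
        for (int i = 0; i < 2 * count; i++) {
            coordinateDoubles[i] = input.getDouble();
        }
        return coordinateDoubles;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(4 * Double.BYTES);
        for (double d : new double[] {1.0, 2.0, 3.0, 4.0}) {
            buffer.putDouble(d);
        }
        buffer.flip();
        double[] result = readCoordinates(buffer, 2);
        System.out.println(Arrays.toString(result));
    }
}
```

Note that this only addresses the array-store check; the buffer's own per-read check is a separate matter, which matches the observation that the benchmarks didn't change.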
I did some investigation. The code is being inlined properly, but the checks are too complicated for the VM to make inferences from, so it performs extra checks in the middle of the loop (when comparing a version that uses getDouble vs getDoubleUnchecked). The version that uses SliceInput is better at eliding some of the checks, but it keeps track of extra counters that have a larger impact than the extra checks.
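For illustration, here's a sketch of the structural difference between a per-element range check and a single up-front batch check, using a JDK ByteBuffer rather than airlift's Slice (all names and checks here are illustrative stand-ins, not the actual Slice implementation):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BatchCheckDemo {
    private final ByteBuffer data;

    BatchCheckDemo(byte[] bytes) {
        this.data = ByteBuffer.wrap(bytes);
    }

    // Per-element path: one range check on every call, roughly the shape
    // of a checked getDouble().
    double getDouble(int index) {
        if (index < 0 || index + Double.BYTES > data.capacity()) {
            throw new IndexOutOfBoundsException();
        }
        return data.getDouble(index);
    }

    // Batch path: a single up-front range check covering the whole read,
    // then plain element reads. This is the shape that lets the bulk
    // methods amortize the check cost.
    void getDoubles(int index, double[] destination, int offset, int length) {
        if (index < 0 || index + (long) length * Double.BYTES > data.capacity()) {
            throw new IndexOutOfBoundsException();
        }
        for (int i = 0; i < length; i++) {
            destination[offset + i] = data.getDouble(index + i * Double.BYTES);
        }
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(3 * Double.BYTES);
        buffer.putDouble(1.5).putDouble(2.5).putDouble(3.5);
        BatchCheckDemo demo = new BatchCheckDemo(buffer.array());
        double[] out = new double[3];
        demo.getDoubles(0, out, 0, 3);
        System.out.println(Arrays.toString(out));
        System.out.println(demo.getDouble(8));
    }
}
```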
CC: @pettyjamesm
Nice find and improvement, wouldn’t be surprised if other methods could benefit from a similar treatment
CC: @yingsu00
Ok, I've added the more flexible versions, and made the previous versions convenience wrappers over those. @martint, is this a direction you are interested in? And are you interested in extracting writeDoubles to SliceOutput and implementing it for each subclass?
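As a sketch of the wrapper arrangement described above (the backing store and signatures are illustrative, not the exact Slice API), the previous allocating method can simply delegate to the flexible one:

```java
import java.util.Arrays;

public class WrapperDemo {
    private final double[] data; // illustrative backing store

    WrapperDemo(double[] data) {
        this.data = data;
    }

    // Flexible version: the caller supplies the destination array and offsets,
    // so no allocation happens per call.
    void getDoubles(int index, double[] destination, int destinationIndex, int length) {
        System.arraycopy(data, index, destination, destinationIndex, length);
    }

    // Previous-style version, now a convenience wrapper over the flexible one.
    double[] getDoubles(int index, int length) {
        double[] result = new double[length];
        getDoubles(index, result, 0, length);
        return result;
    }

    public static void main(String[] args) {
        WrapperDemo demo = new WrapperDemo(new double[] {1.0, 2.0, 3.0, 4.0});
        System.out.println(Arrays.toString(demo.getDoubles(1, 2)));
    }
}
```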
Excellent. Yes, that's a reasonable approach. Regarding writeDoubles, yes, let's add that for symmetry. Are you planning to implement similar ones for other types?
@martint I've pulled addDoubles up to SliceOutput. I only implemented the double[] methods: I felt that until there's a use case for the other types, the other implementations would just be unused code. The pattern in this PR can be easily adapted to other primitive arrays, should the need arise.
@martint Anything else that needs to be addressed with this PR, or is it good to merge?
Is this still an ongoing concern? If so, please reopen.
BTW, the workaround I use for this is the following, which maps to a single unsafe memory copy (note the copy must honor the position and length arguments, not copy the whole source):

public void readDoubles(Slice source, int position, double[] destination, int destinationIndex, int length)
{
    Slice wrapper = Slices.wrappedDoubleArray(destination, destinationIndex, length);
    wrapper.setBytes(0, source, position, length * Double.BYTES);
}
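For comparison, a pure-JDK analogue of this workaround can be written with ByteBuffer's bulk view transfer, which likewise reduces to one memory copy per call. This is only a sketch of the same idea; the airlift version above goes through Slices.wrappedDoubleArray instead:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class BulkCopyDemo {
    // JDK analogue of the workaround: one bulk transfer instead of
    // `length` individual reads. (Requires Java 9+ for the covariant
    // ByteBuffer.position(int) return type.)
    static void readDoubles(ByteBuffer source, int position, double[] destination, int destinationIndex, int length) {
        source.duplicate()
                .position(position)
                .asDoubleBuffer()
                .get(destination, destinationIndex, length);
    }

    public static void main(String[] args) {
        ByteBuffer buffer = ByteBuffer.allocate(4 * Double.BYTES);
        buffer.putDouble(0.5).putDouble(1.5).putDouble(2.5).putDouble(3.5);
        double[] out = new double[2];
        // Read two doubles starting after the first one.
        readDoubles(buffer, Double.BYTES, out, 0, 2);
        System.out.println(Arrays.toString(out));
    }
}
```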
If multiple doubles are read from or written to a slice, bounds checking and allocations take up a significant fraction of CPU time. When reading or writing as a batch, the allocations and bounds checks happen only once.
Geometry serde often involves thousands (or millions) of doubles as coordinates. Using these methods can reduce the serde CPU cost by 50%-80% for complex geometries.
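As a sketch of the batch-write idea (using a JDK ByteBuffer as a stand-in for SliceOutput; method names are illustrative), the per-element and bulk paths look like this:

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class WriteBatchDemo {
    // Per-element writes: one range check per double.
    static void writeEach(ByteBuffer output, double[] coordinates) {
        for (double c : coordinates) {
            output.putDouble(c);
        }
    }

    // Batch write: one bulk transfer through a double view, mirroring the
    // writeDoubles() idea; advance the byte buffer past what the view wrote.
    static void writeBatch(ByteBuffer output, double[] coordinates) {
        output.asDoubleBuffer().put(coordinates);
        output.position(output.position() + coordinates.length * Double.BYTES);
    }

    public static void main(String[] args) {
        double[] coordinates = {1.0, 2.0, 3.0};
        ByteBuffer a = ByteBuffer.allocate(coordinates.length * Double.BYTES);
        ByteBuffer b = ByteBuffer.allocate(coordinates.length * Double.BYTES);
        writeEach(a, coordinates);
        writeBatch(b, coordinates);
        // Both paths must produce identical bytes and end positions.
        System.out.println(Arrays.equals(a.array(), b.array()));
        System.out.println(b.position());
    }
}
```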
In this PR, I've implemented Slice.setDoubles()/getDoubles() and SliceInput.readDoubles() (including the classes that implement SliceInput). I've only implemented DynamicSliceOutput.writeDoubles(), because I wanted to get feedback on this PR before going further.

Here's a flame graph just using readDouble repeated: [flame graph]

And here's one with the new code: [flame graph]
Additionally, the new code increases throughput benchmarks by about 2x.