Open zakgof opened 5 years ago
Memory allocation with MSVC is known to be slow, that's not really JavaCPP's fault. JNA and BridJ don't use C++ to allocate memory. We could allocate memory the same way for JavaCPP with, for example, Pointer.malloc(), cast it to SYSTEMTIME, and that should be faster. Could you give that a try?
The code below is indeed much faster:
Pointer raw = Pointer.malloc(systemTimeStructLength); // precalculated as systemTimeStructLength = new SYSTEMTIME().sizeof()
SYSTEMTIME systemtime = new SYSTEMTIME(raw);
windows.GetSystemTime(systemtime);
return systemtime.wSecond();
Now the question is: why not generate each struct's default constructor using Pointer.malloc instead?
We could, but it wouldn't be C++ :) I think Win32 doesn't throw C++ exceptions though, so we can probably speed this up with @NoException, like here:
https://github.com/bytedeco/javacpp-presets/blob/master/mkl/src/main/java/org/bytedeco/mkl/presets/mkl_rt.java#L56
Ah, no, we already have @NoException there. One other thing to be careful about on Windows: memory deallocation is excruciatingly slow when a lot of memory is allocated, so make sure to deallocate as soon as possible. In this case, the following deallocates right before the return:
try (SYSTEMTIME systemtime = new SYSTEMTIME()) {
    windows.GetSystemTime(systemtime);
    return systemtime.wSecond();
}
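To make the ordering concrete, here is a minimal sketch using only the JDK. FakeStruct is a hypothetical stand-in for a native struct wrapper such as SYSTEMTIME, not a JavaCPP class; it only records events so the sequence is visible. The point is that close() runs after the return value has been read but before control goes back to the caller, so nothing is left for the GC to clean up.

```java
import java.util.ArrayList;
import java.util.List;

final class TryWithResourcesOrder {
    // Records what happened, in order.
    static final List<String> events = new ArrayList<>();

    /** Hypothetical stand-in for a native struct wrapper; owns no real memory. */
    static final class FakeStruct implements AutoCloseable {
        int wSecond() {
            events.add("read");
            return 42; // placeholder value, not a real clock reading
        }

        @Override
        public void close() {
            // A real wrapper would free its native memory here.
            events.add("close");
        }
    }

    static int readSecond() {
        try (FakeStruct s = new FakeStruct()) {
            events.add("call");   // stands in for the native call
            return s.wSecond();   // value is computed first, then close() runs
        }
    }

    public static void main(String[] args) {
        int value = readSecond();
        events.add("returned");
        System.out.println(value + " " + events);
        // prints: 42 [call, read, close, returned]
    }
}
```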
SYSTEMTIME is a C struct (with no constructor), and I believe that library users would prefer a faster C-style allocation over a slower C++ one.
I'd suggest modifying the parser to generate something like this:
class SYSTEMTIME extends Pointer {
    private static final long STRUCT_SIZE = 16; // Calculated at generation time
    public static long sizeof() {
        return STRUCT_SIZE;
    }
    public static SYSTEMTIME malloc() {
        return new SYSTEMTIME(Pointer.malloc(STRUCT_SIZE));
    }
}
Before we start modifying everything just because MSVC allocation is slow, let's check how the try-with-resources version performs. It should work well enough.
Actually, no, C++ allocation isn't the bottleneck at all here. It's the deallocator registration that is slow. Pointer.malloc() doesn't register any deallocator, so that's why it's fast.
More than half the time seems to be spent by the garbage collector traversing the doubly-linked list of phantom references. If that's the case, there might not be much we can do about this other than simply not relying on the GC at all. The JDK itself uses doubly-linked lists for its own phantom references:
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/Cleaner.java
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java
BTW, JDK 11 seems to be a lot better at this than JDK 8. Make sure to upgrade your JDK!
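For readers unfamiliar with the pattern, the following is a rough sketch of what "registering a deallocator" typically involves, written with JDK classes only. DeallocatorRegistry and its Node class are hypothetical names made up for this illustration; this is not JavaCPP's actual code, nor the JDK's Cleaner. It shows the per-object bookkeeping in general terms: every registration allocates a PhantomReference and links it into a shared list under a lock, and each live reference is something the GC has to process. An explicit release unlinks the node and clears the reference, which is why freeing eagerly keeps the list short.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.ReferenceQueue;

/** Hypothetical deallocator registry; an illustration only. */
final class DeallocatorRegistry {
    /** One node per tracked object, linked into a shared doubly-linked list. */
    static final class Node extends PhantomReference<Object> {
        Node prev, next;
        boolean linked;
        final Runnable free; // releases the native memory

        Node(Object owner, ReferenceQueue<Object> queue, Runnable free) {
            super(owner, queue);
            this.free = free;
        }
    }

    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private Node head;
    private int size;

    /** Registration cost: one extra object plus a lock to link it in. */
    synchronized Node register(Object owner, Runnable free) {
        Node n = new Node(owner, queue, free);
        n.next = head;
        if (head != null) head.prev = n;
        head = n;
        n.linked = true;
        size++;
        return n;
    }

    /** Explicit release: unlink in O(1), clear the reference, free now. */
    synchronized boolean release(Node n) {
        if (!n.linked) return false; // already released
        if (n.prev != null) n.prev.next = n.next; else head = n.next;
        if (n.next != null) n.next.prev = n.prev;
        n.prev = n.next = null;
        n.linked = false;
        size--;
        n.clear();    // the GC no longer has to enqueue this reference
        n.free.run();
        return true;
    }

    /** Frees whatever the GC has enqueued for unreachable owners. */
    int drain() {
        int freed = 0;
        Node n;
        while ((n = (Node) queue.poll()) != null) {
            if (release(n)) freed++;
        }
        return freed;
    }

    synchronized int size() { return size; }
}
```

Skipping registration entirely, as Pointer.malloc() does, avoids all of this bookkeeping, at the price of having to free the memory by hand.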
FYI, starting with JavaCPP 1.5.6, we can now skip all that overhead and get very low latency by setting the "org.bytedeco.javacpp.nopointergc" system property to "true", see https://github.com/tensorflow/java/issues/313.
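As a minimal sketch of how that property could be set, it can be passed on the command line as -Dorg.bytedeco.javacpp.nopointergc=true or set programmatically as below. The NoPointerGcConfig class is a hypothetical helper written for this example. It is an assumption here that the property has to be set before any JavaCPP class is loaded, and that with GC-based cleanup disabled, memory has to be released explicitly (for example with try-with-resources).

```java
/** Hypothetical helper; only the property name comes from the thread above. */
final class NoPointerGcConfig {
    static final String KEY = "org.bytedeco.javacpp.nopointergc";

    /** Sets the property and returns its previous value (null if unset). */
    static String enable() {
        return System.setProperty(KEY, "true");
    }

    /** Reports whether the property is currently set to "true". */
    static boolean isEnabled() {
        return Boolean.parseBoolean(System.getProperty(KEY, "false"));
    }

    public static void main(String[] args) {
        // Assumption: call this first, before touching any JavaCPP class.
        enable();
        System.out.println(KEY + "=" + System.getProperty(KEY));
    }
}
```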
I ran a simple benchmark calling the Windows API function GetSystemTime using JavaCPP's built-in Windows API wrappers. This code allocates a struct, calls the native API, and fetches a field from the struct:
Profiling shows that the first line takes >90% of the overall execution time.
I believe that there is some room for optimization here. The same thing implemented with BridJ or JNR outperforms JNI+JavaCPP just because of faster allocation; see the benchmark at https://github.com/zakgof/java-native-benchmark.
For example, with BridJ, allocation takes <50% of the overall time: