bytedeco / javacpp

The missing bridge between Java and native C++

Struct memory allocation is slow #299

Open zakgof opened 5 years ago

zakgof commented 5 years ago

I ran a simple benchmark calling the Windows API's GetSystemTime using JavaCPP's built-in Windows API wrappers. The code allocates a struct, calls the native API, and reads a field from the struct:

        SYSTEMTIME systemtime = new SYSTEMTIME();
        windows.GetSystemTime(systemtime);
        return systemtime.wSecond();

Profiling shows that the first line (the allocation) takes >90% of the overall execution time.

I believe there is room for optimization here. The same thing implemented with BridJ or JNR outperforms JNI+JavaCPP purely because of faster allocation; see the benchmark at https://github.com/zakgof/java-native-benchmark.

With BridJ, for example, allocation takes <50% of the overall time.

saudet commented 5 years ago

Memory allocation with MSVC is known to be slow, that's not really JavaCPP's fault. JNA and BridJ don't use C++ to allocate memory. We could allocate memory the same way for JavaCPP with, for example, Pointer.malloc(), cast it to SYSTEMTIME, and that should be faster. Could you give that a try?

zakgof commented 5 years ago

The code below is indeed much faster:

        Pointer raw = Pointer.malloc(systemTimeStructLength); // precalculated as systemTimeStructLength = new SYSTEMTIME().sizeof()
        SYSTEMTIME systemtime = new SYSTEMTIME(raw);
        windows.GetSystemTime(systemtime);
        return systemtime.wSecond();


Now the question is: why not generate a struct's default constructor implementation with Pointer.malloc instead?

saudet commented 5 years ago

We could, but it wouldn't be C++ :) I think Win32 doesn't throw C++ exceptions though, so we can probably speed this up with a @NoException like here: https://github.com/bytedeco/javacpp-presets/blob/master/mkl/src/main/java/org/bytedeco/mkl/presets/mkl_rt.java#L56

saudet commented 5 years ago

Ah, no, we already have @NoException there. One other thing to be careful about on Windows: Memory deallocation is excruciatingly slow when a lot of memory is allocated, so make sure to deallocate as fast as possible. In this case, this will deallocate right away just before return:

try (SYSTEMTIME systemtime = new SYSTEMTIME()) {
    windows.GetSystemTime(systemtime);
    return systemtime.wSecond();
}
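
To illustrate the ordering that try-with-resources gives in the snippet above, here is a minimal sketch using only the JDK. `FakeSystemTime` is a hypothetical stand-in for the generated `SYSTEMTIME` class (it touches no native memory), and the fixed value 42 is made up for the demo. The point it shows is that the return expression is evaluated first, then `close()` runs, and only then does the method return.

```java
import java.util.ArrayList;
import java.util.List;

public class FakeSystemTime implements AutoCloseable {
    private final List<String> log;   // records the order of operations
    private boolean open = true;

    FakeSystemTime(List<String> log) {
        this.log = log;
    }

    // Hypothetical accessor standing in for SYSTEMTIME.wSecond().
    int wSecond() {
        if (!open) {
            throw new IllegalStateException("struct already deallocated");
        }
        log.add("read");
        return 42; // made-up value for the demo
    }

    // Stand-in for Pointer.close(), which would release the native memory.
    @Override
    public void close() {
        open = false;
        log.add("close");
    }

    // Same shape as the snippet above: the field is read, then close() runs,
    // then the value is returned to the caller.
    static int readSecond(List<String> log) {
        try (FakeSystemTime systemtime = new FakeSystemTime(log)) {
            return systemtime.wSecond();
        }
    }

    public static void main(String[] args) {
        List<String> log = new ArrayList<>();
        int second = readSecond(log);
        System.out.println(second + " " + log);
    }
}
```

Because the read happens before `close()`, returning a field value from inside the block is safe; returning the struct itself would hand the caller an already-closed object.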

zakgof commented 5 years ago

SYSTEMTIME is a C struct (with no constructor), and I believe that library users would prefer a faster implementation with C over a slower one with C++.

I'd suggest modifying the parser to generate something like:

class SYSTEMTIME extends Pointer {

    private static final long STRUCT_SIZE = 16; // Calculated at generation time

    public static long sizeof() {
        return STRUCT_SIZE;
    }

    public static SYSTEMTIME malloc() {
        return new SYSTEMTIME(Pointer.malloc(STRUCT_SIZE));
    }
}

saudet commented 5 years ago

Before we start modifying everything just because MSVC allocation is slow, let's check how the try-with-resources version performs. It should work well enough.

saudet commented 5 years ago

Actually, no, C++ allocation isn't the bottleneck at all here. It's the deallocator registration which is slow. Pointer.malloc() doesn't register any deallocator, so that's why it's fast.

saudet commented 5 years ago

More than half the time seems to be spent by the garbage collector walking the doubly-linked list of phantom references. If that's the case, there might not be much we can do about this other than simply not relying on the GC at all. The JDK itself uses doubly-linked lists for its own use of phantom references:

https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/Cleaner.java
https://github.com/openjdk/jdk/blob/master/src/java.base/share/classes/jdk/internal/ref/PhantomCleanable.java

BTW, JDK 11 seems to be a lot better at this than JDK 8. Make sure to upgrade your JDK!
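
For readers unfamiliar with the pattern, the following is a simplified sketch of a deallocator registry built on phantom references kept in a doubly-linked list. It uses only the JDK and is not JavaCPP's actual code: `DeallocatorRegistry`, `Node`, `register`, `release`, and `drain` are names invented for this illustration. Each allocation creates one `PhantomReference` and links it into the list, which is the per-object bookkeeping that `Pointer.malloc()` skips.

```java
import java.lang.ref.PhantomReference;
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;

public class DeallocatorRegistry {
    private final ReferenceQueue<Object> queue = new ReferenceQueue<>();
    private Node head;   // most recently registered node
    private int size;    // number of live registrations

    /** One list node per tracked object; hypothetical, for illustration only. */
    public static final class Node extends PhantomReference<Object> {
        private final DeallocatorRegistry registry;
        private final Runnable deallocator;
        private Node prev, next;
        private boolean released;

        Node(DeallocatorRegistry registry, Object owner, Runnable deallocator) {
            super(owner, registry.queue);
            this.registry = registry;
            this.deallocator = deallocator;
        }

        /** Unlinks the node and runs the deallocator at most once. */
        public void release() {
            synchronized (registry) {
                if (released) {
                    return;
                }
                released = true;
                if (prev != null) prev.next = next; else registry.head = next;
                if (next != null) next.prev = prev;
                prev = next = null;
                registry.size--;
            }
            clear();            // the GC no longer needs to enqueue this reference
            deallocator.run();
        }
    }

    /** Links a new phantom reference at the head of the list. */
    public synchronized Node register(Object owner, Runnable deallocator) {
        Node node = new Node(this, owner, deallocator);
        node.next = head;
        if (head != null) head.prev = node;
        head = node;
        size++;
        return node;
    }

    /** Releases every node whose owner the GC has already found unreachable. */
    public int drain() {
        int drained = 0;
        Reference<?> ref;
        while ((ref = queue.poll()) != null) {
            ((Node) ref).release();
            drained++;
        }
        return drained;
    }

    public synchronized int size() {
        return size;
    }

    public static void main(String[] args) {
        DeallocatorRegistry registry = new DeallocatorRegistry();
        Object owner = new Object();
        Node node = registry.register(owner, () -> System.out.println("native memory freed"));
        System.out.println("registered: " + registry.size());
        node.release(); // explicit release, as close() would do
        System.out.println("registered: " + registry.size());
    }
}
```

Releasing explicitly unlinks the node right away, so the list stays short and the collector has fewer phantom references to process; that is consistent with the advice earlier in the thread to deallocate as early as possible.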

saudet commented 3 years ago

FYI, starting with JavaCPP 1.5.6, we can now skip all that overhead and get very low latency by setting the "org.bytedeco.javacpp.nopointergc" system property to "true", see https://github.com/tensorflow/java/issues/313.
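
A minimal sketch of setting that property programmatically, using only the JDK. It assumes the property is read when JavaCPP's classes are first initialized, so it has to be set before any JavaCPP class is touched; passing `-Dorg.bytedeco.javacpp.nopointergc=true` on the `java` command line should be equivalent. `LowLatencyLauncher` and `disablePointerGc` are hypothetical names.

```java
public class LowLatencyLauncher {
    // Property name taken from the comment above.
    static final String NO_POINTER_GC = "org.bytedeco.javacpp.nopointergc";

    // Assumption: must run before the first JavaCPP class is loaded.
    static void disablePointerGc() {
        System.setProperty(NO_POINTER_GC, "true");
    }

    public static void main(String[] args) {
        disablePointerGc();
        System.out.println(NO_POINTER_GC + "=" + System.getProperty(NO_POINTER_GC));
        // Hypothetical: hand off to the real application entry point here.
    }
}
```

With garbage-collector-driven deallocation switched off, native memory presumably has to be released explicitly, for example with try-with-resources as shown earlier in the thread.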