Avoid copying in mdb_cursor_get if possible

batterseapower commented 10 years ago

The funny thing is that even with this change using mdb_cursor_get to do a sequential scan of an entire DB (holding 4 byte keys and values) still costs about 150ns per item.

In comparison my C code doing the same thing costs only 17ns per item -- i.e. using Java costs more than 130ns per item more! In comparison, using mdb_get to implement a sequential scan costs about 720ns per item in Java but 640ns per item in C, a slowdown of only about 80ns.

The number of MDB_val copies being done per iteration is the same in both cases (i.e. 1 in and 1 out for mdb_get and 2 out for mdb_cursor_get), so I'm not sure why Java is being penalised so much more for the mdb_cursor_get case than it is when doing a mdb_get.

batterseapower commented 10 years ago

This JMH benchmark kernel takes only about 26ns per iteration:

    @Benchmark
    @Threads(1)
    @BenchmarkMode(Mode.AverageTime)
    public int testZeroCopy() {
        final int result;
        if (rc == JNI.MDB_NOTFOUND) {
            rc = JNI.mdb_cursor_get_so_raw_it_hurts(cursor.pointer(), bufferPtr, bufferPtr + 2 * Unsafe.ADDRESS_SIZE, JNI.MDB_FIRST);
            result = 0;
        } else {
            Util.checkErrorCode(rc);
            if (unsafe.getAddress(bufferPtr + 2 * Unsafe.ADDRESS_SIZE) != 4) {
                throw new IllegalStateException("Value wrong size");
            }
            result = unsafe.getInt(unsafe.getAddress(bufferPtr + 3 * Unsafe.ADDRESS_SIZE));
            rc = JNI.mdb_cursor_get_so_raw_it_hurts(cursor.pointer(), bufferPtr, bufferPtr + 2 * Unsafe.ADDRESS_SIZE, JNI.MDB_NEXT);
        }

        return result;
    }

(Note that mdb_cursor_get_so_raw_it_hurts is just a wrapped version of mdb_cursor_get that takes all "long" arguments rather than structs. Consequently all memory management is done explicitly.)

In contrast this takes about 158ns per op:

    @Benchmark
    @Threads(1)
    @BenchmarkMode(Mode.AverageTime)
    public int testSmartCopyIn() {
        final int result;
        if (rc == JNI.MDB_NOTFOUND) {
            rc = JNI.mdb_cursor_get(cursor.pointer(), keyVal, valueVal, JNI.MDB_FIRST);
            result = 0;
        } else {
            Util.checkErrorCode(rc);
            if (valueVal.mv_size != 4) {
                throw new IllegalStateException("Value wrong size");
            }
            result = unsafe.getInt(valueVal.mv_data);
            rc = JNI.mdb_cursor_get(cursor.pointer(), keyVal, valueVal, JNI.MDB_NEXT);
        }

        return result;
    }

This kind of mysterious, and seems to point to (*env)->SetLongField being extraordinarily slow?

At the moment the perf penalty for doing the cursoring in Java is kind of ridiculous! 26ns per iteration is just about bearable given that C takes about 17-19ns, but 156ns is just silly...

phraktle commented 9 years ago

Shouldn't this pull request contain this commit? https://github.com/batterseapower/lmdbjni/commit/c579f875bc8a78e8486b30734592b975a93f321e

krisskross commented 9 years ago

Yes, you're right. The commit have been merged. Thanks!

krisskross commented 9 years ago

I have created a test branch that illustrate how to use mdb_cursor_get_address which I think is similar to testZeroCopy mentioned by @batterseapower.

https://github.com/deephacks/lmdbjni/tree/zero-copy-test

It would be interesting to explore this approach further and see how much closer lmdbjni can come to native C performance.

phraktle commented 9 years ago

Interesting direction, zero copy would sure be nice. Happy to benchmark it, if you could expose it on Cursor.

Would it be feasible to return direct ByteBuffers from C, instead of using Unsafe? See http://docs.oracle.com/javase/8/docs/technotes/guides/jni/jni-14.html#NewDirectByteBuffer

krisskross commented 9 years ago

I tried NewDirectByteBuffer earlier without hawtjni, similar to the approach described here: https://groups.google.com/forum/#!msg/mechanical-sympathy/MCRzv7UGT4w/GQWA8EWg_6oJ

However, this ended up slower than lmdbjni and I think the main reason was that NewDirectByteBuffer is quite slow. I think Unsafe have more potential of being faster, as indicated by the tests made by @batterseapower.

Of course the user should not need to do unsafe pointer chasing, but maybe Cursor could provide or consume (reusable) buffers that wrap around memory keys and values? Something similar to this class below.

https://github.com/real-logic/simple-binary-encoding/blob/master/main/java/uk/co/real_logic/sbe/codec/java/DirectBuffer.java

krisskross commented 9 years ago

Similarly, I think mdb_put with MDB_RESERVE could utilize such buffers to save a memcpy and reduce the amount of garbage produced.

phraktle commented 9 years ago

Strange, NewDirectByteBuffer should be plenty fast – definitely much faster than copying byte arrays around. Under the hood it's also using Unsafe to access the underlying pointer. Are you sure the slowness was not due to something else? Is the code available?

In any case, I expect using Unsafe directly should yield pretty much the same results as a direct buffer.

I'm generally more concerned with improving read performance, MDB_RESERVE raises other complications too.

phraktle commented 9 years ago

I have identified a large bottleneck in the current copying approach, that's easy to fix:

In Value.java, replace:

JNI.buffer_copy(mv_data, 0, rc, 0, rc.length);

with

Unsafe.getBytes(mv_data, 0, rc);

This gives me a 2x improvement in iteration speed. So it's safe to say that crossing the JNI boundary again for invoking the buffer copy code is very large overhead. As an interim step (before full zero-copy), this might be a good step forward.

krisskross commented 9 years ago

Definitely! Nice find. I'll mail you the tests I have using NewDirectByteBuffer.

deephacks / lmdbjni

Avoid copying in mdb_cursor_get if possible #4