Incubaid / crakoon

A C client for Arakoon

Multiget performance problem #14

Open bart-devylder opened 11 years ago

bart-devylder commented 11 years ago

We're seeing an issue with crakoon's multiget performance (against a single-node Arakoon cluster, using the arakoon 1.6.0 deb from arakoon.org). The test case (code attached; it's C++ but uses the plain crakoon API) does 4096 multigets with a batch size of 1; the value size is 4096:

    ./ara_multi_get --cluster test --nodes test_0,127.0.0.1,12345
    keys: 4096, value_size: 4096, batch_size: 1
    set      took 4.94317 seconds  -> 3.23679   MiB/s / 828.618 IOPS
    get      took 0.243652 seconds -> 65.6674   MiB/s / 16810.9 IOPS
    multiget took 163.901 seconds  -> 0.0976198 MiB/s / 24.9907 IOPS

This is a factor of 100 slower than the python client (from the arakoon git repo, branch 1.6):

In [3]: import ara_multi_get

In [4]: client = ara_multi_get.make_client()

    In [5]: ara_multi_get.test_multigets??
    Type:           function
    Base Class:     <type 'function'>
    String Form:    <function test_multigets at 0x26406e0>
    Namespace:      Interactive
    File:           /home/arne/Projects/scrapyard/ara_multi_get.py
    Definition:     ara_multi_get.test_multigets(client, items, batchsize)
    Source:
    def test_multigets(client, items, batchsize):
        keys = [ struct.pack('Q', k) for k in xrange(items) ]

        logging.info("starting multigets")

        with Timer() as t:
            for i in xrange(0, items / batchsize):
                j = i * batchsize
                client.multiGet(keys[j : j + batchsize])

        logging.info("multigets with batchsize %d took %.03f sec: %.02f IOPS" %
                     (batchsize,
                      t.interval,
                      items / t.interval))

    In [6]: ara_multi_get.test_multigets(client, 4096, 1)
    2013-09-20 10:27:40,816 starting multigets
    2013-09-20 10:27:42,524 multigets with batchsize 1 took 1.707 sec: 2399.18 IOPS

The slowdown can also be observed with bigger batches (batch size 64 -> C: ~1000 IOPS, Python: ~3500 IOPS).

NicolasT commented 10 years ago

What exactly do you mean by 'batch size'?

NicolasT commented 10 years ago

As noted in a TODO in crakoon's arakoon_multi_get code:

        iter = arakoon_value_list_create_iter(keys);
        FOR_ARAKOON_VALUE_ITER(iter, &value_size, &value) {
                /* TODO Multi syscall vs memory copies... */
                WRITE_BYTES(master, &value_size,
                        ARAKOON_PROTOCOL_UINT32_LEN, rc, &timeout);
                RETURN_IF_NOT_SUCCESS(rc);

                WRITE_BYTES(master, value, value_size, rc,
                        &timeout);
                RETURN_IF_NOT_SUCCESS(rc);
        }
        arakoon_value_list_iter_free(iter);

Unlike the Python client, which constructs a single large string containing the whole request (including all keys) and sends it to the node in one write call (or a few, if required), the crakoon implementation first sends the command prefix with one write call, then issues two more for every key in the request.

This can cause significant syscall overhead.
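For comparison, here is a minimal C sketch of the single-buffer strategy the Python client uses: serialize all length-prefixed keys into one contiguous buffer and hand it to the kernel with a single write call. The function name, signature, and error handling are illustrative assumptions, not crakoon's actual API, which streams the request through its WRITE_BYTES machinery.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

static ssize_t send_keys_buffered(int fd, const char *const *keys,
                                  const uint32_t *sizes, size_t count)
{
        size_t total = 0, off = 0, i;
        char *buf;
        ssize_t n;

        /* First pass: compute the total request size. */
        for (i = 0; i < count; i++)
                total += sizeof(uint32_t) + sizes[i];

        buf = malloc(total);
        if (buf == NULL)
                return -1;

        /* Second pass: copy each 32-bit size prefix and key payload.
         * Host byte order is used here, which on x86 happens to match
         * Arakoon's little-endian framing; a portable version would
         * serialize the length explicitly. */
        for (i = 0; i < count; i++) {
                memcpy(buf + off, &sizes[i], sizeof(uint32_t));
                off += sizeof(uint32_t);
                memcpy(buf + off, keys[i], sizes[i]);
                off += sizes[i];
        }

        /* One syscall for the whole batch; a robust version would loop
         * on short writes. */
        n = write(fd, buf, total);
        free(buf);
        return n;
}
```

The trade-off is an extra memory copy per key, which is exactly the tension the TODO comment in the code above alludes to.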

Using writev might help quite a bit, but some profiling is in order to make sure this is the actual root cause of the performance difference. As a first step, halving the number of write calls by sending the key size and key contents with a single writev call should be fairly easy to implement. In a second stage, sending everything through one large iovec should reduce the syscall overhead even further.
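A sketch of that first step: coalesce the 32-bit size prefix and the key payload into one writev(2) call, halving the number of syscalls per key. The function name and signature are assumptions for illustration; the real crakoon code would also need to preserve the timeout handling done by WRITE_BYTES.

```c
#include <stdint.h>
#include <sys/uio.h>
#include <unistd.h>

static ssize_t send_sized_value(int fd, const void *value, uint32_t value_size)
{
        /* Length prefix in host byte order; on x86 this matches
         * Arakoon's little-endian framing. */
        uint32_t len = value_size;
        struct iovec iov[2];

        iov[0].iov_base = &len;
        iov[0].iov_len = sizeof len;
        iov[1].iov_base = (void *)value;
        iov[1].iov_len = value_size;

        /* One syscall instead of two; a robust version must loop until
         * all sizeof(len) + value_size bytes have been written. */
        return writev(fd, iov, 2);
}
```

For the second stage, the whole request (command prefix plus every prefixed key) could be described by a single iovec array and sent with one writev call, subject to the IOV_MAX limit on array length.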