Tarsnap / kivaloo

Kivaloo is a collection of utilities which together form a data store associating keys of up to 255 bytes with values of up to 255 bytes.
http://www.tarsnap.com/kivaloo.html
Other
201 stars 17 forks source link

Kivaloo perf #240

Closed cperciva closed 3 years ago

gperciva commented 3 years ago

BTW, once https://github.com/Tarsnap/spiped/pull/308 is merged, I can add the branch that makes kivaloo share object files. :)

cperciva commented 3 years ago

Yeah, I haven't forgotten -- just wanted to get this branch off my laptop first.

gperciva commented 3 years ago

I imagine that you've already benchmarked this, but just for confirmation, there's no obvious bottlenecks at the level of the c source. When I used it for 100 iterations of test_lbs, 9% of the total time was spent on CRC32C_Update_SSE42, 6.85% on elasticarray_get(), and the other functions in the c code are even less of the total.

That said, I did notice that 13.06% is spent on clock_gettime(). That includes 6.94% of the total (relative to the overall time, not relative to the 13% in clock_gettim()) calling elf_aux_info. FreeBSD 12.2, amd64.

I wonder if clock_gettime() is checking for the available clocks every time we call it? I see that FreeBSD has a CLOCK_MONOTONIC_FAST (and Linux has a CLOCK_MONOTONIC_COARSE which looks like the same idea).

I don't have a good handle on the accuracy you want from this, nor how much of a trade-off there is in the freebsd kernel and/or system libraries with CLOCK_FOO_FAST. Also, even a theoretically instanteous gettime would only save 13% of the overall speed, so perhaps it's not worth the complexity.

gperciva commented 3 years ago

For comparison, out of libc, perf is using

time % function
13.06 clock_gettime
4.60 free
4.36 malloc
2.17 poll
1.51 memcpy
1.32 time
1.04 recv
1.03 `realloc
< 1 other functions

All of those make sense to me, other than clock_gettime.

cperciva commented 3 years ago

The call to elf_aux_info is libc looking up the VDSO timehands -- this allows clock_gettime to return without making a syscall (it gets the time from the CPU cycle counter + an offset). Using this I don't think the "fast"/"coarse" timers provide any speed benefit.

In any case, performance seems to be good enough for now -- I was able to push a bit over 10^6 kvlds operations per second through perf. We can always reassess it later if needed but I think it will be a while before we need to deal with anything beyond that point.

cperciva commented 3 years ago

@gperciva Can you glance over this and make sure I didn't do anything stupid while cleaning up the git history?

gperciva commented 3 years ago

Looks good to me.