canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.
https://dqlite.io

server: Expose uv block_size setting. #478

Closed · MathieuBordere closed this 1 year ago

MathieuBordere commented 1 year ago

Exposes the raft setting that controls uv->block_size. I started digging a bit because I had horrible results with the dqlite-benchmark tool on my laptop: roughly 99.5% of the async writes failed with EAGAIN at the default 4KB block size. Changing the block size to 64KB raises the success rate of the async writes to roughly 80% and doubles the write throughput (i.e. halves the average time a default benchmark write takes) on both of my machines.

I think it's worth exposing this setting so that users can experiment with it; a lot will depend on the kind of workload.
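
For illustration, here's a minimal sketch of how the knob could be used from the C API. The setter name dqlite_node_set_block_size is an assumption based on this PR's diff; check dqlite.h for the exact name and signature:

```c
#include <dqlite.h>

int main(void)
{
	dqlite_node *node;
	int rv;

	rv = dqlite_node_create(1, "127.0.0.1:9001", "/var/lib/dqlite-demo",
	                        &node);
	if (rv != 0) {
		return 1;
	}
	rv = dqlite_node_set_bind_address(node, "127.0.0.1:9001");
	if (rv != 0) {
		goto err;
	}

	/* Assumed setter from this PR: override the auto-detected 4KB
	 * block size with 64KB, the value that roughly doubled write
	 * throughput in the benchmark runs above. Must be called before
	 * dqlite_node_start(). */
	rv = dqlite_node_set_block_size(node, 64 * 1024);
	if (rv != 0) {
		goto err;
	}

	rv = dqlite_node_start(node);
	if (rv != 0) {
		goto err;
	}
	/* ... serve requests, then dqlite_node_stop(node) ... */
err:
	dqlite_node_destroy(node);
	return rv;
}
```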

codecov[bot] commented 1 year ago

Codecov Report

Merging #478 (f967641) into master (704532e) will increase coverage by 0.06%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #478      +/-   ##
==========================================
+ Coverage   60.97%   61.03%   +0.06%     
==========================================
  Files          34       34              
  Lines        6339     6349      +10     
  Branches     1884     1886       +2     
==========================================
+ Hits         3865     3875      +10     
  Misses       1293     1293              
  Partials     1181     1181              
| Impacted Files | Coverage Δ |
| --- | --- |
| src/server.c | 56.29% <100.00%> (+0.82%) ↑ |

MathieuBordere commented 1 year ago

@freeekanayaka Can you share some insight regarding the uv->block_size parameter in libraft and how it was chosen? Was the 4KB maximum size https://github.com/canonical/raft/blob/5243fa2cb568456f58dd6e5852fc11d95cd08b72/src/uv_fs.c#L667 chosen for some (extra) data-durability guarantee, or something else? Any insight is welcome.

freeekanayaka commented 1 year ago

> @freeekanayaka Can you share some insight regarding the uv->block_size parameter in libraft and how it was chosen? Was the 4KB maximum size https://github.com/canonical/raft/blob/5243fa2cb568456f58dd6e5852fc11d95cd08b72/src/uv_fs.c#L667 chosen for some (extra) data-durability guarantee, or something else? Any insight is welcome.

I think 4KB was the best practical choice at the time, since it was what the kernel recommended, but I might misremember, or things might have changed in the meantime, hardware included.

It should be ok to modify it.
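
For context on what "what the kernel recommended" means in practice: the simplest signal user space can get is the filesystem's reported preferred block size via statfs(2). raft's actual detection in src/uv_fs.c is more involved than this, so treat the sketch below as a first approximation only:

```c
#include <stdio.h>
#include <sys/statfs.h>

/* Print the filesystem's preferred I/O block size for a path. As the
 * benchmark numbers above show, the value the kernel reports (often
 * 4KB) is not necessarily the one that maximizes async-write
 * throughput. */
int main(int argc, char *argv[])
{
	struct statfs fs;
	const char *path = argc > 1 ? argv[1] : ".";

	if (statfs(path, &fs) != 0) {
		perror("statfs");
		return 1;
	}
	printf("%s: f_bsize = %ld bytes\n", path, (long)fs.f_bsize);
	return 0;
}
```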

freeekanayaka commented 1 year ago

Instead of exposing this setting, would it perhaps be better to fix the auto-detection? Of course both things can be done if you think it's important to have a way to bypass the auto-detection. However, if the auto-detection is reasonably good, perhaps it's ok not to expose it: one less knob. If it turns out that there are cases where the auto-detection is faulty, it can be fixed (like in this case). Just thinking out loud, I'm fine either way.

MathieuBordere commented 1 year ago

Yes, I think there's certainly room to improve the auto-detection, and I was planning to get to that eventually. This is more of a first step: an easy way to turn the knob and understand the influence of the parameter.

cole-miller commented 1 year ago

Is this ready to merge? I don't think moving the size checks into libraft is necessarily a blocking concern, and it would be nice to make it available for others to experiment with.

MathieuBordere commented 1 year ago

I'll still move it to libraft.

cole-miller commented 1 year ago

I think we're going to defer the libraft implementation of this API -- the main thing is to have a knob for this somewhere that go-dqlite (and the C client) can see.

calvin2021y commented 1 year ago

> Exposes the raft setting that controls uv->block_size. I started digging a bit because I had horrible results with the dqlite-benchmark tool on my laptop: roughly 99.5% of the async writes failed with EAGAIN at the default 4KB block size. Changing the block size to 64KB raises the success rate of the async writes to roughly 80% and doubles the write throughput (i.e. halves the average time a default benchmark write takes) on both of my machines.
>
> I think it's worth exposing this setting so that users can experiment with it; a lot will depend on the kind of workload.

Maybe you can try reformatting your disk's LBA format.
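
For background, the LBA format is the logical sector size a namespace is formatted with. It can be inspected from C with the BLKSSZGET/BLKPBSZGET ioctls, as in this minimal sketch (actually changing the LBA format requires a tool like nvme-cli and destroys the data on the namespace):

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	const char *dev = argc > 1 ? argv[1] : "/dev/nvme0n1";
	int logical = 0;
	unsigned int physical = 0;
	int fd = open(dev, O_RDONLY);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	/* Logical sector size (the current LBA format) and the physical
	 * sector size of the underlying media. */
	if (ioctl(fd, BLKSSZGET, &logical) != 0 ||
	    ioctl(fd, BLKPBSZGET, &physical) != 0) {
		perror("ioctl");
		close(fd);
		return 1;
	}
	printf("%s: logical=%d physical=%u bytes\n", dev, logical, physical);
	close(fd);
	return 0;
}
```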

freeekanayaka commented 1 year ago

FWIW, libuv now has support for io_uring for file operations, so it should be relatively easy to move away from the complicated aio-based code that we currently use for async file writes. Or, alternatively, go for native io_uring. AFAIK io_uring takes care of details like optimal block size.
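
For reference, a minimal sketch of an async write through libuv's filesystem API; with libuv >= 1.45 and a recent kernel, a call like this is dispatched via io_uring instead of the threadpool, with no aio-style EAGAIN handling needed in the caller:

```c
#include <stdio.h>
#include <string.h>
#include <uv.h>

static void on_write(uv_fs_t *req)
{
	/* req->result is the number of bytes written, or a negative
	 * libuv error code. */
	if (req->result < 0)
		fprintf(stderr, "write: %s\n", uv_strerror((int)req->result));
	else
		printf("wrote %zd bytes\n", req->result);
	uv_fs_req_cleanup(req);
}

int main(void)
{
	uv_loop_t *loop = uv_default_loop();
	uv_fs_t open_req, write_req;
	char data[] = "hello, io_uring\n";
	uv_buf_t buf = uv_buf_init(data, (unsigned)strlen(data));
	int fd;

	/* Synchronous open (NULL callback) for brevity. */
	fd = uv_fs_open(loop, &open_req, "/tmp/uv-demo",
	                UV_FS_O_WRONLY | UV_FS_O_CREAT, 0644, NULL);
	uv_fs_req_cleanup(&open_req);
	if (fd < 0) {
		fprintf(stderr, "open: %s\n", uv_strerror(fd));
		return 1;
	}

	/* Asynchronous write at the current file offset (-1). */
	uv_fs_write(loop, &write_req, fd, &buf, 1, -1, on_write);
	return uv_run(loop, UV_RUN_DEFAULT);
}
```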

calvin2021y commented 1 year ago

@freeekanayaka libuv + io_uring is ready: https://github.com/libuv/libuv/commit/96e05543f53b19d9642b4b0dd73b86ad3cea313e

MathieuBordere commented 1 year ago

> FWIW, libuv now has support for io_uring for file operations, so it should be relatively easy to move away from the complicated aio-based code that we currently use for async file writes. Or, alternatively, go for native io_uring. AFAIK io_uring takes care of details like optimal block size.

I would also support depending on a specialized system to detect the optimal block size instead of doing it ourselves, suboptimally.

freeekanayaka commented 1 year ago

>> FWIW, libuv now has support for io_uring for file operations, so it should be relatively easy to move away from the complicated aio-based code that we currently use for async file writes. Or, alternatively, go for native io_uring. AFAIK io_uring takes care of details like optimal block size.
>
> I would also support depending on a specialized system to detect the optimal block size instead of doing it ourselves, suboptimally.

Is there one?

Switching to io_uring via libuv might be an easier and longer-term fix (in the sense that the aio kernel subsystem is pretty much deprecated right now).

MathieuBordere commented 1 year ago

>>> FWIW, libuv now has support for io_uring for file operations, so it should be relatively easy to move away from the complicated aio-based code that we currently use for async file writes. Or, alternatively, go for native io_uring. AFAIK io_uring takes care of details like optimal block size.
>>
>> I would also support depending on a specialized system to detect the optimal block size instead of doing it ourselves, suboptimally.
>
> Is there one?

Yeah, I meant io_uring doing it for us, or some other database backend.