Open ktsakalozos opened 3 years ago
The perf-test project simulates how Kubernetes accesses its datastore under load.
Hopefully this can help give insight into where the bottleneck occurs. I imagine that using multiple dqlite nodes further increases the chances of this happening.
Thanks a lot for this!
The problem is mainly down to the fact that there can be only one writer to the DB at a time, combined with an unlucky streak where a writer finds the DB locked on every retry. I've had success with a more aggressive retry strategy, see https://github.com/MathieuBordere/perf-test-dqlite/tree/aggressive-retry
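Roughly, the idea is a retry loop along these lines (an illustrative sketch with made-up helper names, not the exact change in that branch):

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
	"time"
)

// retryBusy retries fn while the database reports it is locked/busy,
// using a short, capped backoff instead of giving up early.
func retryBusy(ctx context.Context, fn func() error) error {
	backoff := 5 * time.Millisecond
	for {
		err := fn()
		// SQLite/dqlite surface write contention as a "database is locked"
		// (SQLITE_BUSY) error; anything else is returned to the caller.
		if err == nil || !strings.Contains(err.Error(), "database is locked") {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < 100*time.Millisecond {
			backoff *= 2 // stay aggressive: short, capped backoff
		}
	}
}

func main() {
	// Stand-in for a real INSERT via database/sql: fails twice, then succeeds.
	attempts := 0
	op := func() error {
		attempts++
		if attempts < 3 {
			return errors.New("database is locked")
		}
		return nil
	}
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println("result:", retryBusy(ctx, op), "after", attempts, "attempts")
}
```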
Can you give that a try?
I will give this a try. Another question: does this also block reads?
It should generally not block reads, see https://sqlite.org/wal.html
Thanks. I tried the more aggressive retries and it looks better.
I understand that inserts are bound by the single writer, so I'm now testing reads (queries only) with about a thousand goroutines. It's normal to have thousands of goroutines in Kubernetes.
It fails after some time.
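For context, the read-only load looks roughly like this (a simplified sketch; the real code is in the perf-test-dqlite repo, and the table name and query here are just placeholders):

```go
package perftest

import (
	"context"
	"database/sql"
	"log"
	"sync"
	"time"
)

// runReaders launches n goroutines that repeatedly run the same query with a
// per-query deadline, mimicking many concurrent Kubernetes clients.
// db is assumed to be a *sql.DB opened through the go-dqlite driver.
func runReaders(db *sql.DB, n int, deadline time.Duration) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			for {
				ctx, cancel := context.WithTimeout(context.Background(), deadline)
				start := time.Now()
				var count int
				err := db.QueryRowContext(ctx, "SELECT COUNT(*) FROM kv").Scan(&count)
				cancel()
				if err != nil {
					// With ~1000 readers this is where
					// "context deadline exceeded" starts showing up.
					log.Printf("runner-%d: %v", id, err)
					return
				}
				log.Printf("runner-%d retrieved %d in %s", id, count, time.Since(start))
			}
		}(i)
	}
	wg.Wait()
}
```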
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-432 took 2.033171036s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-700 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-700 took 3.58400125s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-389 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-389 took 3.587552684s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-931 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-931 took 2.68671149s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-430 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-430 took 1.966152604s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-255 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-255 took 1.308847448s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:33:02 norse systemd[1]: snap.perf-dqlite.daemon-perf.service: Main process exited, code=exited, status=1/FAILURE
Sep 04 15:33:02 norse systemd[1]: snap.perf-dqlite.daemon-perf.service: Failed with result 'exit-code'.
Sep 04 15:33:03 norse systemd[1]: snap.perf-dqlite.daemon-perf.service: Scheduled restart job, restart counter is at 2.
Sep 04 15:33:03 norse systemd[1]: Stopped Service for snap application perf-dqlite.daemon-perf.
Sep 04 15:33:03 norse systemd[1]: Started Service for snap application perf-dqlite.daemon-perf.
Update:
I tried this scenario using only sqlite. The context deadline exceeded error also happens when I mix inserts and queries, or run inserts only, with thousands of goroutines. However, running queries only against sqlite neither causes the context deadline exceeded error nor ends abruptly, whereas dqlite fails as shown above.
Thanks for the update, will have a look.
@balchua I cannot reproduce the crash atm with only readers, but I am able to reproduce the context deadline exceeded error. My impression is that the operation is CPU-bound. I can run the test stably with 300 concurrent readers, but it fails (rather quickly) with 1000 readers. I'm not really sure it is a big issue; hundreds of readers already seems like a lot (for dqlite).
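One thing that could help on the client side is capping the number of in-flight queries, e.g. with a buffered channel used as a counting semaphore (just a sketch, assuming a *sql.DB obtained from go-dqlite; the test doesn't do this today):

```go
package perftest

import (
	"context"
	"database/sql"
)

// boundedDB caps how many queries are in flight against dqlite at once by
// using a buffered channel as a counting semaphore; extra callers wait for a
// free slot (or give up when their context expires) instead of piling up.
type boundedDB struct {
	db    *sql.DB
	slots chan struct{}
}

func newBoundedDB(db *sql.DB, limit int) *boundedDB {
	return &boundedDB{db: db, slots: make(chan struct{}, limit)}
}

// Count runs a single-value query while holding a slot.
func (b *boundedDB) Count(ctx context.Context, query string) (int, error) {
	select {
	case b.slots <- struct{}{}: // acquire a slot
	case <-ctx.Done():
		return 0, ctx.Err()
	}
	defer func() { <-b.slots }() // release the slot when the query is done

	var n int
	err := b.db.QueryRowContext(ctx, query).Scan(&n)
	return n, err
}
```

With the limit set to a few hundred, the remaining goroutines queue for a slot instead of all hitting the node at once.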
Thanks @MathieuBordere for taking the time to look into this. I have no idea why it crashes; my guess is that I have an older or different version of dqlite. I will have to try this again on a clean VM. Regarding the number of readers, as far as I can tell, Kubernetes does have thousands of goroutines.
I don't know whether all of these goroutines access the API server (and hence dqlite), but they do add a lot of pressure to the data store as soon as the user starts to use operators.
Last question: is there a plan to have read-only nodes?
Hi.
I came across BEGIN CONCURRENT in sqlite. Has anyone tried it?
I haven't, but it looks interesting at first sight.
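For reference, BEGIN CONCURRENT lives on SQLite's begin-concurrent branch (it's not in mainline releases and dqlite doesn't expose it today). Writers proceed optimistically and page-level conflicts only surface at COMMIT, so the usage pattern is roughly the following (a sketch, assuming a driver built against that branch and a placeholder kv table):

```go
package perftest

import (
	"context"
	"database/sql"
	"strings"
)

// concurrentInsert sketches the BEGIN CONCURRENT pattern: the transaction is
// opened with an explicit BEGIN CONCURRENT statement on a dedicated
// connection, and COMMIT can fail with SQLITE_BUSY_SNAPSHOT when another
// transaction modified the same pages, in which case the whole transaction
// is retried.
func concurrentInsert(ctx context.Context, db *sql.DB, k, v string) error {
	conn, err := db.Conn(ctx)
	if err != nil {
		return err
	}
	defer conn.Close()

	for {
		if _, err := conn.ExecContext(ctx, "BEGIN CONCURRENT"); err != nil {
			return err
		}
		_, err = conn.ExecContext(ctx, "INSERT INTO kv(k, v) VALUES(?, ?)", k, v)
		if err == nil {
			_, err = conn.ExecContext(ctx, "COMMIT")
		}
		if err == nil {
			return nil
		}
		conn.ExecContext(ctx, "ROLLBACK")
		// The exact error text depends on the driver; the page-level conflict
		// is reported as SQLITE_BUSY_SNAPSHOT at COMMIT time.
		if !strings.Contains(err.Error(), "BUSY") && !strings.Contains(err.Error(), "locked") {
			return err
		}
	}
}
```

Commits are still serialized, so it mainly helps when concurrent transactions touch disjoint pages.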
@balchua crafted a small performance test that results in a context deadline exceeded error under fairly common circumstances (I think). The project with this work in [1] creates a few goroutines that insert into and query a dqlite database. To reproduce the issue I built the snap with snapcraft (thank you @balchua for making this easy for us to reproduce), created a VM with 4 cores and 4GB of RAM with multipass launch -c 4 -m 4G (to be sure no local libraries are used), installed the snap and then ran it. I am attaching here the logs of a run: perf-dqlite.log
[1] https://github.com/balchua/perf-test-dqlite.git