canonical / dqlite

Embeddable, replicated and fault-tolerant SQL engine.
https://dqlite.io
Other
3.77k stars 212 forks source link

context deadline exceeded in perf-test #321

Open ktsakalozos opened 2 years ago

ktsakalozos commented 2 years ago

@balchua crafted a small performance test that results into a context deadline exceeded under fairly common circumstances (i think). The project with this work in [1] creates a few go routines that insert and query a dqlite database.

To reproduce the issue I build the snap with snapcraft (thank you @balchua for making this easy for us to reproduce), created a VM with 4 cores and 4GB of ram with multipass launch -c 4 -m 4G (to be sure no local libraries are used), install the snap and then run:

sudo -E LIBRAFT_TRACE="1" LIBDQLITE_TRACE="1" /snap/perf-dqlite/current/perf-dqlite

I am attaching here the logs of a run. perf-dqlite.log

[1] https://github.com/balchua/perf-test-dqlite.git

balchua commented 2 years ago

The perf-test project is to simulate how kubernetes access its datastore under load.

Hopefully this can help give insights where the bottleneck occurs. I imagine using multiple dqlite nodes can further increase the chances of this happening.

MathieuBordere commented 2 years ago

Thanks a lot for this!

MathieuBordere commented 2 years ago

The problem is mainly related to the fact that there can be only 1 simultaneous writer to the DB, and an unlucky streak where a writer finds the DB locked every time it retries. I've had success by using a more aggressive retry strategy, see https://github.com/MathieuBordere/perf-test-dqlite/tree/aggressive-retry

Can you give that a try?

balchua commented 2 years ago

I will give this a try. Another question, does this also block reads?

MathieuBordere commented 2 years ago

It should generally not block reads, see https://sqlite.org/wal.html

balchua commented 2 years ago

Thanks. i tried it to be more aggressive retries. It looks better. I understand that insert is bound by single writer, im trying the read (query only) by having about a thousand go routines. Its normal to have thousands of go routines in kubernetes.

it will fail after some time.

Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-432 took 2.033171036s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-700 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-700 took 3.58400125s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-389 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-389 took 3.587552684s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-931 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-931 took 2.68671149s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-430 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-430 took 1.966152604s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:476 not needed
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE query_barrier_cb:364 query barrier cb
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 runner-255 - Retrieved successfully 55919.
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: 2021/09/04 15:32:45 TIMETRACKING: query - runner-255 took 1.308847448s
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE gateway__handle:991 gateway handle
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE handle_query_sql:537 handle query sql
Sep 04 15:32:45 norse perf-dqlite.daemon-perf[71066]: LIBDQLITE leader__barrier:473 leader barrier
Sep 04 15:33:02 norse systemd[1]: snap.perf-dqlite.daemon-perf.service: Main process exited, code=exited, status=1/FAILURE
Sep 04 15:33:02 norse systemd[1]: snap.perf-dqlite.daemon-perf.service: Failed with result 'exit-code'.
Sep 04 15:33:03 norse systemd[1]: snap.perf-dqlite.daemon-perf.service: Scheduled restart job, restart counter is at 2.
Sep 04 15:33:03 norse systemd[1]: Stopped Service for snap application perf-dqlite.daemon-perf.
Sep 04 15:33:03 norse systemd[1]: Started Service for snap application perf-dqlite.daemon-perf.

Update: I tried this scenario using only sqlite, the context deadline exceeded also happens when i mix inserts and queries or inserts only with thousands of go-routines. But, running queries only in sqlite do not cause the context deadline exceeded error nor it ends abruptly, but dqlite it fails as shown above.

MathieuBordere commented 2 years ago

Thanks for the update, will have a look.

MathieuBordere commented 2 years ago

@balchua I cannot reproduce the crash atm with only readers, but I'm able to reproduce the context deadline exceeded error. It is my impression that the operation is CPU-bound. I am able to run the test stable with 300 concurrent readers, but it fails (rather quickly) with a 1000 readers. I'm not really sure if it is a big issue, hundreds of readers already seems like a lot (for dqlite).

balchua commented 2 years ago

Thanks @MathieuBordere for taking time to look into this. I have no idea why it crashes, my guess is that i have an older or different version of dqlite. I will have to try this again on a clean VM. Wrt to the number of readers, as far as i can tell, kubernetes do have thousands of go routines.

While I don't know if all these goroutines accesses the api server (and hence dqlite) but it does add a lot of pressure to the data store as soon as the user starts to use operators.

Last question, is there plan to have read only nodes?

balchua commented 2 years ago

Hi. I came across this BEGIN CONCURRENT in sqlite. Has anyone tried it?

https://sqlite.org/src/doc/begin-concurrent/doc/begin_concurrent.md#:~:text=Usually%2C%20SQLite%20allows%20at%20most,system%20still%20serializes%20COMMIT%20commands

MathieuBordere commented 2 years ago

I haven't, but looks interesting at first sight