bluesky / tiled

API to structured data
https://blueskyproject.io/tiled

Investigate and optimize SQLite connection pool management #663

Closed: danielballan closed this 4 months ago

danielballan commented 7 months ago

We have observed what appears to be frequent (re-)opening of the SQLite catalog database. I have read that applications should hold a connection pool with one read-write connection and N read connections.

At minimum, we should avoid thrashing the file like this (unless I am confused). At best, we should optimize our pool management.

Demo:

# Start server and strace it. Note -f.
$ strace -fe open,openat tiled serve catalog /tmp/strace_test.db --init --api-key secret -w /tmp/strace_test_data/

Note that this has to be warmed up, as Tiled lazily imports some Python modules.

from tiled.client import from_uri
c = from_uri('http://localhost:8000', api_key='secret')
c.write_array([1,2,3])  # Warm up
c.write_array([1,2, 3])  # Test
strace: Process 197660 attached
[pid 197660] openat(AT_FDCWD, "/tmp/strace_test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 15
[pid 197660] openat(AT_FDCWD, "/tmp/strace_test.db-wal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 16
[pid 197660] openat(AT_FDCWD, "/tmp/strace_test.db-shm", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 17
[pid 197635] openat(AT_FDCWD, "/tmp/strace_test_data/65d1e164-2a35-4422-959e-222ccbc57d73/.zarray", O_RDONLY|O_CLOEXEC) = 15
[pid 197635] openat(AT_FDCWD, "/tmp/strace_test_data/65d1e164-2a35-4422-959e-222ccbc57d73/0.f490a5d74b2a44f487f4f5c796f62228.partial", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 15
[pid 197657] +++ exited with 0 +++
INFO:     127.0.0.1:36470 - "PUT /api/v1/array/full/65d1e164-2a35-4422-959e-222ccbc57d73 HTTP/1.1" 200 OK
danielballan commented 7 months ago

I think the first mystery is: where do all these subprocesses come from?

nmaytan commented 6 months ago

No great insights yet, but I've made progress tracing these blocks of syscalls back to where they originate in Tiled. Ultimately, the subprocesses are opened by the awaited sqlalchemy methods. Below, the stdout is recreated as in the OP and in the same order it was generated. It's still possible there are artifacts of interleaved stdout left to sort out.

This process is created by the commit call in CatalogNodeAdapter.create_node on line 587.

strace: Process 209069 attached
[pid 209069] openat(AT_FDCWD, "/tmp/strace_test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 8
[pid 209069] openat(AT_FDCWD, "/tmp/strace_test.db-wal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 9
[pid 209069] openat(AT_FDCWD, "/tmp/strace_test.db-shm", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 10
[pid 209069] openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 12

This process is created by the refresh call just a few lines down, on line 595.

strace: Process 209070 attached
[pid 209070] openat(AT_FDCWD, "/tmp/strace_test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 8
[pid 209070] openat(AT_FDCWD, "/tmp/strace_test.db-wal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 9
[pid 209070] openat(AT_FDCWD, "/tmp/strace_test.db-shm", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 10

This is one of the "original" child processes Tiled creates, presumably opening the directory to update the data (I believe it's unrelated to the problem we're chasing).

[pid 209062] openat(AT_FDCWD, "/tmp/strace_test_data/c4725f4b-bddb-4831-b7ef-283ce4e63ead/.zarray.13b63d180985427bbb5da3616639b924.partial", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 12

At this point, ExplainAsyncSession appears to be invoked, but there is no resulting IO (possibly relevant, as a later call does invoke IO).

This possibly comes from another db commit, although it seemingly shares the previous process from above. It is also in CatalogNodeAdapter.create_node, but later, on line 666.

[pid 209070] openat(AT_FDCWD, "/tmp", O_RDONLY|O_CLOEXEC) = 12

And another matching refresh immediately following on line 667.

strace: Process 209071 attached
[pid 209071] openat(AT_FDCWD, "/tmp/strace_test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 8
[pid 209071] openat(AT_FDCWD, "/tmp/strace_test.db-wal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 9
[pid 209071] openat(AT_FDCWD, "/tmp/strace_test.db-shm", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 10

Here there is another apparent call to ExplainAsyncSession, but again no IO.

... followed by another call to ExplainAsyncSession, but this time with IO and a subprocess:

strace: Process 209072 attached
[pid 209072] openat(AT_FDCWD, "/tmp/strace_test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 8
[pid 209072] openat(AT_FDCWD, "/tmp/strace_test.db-wal", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 9
[pid 209072] openat(AT_FDCWD, "/tmp/strace_test.db-shm", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 10

This is that original child process again, which I still presume is just normal operation of updating the data.

[pid 209062] openat(AT_FDCWD, "/tmp/strace_test_data/c4725f4b-bddb-4831-b7ef-283ce4e63ead/.zarray", O_RDONLY|O_CLOEXEC) = 8
[pid 209062] openat(AT_FDCWD, "/tmp/strace_test_data/c4725f4b-bddb-4831-b7ef-283ce4e63ead/0.b3976f75f05b4c6da98be95417106181.partial", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 8
nmaytan commented 6 months ago

This may be the issue: pre-2.0 sqlalchemy did not use connection pooling by default for sqlite databases.

As of 2.0, it should by default use sqlalchemy.pool.QueuePool - but QueuePool does not work with sqlalchemy.ext.asyncio.create_async_engine. Instead, it should default to sqlalchemy.pool.AsyncAdaptedQueuePool.

The interesting tidbit is that this is apparently not the case: print(engine.pool.status()) gives NullPool for us, which means no connection pooling. If I set poolclass manually, the connections seem to be reused!

from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import AsyncAdaptedQueuePool

engine = create_async_engine(
    uri, echo=echo, json_serializer=json_serializer, poolclass=AsyncAdaptedQueuePool
)
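
As a quick sanity check on the engine constructed above (a minimal sketch; the first line should report the pool class in effect):

print(type(engine.pool).__name__)  # now "AsyncAdaptedQueuePool" rather than "NullPool"
print(engine.pool.status())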

[pid 10502] openat(AT_FDCWD, "/tmp/strace_test_data/128c82c2-69f1-4f19-919f-1a9bbc818c04/.zarray.22069755222f4102b24380f7ad6740c7.partial", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 12
[pid 10502] close(12)                   = 0
INFO:     127.0.0.1:48962 - "POST /api/v1/metadata/ HTTP/1.1" 200 OK
[pid 10502] openat(AT_FDCWD, "/tmp/strace_test_data/128c82c2-69f1-4f19-919f-1a9bbc818c04/.zarray", O_RDONLY|O_CLOEXEC) = 12
[pid 10502] close(12)                   = 0
[pid 10502] openat(AT_FDCWD, "/tmp/strace_test_data/128c82c2-69f1-4f19-919f-1a9bbc818c04/0.8f7982e2a8db4a69851346260783bf9b.partial", O_WRONLY|O_CREAT|O_TRUNC|O_CLOEXEC, 0666) = 12
[pid 10502] close(12)                   = 0
INFO:     127.0.0.1:48962 - "PUT /api/v1/array/full/128c82c2-69f1-4f19-919f-1a9bbc818c04 HTTP/1.1" 200 OK

It's not clear why the pool class is not being set automatically; supposedly we are using sqlalchemy >= 2.0.

nmaytan commented 6 months ago

Oh - it seems that aiosqlite is a bit different, and defaults to NullPool - see here.

So, mystery solved. I guess the other thing to do is to consider whether there are any unintended ill effects of using connection pooling with aiosqlite.

danielballan commented 6 months ago

Reading through various things, I'm having trouble understanding whether an application should close the SQLite file after a transaction or hold it open.

I'm also unsure where the subprocesses (the various pids) are coming from. It does not seem that aiosqlite uses subprocesses at all.

nmaytan commented 6 months ago

Agreed that the specific thing opening these subprocesses is mysterious. I've briefly combed through sqlalchemy and aiosqlite and haven't found anything that looks suspicious. What's also strange is that strace does not pick up anything (surrounding these processes) for clone, fork, vfork, or execve.

nmaytan commented 6 months ago

Okay, I'm becoming convinced that this is aiosqlite's doing. It creates a new thread per connection, so without connection pooling, each reconnection means a new thread.

A trivial test:

# terminal1
touch test.db
strace -f -tt -e openat,open -p $terminal2_PID  # attach to the iPython process in terminal2

# terminal2 (in iPython)
import aiosqlite
c = await aiosqlite.connect('test.db')
await c.execute("CREATE TABLE mytable (col1 string)")
await c.execute("INSERT INTO mytable (col1) VALUES ('val1')")
await c.commit()
await c.close()

c = await aiosqlite.connect('test.db')
res = await c.execute("SELECT col1 FROM mytable")
await res.fetchone()
await c.close()

Both connections result in new PIDs:

strace: Process 111875 attached
[pid 111875] 17:32:52.029723 openat(AT_FDCWD, "/home/nmaytan/test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 8

strace: Process 112256 attached
[pid 112256] 17:34:26.593616 openat(AT_FDCWD, "/home/nmaytan/test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 9

This happens even if you open a connection to a database you already have open, as aiosqlite has no built-in pooling/reuse.
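
For contrast, a sketch of reuse (same test.db as above, run in iPython): holding one connection open keeps a single worker thread alive for its lifetime, so repeated queries attach no new pids under strace.

import aiosqlite

c = await aiosqlite.connect('test.db')  # one new thread ("Process attached") here
for _ in range(10):
    res = await c.execute("SELECT col1 FROM mytable")
    await res.fetchone()  # reuses the same thread; no further pids appear
await c.close()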

Their documentation also explains this, if I'm understanding correctly:

aiosqlite allows interaction with SQLite databases on the main AsyncIO event loop without blocking execution of other coroutines while waiting for queries or data fetches. It does this by using a single, shared thread per connection.

nmaytan commented 6 months ago

Even more convincing:

In [26]: r = await aiosqlite.connect('test.db')

In [27]: r.native_id
Out[27]: 120553
strace: Process 120553 attached
[pid 120553] 18:09:26.444991 openat(AT_FDCWD, "/home/nmaytan/test.db", O_RDWR|O_CREAT|O_NOFOLLOW|O_CLOEXEC, 0644) = 12

edit: and to satisfy my own curiosity, it seems threading.Thread (which aiosqlite's Connection subclasses) ultimately calls pthread_create(), which calls clone() in the end, if I'm reading the man page correctly:

Both threading implementations employ the Linux clone(2) system call.

nmaytan commented 6 months ago

In trying to gather some of the community consensus re: pooling sqlite connections, here is an attempted summary, plus related details.

  1. We are using WAL on the catalog db, which means keeping connections open in the pool will prevent the database file from "checkpointing" as often. Checkpointing happens (1) when the last connection to the db closes, or (2) when the WAL file grows to 1000 pages (the default auto-checkpoint threshold). We should consider either what a good page-count trigger is for us, or consider running the checkpoint ourselves (a sketch follows this list). See here.
  2. There seem to be varying opinions on whether connection pooling should be used with sqlite. The primary argument against it is that opening a file handle is "cheap"/"fast" compared to a networked database connection, and that the performance gain (if any) is probably not worth the complexity of pool management (especially if you have to implement it yourself). Not everyone agrees with this, and I would also note that using the database over a networked FS may change the calculus.
  3. Any cache created by sqlite is lost upon dropping the database connection, which seems to hint that pooling connections could improve performance when queries frequently hit the cache. We might also want to consider tailoring the cache size via pragma.
  4. Before sqlalchemy started defaulting to QueuePool for sqlite, there was some discussion. The "downsides" seemed to be about a rollback happening in a competing thread and some vague worries about file locking. Nonetheless, the default behavior was changed. The OP and the PR both claim performance improvements from pooling. The docs however do recommend switching back to NullPool if issues with file locking are observed.
  5. Despite the threading concern just mentioned, sqlite can be thread safe, and must be for compatibility with sqlalchemy's QueuePool. By default, sqlite compiles with the "serialized" thread-safety configuration (i.e. "fully safe"). pysqlite has a convenient way, as of python 3.11, to check this value (sqlite3.threadsafety == 3 means serialized; it is checked in the sketch after this list). For RHEL, the default of serialized is used. It is possible to change this behavior at runtime, but pysqlite does not expose sqlite3_config() to do so [see here and here], and aiosqlite sits directly on top of pysqlite. Moreover, aiosqlite by its nature puts each connection in its own thread, so presumably connections will not escape from single threads anyway.
  6. Reading around, it doesn't appear there is any need or rigid practice to close the db connection after each transaction; the sqlite db is effectively in a "steady state" once a commit or rollback happens. Obviously, it's important to ensure commits happen at sane points. However, use of WAL does make this a little more complicated, since checkpointing frequency should be considered, as mentioned above.
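
To make items 1, 3, and 5 concrete, here is a minimal sketch (not Tiled's actual configuration: the path and pragma values are illustrative, and it assumes the usual sqlalchemy connect-event pragma recipe also works through the aiosqlite dialect) that checks the compiled thread-safety mode and sets the checkpoint and cache pragmas on each new pooled connection:

import sqlite3

from sqlalchemy import event
from sqlalchemy.ext.asyncio import create_async_engine
from sqlalchemy.pool import AsyncAdaptedQueuePool

# Python 3.11+: 3 means sqlite was compiled "serialized" (fully thread safe).
assert sqlite3.threadsafety == 3

engine = create_async_engine(
    "sqlite+aiosqlite:////tmp/catalog.db",  # illustrative path
    poolclass=AsyncAdaptedQueuePool,
)

@event.listens_for(engine.sync_engine, "connect")
def set_sqlite_pragmas(dbapi_connection, connection_record):
    # Fires once per new DBAPI connection created by the pool.
    cursor = dbapi_connection.cursor()
    cursor.execute("PRAGMA journal_mode=WAL")
    cursor.execute("PRAGMA wal_autocheckpoint=1000")  # pages (the sqlite default)
    cursor.execute("PRAGMA cache_size=-20000")  # negative means KiB, so ~20 MB
    cursor.close()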

Some links to the more worthwhile stuff:

  1. An aiosqlite issue where folks have some back and forth about whether connection pooling for sqlite is worthwhile. Notably it's pointed out that connection pooling/reuse must be used in order to fully benefit from database caching. A related SE post about someone wanting to use pooling to maintain the cache.
  2. Another aiosqlite issue of someone looking for a connection pool/reuse feature, which seems to be considered sane/safe to implement.
  3. Yet another aiosqlite issue with an opinion about connection pooling (in short: it could be beneficial, but the author is pessimistic given that the database is local or in memory)
  4. An SO post where it is recommended to use pooling with sqlite if many queries are made.
  5. A detailed response on SO about pooling vs not for sqlite (not conclusive).

The most robust thing to do is probably to benchmark whether this is worth all of the thinking, but if I had to make a snap decision from all of this, I lean towards trying out AsyncAdaptedQueuePool: no more constantly respawning processes/threads, chattering file IO, rebuilding the db cache, etc.

danielballan commented 6 months ago

This is so helpful!

Good find with pthread_create and clone. Now the strace output is clear.

I agree with your assessment. I think we should next add some benchmarks that hit the database pretty hard to our nascent benchmark suite (see https://github.com/bluesky/tiled/pull/641). Using those, we should verify that switching to AsyncAdaptedQueuePool does not cause any unexpected regressions. If that goes well, resulting in a tie or better against NullPool performance, we should switch: "less resource thrashing" is a good default position to take while we collect more information about our usage patterns, construct appropriate benchmarks, and identify bottlenecks across the service.
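
As a strawman for such a benchmark, in asv's conventions (the class and method names are hypothetical, and it assumes a local server running as in the OP):

from tiled.client import from_uri


class CatalogWriteSuite:
    # Hypothetical asv benchmark: many small writes, each of which
    # touches the catalog database several times.
    def setup(self):
        self.client = from_uri("http://localhost:8000", api_key="secret")

    def time_many_small_writes(self):
        for _ in range(100):
            self.client.write_array([1, 2, 3])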

I suspect the issues with file locking mentioned on the SQLAlchemy issue tracker came from users who were not following SQLite's advice to keep the database on local storage.

danielballan commented 5 months ago

In a group-coding session, an asv benchmark validated that AsyncAdaptedQueuePool provides a measurable performance improvement over NullPool. A forthcoming PR will make the change and add the benchmark.

danielballan commented 4 months ago

Closed by #710