garlick opened 3 months ago
I was thinking about this last night and my initial intuition was that that kind of trick wouldn't work. I thought that most modern databases constantly append new data, and the only way to reclaim disk space is through "compacting"/"defragmenting" it. In sqlite that appears to be done via "vacuum" (https://www.sqlite.org/lang_vacuum.html).
And even if we did this, we might be racing against the thingie filling up disk (e.g. errant logging). But it's better than nothing.
The hack above seems to work around this by pre-allocating space in a database table and then dropping it. So perhaps this will work, but it seems to suggest we need to pre-allocate the entire database before using it the first time? Perhaps this idea might work if we pre-allocate critical "checkpoint" and similar things to a separate database from the core KVS? (I guess this would be via some other backing store ... or maybe a way to load the backing store module twice?)
Nevertheless, it's more than worthwhile to experiment with this idea and see.
We should do an experiment, but based on those Stack Overflow comments, it seems like sqlite must reuse space within the db file, and vacuum just shrinks the file so it doesn't contain any free space.
> And even if we did this, we might be racing against the thingie filling up disk (e.g. errant logging). But it's better than nothing.
I was thinking this would be done up front when flux starts. The sys admins would configure content.sqlite.preallocate = "20G" or something, and if that much space cannot be grabbed when the module is initialized, it fails and flux fails to start. So if the disk fills up later, we're immune.
Same basic idea as a partition but apparently a partition is more of a pain for the sys admins.
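(For illustration, a minimal sketch of how a size string like "20G" could be turned into a byte count; the content.sqlite.preallocate key is just the strawman from above and the helper below is hypothetical, not existing flux code.)

```c
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: parse a size string like "20G", "512M", or "4096"
 * into a byte count.  Returns 0 on success, -1 on parse error. */
static int parse_size (const char *s, uint64_t *bytes)
{
    char *end;
    uint64_t n = strtoull (s, &end, 10);
    uint64_t mult = 1;

    if (end == s)
        return -1;
    switch (toupper ((unsigned char)*end)) {
        case 'K': mult = 1ULL << 10; end++; break;
        case 'M': mult = 1ULL << 20; end++; break;
        case 'G': mult = 1ULL << 30; end++; break;
        case 'T': mult = 1ULL << 40; end++; break;
        case '\0': break;
        default: return -1;
    }
    if (*end != '\0')
        return -1;
    *bytes = n * mult;
    return 0;
}
```

With that, "20G" would come out as 21474836480 bytes.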
So the first test didn't seem to work.
Created a 5M tmpfs.
$ rm -rf /test/tmpfs-5m/statedir/*
$ df -h /test/tmpfs-5m/
Filesystem Size Used Avail Use% Mounted on
tmpfs 5.0M 0 5.0M 0% /test/tmpfs-5m
$ src/cmd/flux start -o,-Sstatedir=/test/tmpfs-5m/statedir
(inside instance)
$ exit
$ ls -lh /test/tmpfs-5m/statedir/
total 20K
-rw-r----- 1 chu11 chu11 20K Aug 5 23:30 content.sqlite
So we got about a 20K database file. Let's do the trick listed in the Stack Overflow post.
$ sqlite3 /test/tmpfs-5m/statedir/content.sqlite
SQLite version 3.26.0 2018-12-01 12:34:55
Enter ".help" for usage hints.
sqlite> create table large (a);
sqlite> insert into large values (zeroblob(1024*1024));
sqlite> drop table large;
sqlite> .q
$ ls -lh /test/tmpfs-5m/statedir/
total 1.1M
-rw-r----- 1 chu11 chu11 1.1M Aug 5 23:31 content.sqlite
OK good, now the database is listed as about 1 meg.
Re-start the instance w/ this statedir, and fill up the statedir with some kruft via a script I wrote.
$ src/cmd/flux start -o,-Sstatedir=/test/tmpfs-5m/statedir
$ ./filldir.sh /test/tmpfs-5m/statedir/
$ df -h /test/tmpfs-5m/
Filesystem Size Used Avail Use% Mounted on
tmpfs 5.0M 5.0M 0 100% /test/tmpfs-5m
$ ls -lh /test/tmpfs-5m/statedir/
total 5.0M
-rw-r----- 1 chu11 chu11 1.1M Aug 5 23:31 content.sqlite
-rw-r----- 1 chu11 chu11 32K Aug 5 23:31 content.sqlite-shm
-rw-r----- 1 chu11 chu11 57K Aug 5 23:31 content.sqlite-wal
-rw-r----- 1 chu11 chu11 100K Aug 5 23:31 foo0.txt
-rw-r----- 1 chu11 chu11 100K Aug 5 23:31 foo1.txt
-rw-r----- 1 chu11 chu11 100K Aug 5 23:31 foo10.txt
<snip>
So if sqlite is re-using the dead space, I should be able to run a job. But it didn't work:
$ flux run hostname
Aug 05 23:32:38.790233 UTC content-sqlite.err[0]: store: executing stmt: database or disk is full(13)
Aug 05 23:32:38.790273 UTC content-sqlite.err[0]: store: executing stmt: database or disk is full(13)
My suspicion is that this is b/c we use the write-ahead log. There may technically be space in the actual database that can be re-used, but there is no space for the write-ahead log file.
Further playing around / investigation to be done.
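For anyone poking at this, here's a small standalone sketch (my own, not flux code) that asks sqlite how many free pages the main db file has; it helps separate "no reusable space inside content.sqlite" from "no room on disk for the -wal file":

```c
/* Standalone sketch (not flux code): report how much reusable free space
 * the main database file has.  Build: cc inspect.c -lsqlite3 -o inspect */
#include <stdio.h>
#include <sqlite3.h>

static int query_int (sqlite3 *db, const char *sql, sqlite3_int64 *val)
{
    sqlite3_stmt *stmt;
    int rc = -1;

    if (sqlite3_prepare_v2 (db, sql, -1, &stmt, NULL) != SQLITE_OK)
        return -1;
    if (sqlite3_step (stmt) == SQLITE_ROW) {
        *val = sqlite3_column_int64 (stmt, 0);
        rc = 0;
    }
    sqlite3_finalize (stmt);
    return rc;
}

int main (int argc, char **argv)
{
    sqlite3 *db;
    sqlite3_int64 page_size, page_count, freelist_count;

    if (argc != 2) {
        fprintf (stderr, "Usage: inspect DBFILE\n");
        return 1;
    }
    if (sqlite3_open (argv[1], &db) != SQLITE_OK)
        return 1;
    if (query_int (db, "PRAGMA page_size", &page_size) == 0
        && query_int (db, "PRAGMA page_count", &page_count) == 0
        && query_int (db, "PRAGMA freelist_count", &freelist_count) == 0)
        printf ("pages=%lld free=%lld (%lld bytes reusable)\n",
                (long long)page_count,
                (long long)freelist_count,
                (long long)(freelist_count * page_size));
    sqlite3_close (db);
    return 0;
}
```

If the db reports plenty of free pages but stores still fail, that points the finger at the -wal file, per the suspicion above.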
As an aside, I tried using VACUUM
$ sqlite3 /test/tmpfs-5m/statedir/content.sqlite
SQLite version 3.26.0 2018-12-01 12:34:55
sqlite> vacuum;
sqlite> .q
$ ls -lh /test/tmpfs-5m/statedir/
total 4.0M
-rw-r----- 1 chu11 chu11 20K Aug 5 23:33 content.sqlite
$ src/cmd/flux start -o,-Sstatedir=/test/tmpfs-5m/statedir
$ flux run hostname
fluxorama
$ flux run hostname
Aug 05 23:34:23.608985 UTC content-sqlite.err[0]: store: executing stmt: database or disk is full(13)
So that did do the trick to free up some space.
Ouch, I forgot about the separate -wal file, that makes the whole thing a lot less plausible. We might legitimately be better off making ourselves a little loopback or fuse filesystem or something. It's not as nice for any number of reasons, but we could do it entirely without admin involvement if we need to. This is what I usually do to get a 1gb ext4 FWIW:
$ mkdir mnt
$ dd if=/dev/zero of=loopbackfile.img bs=100M count=10
$ mkfs.ext4 loopbackfile.img
$ unshare -rm bash
# fuse2fs loopbackfile.img mnt
The trick would be we'd have to both do the mount and launch the broker under the unshare. Not sure if it's worth it, but it would work.
Confirmed that with journaling turned off (PRAGMA journal_mode=OFF) the pre-allocation trick works.
@trws I assume the goal of using unshare is to hide the mount from other users. Is it strictly necessary? Especially on rank 0 (i.e. the mgmt node that normal users aren't on)?
Let's not do a loopback mount. The fact that our admins are inexplicably resistant to cordoning off space on elcapi doesn't seem like justification for adding this to Flux, and it would create complications for access from other than the broker, e.g. during recovery or backup.
Plus it feels like we're taking a "no" from the admins and turning it into a "yes" with jedi mind tricks. They control the flux config and the fstab. If I were an admin, I would summarily dismiss this option and either create the fstab entry or not.
> @trws I assume the goal of using unshare is to hide the mount from other users. Is it strictly necessary? Especially on rank 0 (i.e. the mgmt node that normal users aren't on)?
That's a side-effect, but no that's not the reason. It's there to create a mount namespace so that we can mount the fuse filesystem without admin access. It makes it completely doable as just the flux user, in the systemd unit or a script or similar.
To your point @garlick, it's clearly not at all ideal, I do get that. If we could grab a library and use it to just open a regular file as a backing store to let us do this, that would be a whole lot more satisfying. Just trying to throw out whatever I can think of in hopes maybe we'll come up with something better. 😕
There are such things, but I don't know if there's a good way to hook them into sqlite. Maybe that's something to look at: an alternate storage backend for sqlite that would let us just point it at a pre-allocated file? Will do some digging.
> Just trying to throw out whatever I can think of in hopes maybe we'll come up with something better.
Gotcha.
We might try again asking for a proper file system and avoid expending too much effort here.
A random idea that crossed my mind, so I'm just going to throw it out there so I don't forget (super related to #6154).
The pre-allocate trick might be something we can use for "desperate" and "important" situations, e.g. we're tearing down and trying to checkpoint final rootrefs. In those desperate situations, we could just turn off the write-ahead log, etc. We could perhaps just pre-allocate a few megs in the sqlite database for these desperate situations and delete it if needed.
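A rough sketch of that "break glass" idea (the reserve table name and helpers are made up, and per the earlier discussion this only helps if the failure is for space inside content.sqlite itself rather than the -wal file):

```c
#include <sqlite3.h>

/* Hypothetical "break glass" helper: drop a reserve table that was filled
 * with zeroblob data at startup, freeing its pages for reuse so a final
 * checkpoint write can succeed. */
static int release_emergency_reserve (sqlite3 *db)
{
    char *errmsg = NULL;

    if (sqlite3_exec (db, "DROP TABLE IF EXISTS reserve",
                      NULL, NULL, &errmsg) != SQLITE_OK) {
        sqlite3_free (errmsg);
        return -1;
    }
    return 0;
}

/* Hypothetical usage: retry a failed store once after releasing the reserve. */
static int store_with_reserve (sqlite3 *db, sqlite3_stmt *store_stmt)
{
    int rc = sqlite3_step (store_stmt);

    if (rc == SQLITE_FULL && release_emergency_reserve (db) == 0) {
        sqlite3_reset (store_stmt);
        rc = sqlite3_step (store_stmt);
    }
    return rc;
}
```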
Oh that's a good idea! Couldn't we just pre-allocate whatever large amount of space we want, then if we ever get ENOSPC, turn off the WAL and try again?
> Oh that's a good idea! Couldn't we just pre-allocate whatever large amount of space we want, then if we ever get ENOSPC, turn off the WAL and try again?
Oh that's a fair point, I didn't even think of that. If we hit ENOSPC, we could just turn off WAL in general.
I guess in my mind the general risks associated with not having WAL weren't worth it in the general sense. We only want to do that in "desperate times".
> Maybe that's something to look at: an alternate storage backend for sqlite that would let us just point it at a pre-allocated file? Will do some digging.
The first hit that came up was this. Pluggable sqlite backend store is...a KVS! Yow. This is what it feels like when you realize you might have one too many abstraction layers.
> Oh that's a fair point, I didn't even think of that. If we hit ENOSPC, we could just turn off WAL in general.
Hmmmm. If we pre-allocate, is it safe if we turn off WAL and set synchronous to "FULL"?
> FULL (2) When synchronous is FULL (2), the SQLite database engine will use the xSync method of the VFS to ensure that all content is safely written to the disk surface prior to continuing. This ensures that an operating system crash or power failure will not corrupt the database. FULL synchronous is very safe, but it is also slower. FULL is the most commonly used synchronous setting when not in WAL mode.
from https://www.sqlite.org/pragma.html
I'm thinking it is .... I can't convince myself otherwise right now.
Edit: in other words, when we reach this ENOSPC bad state ... making sqlite slower is an acceptable trade-off to ensure we can continue operations.
Edit 2: Hmmmm, maybe not as safe as we'd like.
> The OFF journaling mode disables the rollback journal completely. No rollback journal is ever created and hence there is never a rollback journal to delete. The OFF journaling mode disables the atomic commit and rollback capabilities of SQLite. The ROLLBACK command no longer works; it behaves in an undefined way. Applications must avoid using the ROLLBACK command when the journal mode is OFF. If the application crashes in the middle of a transaction when the OFF journaling mode is set, then the database file will very likely go corrupt. Without a journal, there is no way for a statement to unwind partially completed operations following a constraint error. This might also leave the database in a corrupted state. For example, if a duplicate entry causes a CREATE UNIQUE INDEX statement to fail half-way through, it will leave behind a partially created, and hence corrupt, index. Because OFF journaling mode allows the database file to be corrupted using ordinary SQL, it is disabled when SQLITE_DBCONFIG_DEFENSIVE is enabled.
So journal_mode = OFF is more risky. Setting journaling to PERSIST may be possible?
> The PERSIST journaling mode prevents the rollback journal from being deleted at the end of each transaction. Instead, the header of the journal is overwritten with zeros. This will prevent other database connections from rolling the journal back. The PERSIST journaling mode is useful as an optimization on platforms where deleting or truncating a file is much more expensive than overwriting the first block of a file with zeros. See also: PRAGMA journal_size_limit and SQLITE_DEFAULT_JOURNAL_SIZE_LIMIT.
I'm not sure if we could "pre-allocate" a sqlite journal file so that it is always there.
Or there could be a "chain" of riskiness as we get more and more desperate.
Yeah, that's kind of what I meant. I have no idea if the transition is safe. Might have to close the database (which I assume would fully process the WAL) and reopen it again with different pragmas?
> Yeah, that's kind of what I meant. I have no idea if the transition is safe. Might have to close the database (which I assume would fully process the WAL) and reopen it again with different pragmas?
Yeah, I think we have to close & re-open with new pragmas. Which I don't think is a big deal, since sqlite is effectively disabled by the time we would try this anyways. i.e. if we can't re-open with new pragmas, we're not really any worse off than we were before.
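Something like this is the close-and-reopen fallback I have in mind, as a sketch only (error handling trimmed, names made up):

```c
#include <sqlite3.h>

/* Hypothetical fallback: close the db and re-open it with the WAL disabled
 * and synchronous=FULL, trading throughput for the ability to keep operating
 * after ENOSPC.  Closing the old connection first gives sqlite a chance to
 * checkpoint and remove the existing -wal file. */
static int reopen_degraded (const char *path, sqlite3 **dbp)
{
    sqlite3 *db = NULL;

    if (*dbp) {
        sqlite3_close (*dbp);
        *dbp = NULL;
    }
    if (sqlite3_open (path, &db) != SQLITE_OK)
        goto error;
    if (sqlite3_exec (db, "PRAGMA journal_mode=OFF",
                      NULL, NULL, NULL) != SQLITE_OK)
        goto error;
    if (sqlite3_exec (db, "PRAGMA synchronous=FULL",
                      NULL, NULL, NULL) != SQLITE_OK)
        goto error;
    *dbp = db;
    return 0;
error:
    if (db)
        sqlite3_close (db);
    return -1;
}
```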
Well if we have space reserved, we should be OK I would think! We should try it.
> Well if we have space reserved, we should be OK I would think! We should try it.
Agreed, some experimental work with these ideas is worthwhile. Gonna try some stuff out.
To your point above @garlick, the sqlite4 default pluggable backend is now a log-structured merge tree (kvs basically) that actually uses some of the WAL methods even outside of WAL mode. Might be another thing to consider, especially if its side files are a stable size.
> Agreed, some experimental work with these ideas is worthwhile. Gonna try some stuff out.
As an initial experiment, did the following.
So I think in principle, this can work. Lots of work to try and get this to work in the actual content-sqlite, but will start with an experimental build.
As an aside, using PERSIST journal mode didn't work. So it's going to have to be "OFF" + "FULL"
I got an experimental "preallocate" working w/ content-sqlite. First things first, wanted to see if this impacts performance in a meaningful way. Initially it appeared there was a small performance dip.
In my docker container I would:
$ mkdir /tmp/statedir
$ src/cmd/flux start -o,-Sstatedir=/tmp/statedir
$ src/test/throughput.py -sn 16384
w/o pre-allocation my results are
throughput: 102.7 job/s (script: 102.4 job/s)
throughput: 94.6 job/s (script: 94.4 job/s)
throughput: 107.9 job/s (script: 107.6 job/s)
an average of 101.7 jobs/s
when I pre-allocate ~1G of space
throughput: 95.9 job/s (script: 95.6 job/s)
throughput: 89.3 job/s (script: 89.0 job/s)
throughput: 101.9 job/s (script: 101.6 job/s)
an average of 95.7 jobs/s.
So around a 5.8% dip. Granted this is not the best test, running in a docker container on my laptop.
Then I tried 32K jobs instead of 16K to try and get a better average over time.
[chu11@fluxorama src]$ ls -lh /tmp/statedir ; src/test/throughput.py -sn 32768; ls -lh /tmp/statedir
total 120K
-rw-r----- 1 chu11 chu11 4.0K Aug 8 21:45 content.sqlite
-rw-r----- 1 chu11 chu11 113K Aug 8 21:45 content.sqlite-wal
32768 jobs: 33478 submitted, 0 running, 32768 completed
number of jobs: 32768
submit time: 18.076s (1812.8 job/s)
script runtime: 511.485s
job runtime: 511.244s
throughput: 64.1 job/s (script: 64.1 job/s)
total 708M
-rw-r----- 1 chu11 chu11 704M Aug 8 21:54 content.sqlite
-rw-r----- 1 chu11 chu11 4.0M Aug 8 21:54 content.sqlite-wal
[chu11@fluxorama src]$ ls -lh /tmp/statedir ; src/test/throughput.py -sn 32768; ls -lh /tmp/statedir
total 1.9G
-rw-r----- 1 chu11 chu11 954M Aug 8 22:02 content.sqlite
-rw-r----- 1 chu11 chu11 961M Aug 8 22:02 content.sqlite-wal
32768 jobs: 39071 submitted, 0 running, 32768 completed
number of jobs: 32768
submit time: 113.907s (287.7 job/s)
script runtime: 495.136s
job runtime: 494.585s
throughput: 66.3 job/s (script: 66.2 job/s)
total 1.9G
-rw-r----- 1 chu11 chu11 954M Aug 8 22:11 content.sqlite
-rw-r----- 1 chu11 chu11 961M Aug 8 22:11 content.sqlite-wal
The pre-allocate run is a bit faster (66.3 vs 64.1). So I think the minor variance appears to be within the acceptable error range of what we're expecting and shouldn't scare us away from continuing on w/ the pre-allocate work.
(side note, I tried 65536 jobs too, but it was so slow I gave up. I have to pre-allocate 2G of space instead of 1G, and ...
As an aside, just for experimental purposes, I tried to run this test w/ journal_mode "OFF" + synchronous "FULL" just to see how much performance impact it has (note, was not preallocating)
[chu11@fluxorama src]$ ls -lh /tmp/statedir ; src/test/throughput.py -sn 32768; ls -lh /tmp/statedir
total 20K
-rw-r----- 1 chu11 chu11 20K Aug 8 22:28 content.sqlite
32768 jobs: 34024 submitted, 0 running, 32768 completed
number of jobs: 32768
submit time: 113.068s (289.8 job/s)
script runtime: 607.843s
job runtime: 606.158s
throughput: 54.1 job/s (script: 53.9 job/s)
total 730M
-rw-r----- 1 chu11 chu11 730M Aug 8 22:38 content.sqlite
That's a nice chunk below the other results (~18%) and outside of our expected "error range".
Edit: hmmmmm, at 64K jobs, pre-allocate was a lot slower. Perhaps I need to test on some beefier LC machines.
[chu11@fluxorama src]$ ls -lh /tmp/statedir ; src/test/throughput.py -sn 65536; ls -lh /tmp/statedir
total 120K
-rw-r----- 1 chu11 chu11 4.0K Aug 8 23:07 content.sqlite
-rw-r----- 1 chu11 chu11 113K Aug 8 23:07 content.sqlite-wal
65536 jobs: 75272 submitted, 0 running, 65536 completed
number of jobs: 65536
submit time: 175.457s (373.5 job/s)
script runtime: 1716.859s
job runtime: 1384.939s
throughput: 47.3 job/s (script: 38.2 job/s)
total 1.8G
-rw-r----- 1 chu11 chu11 1.8G Aug 8 23:30 content.sqlite
-rw-r----- 1 chu11 chu11 4.0M Aug 8 23:30 content.sqlite-wal
[chu11@fluxorama src]$ ls -lh /tmp/statedir ; src/test/throughput.py -sn 65536; ls -lh /tmp/statedir
total 3.8G
-rw-r----- 1 chu11 chu11 1.9G Aug 8 23:41 content.sqlite
-rw-r----- 1 chu11 chu11 1.9G Aug 8 23:41 content.sqlite-wal
65536 jobs: 73590 submitted, 0 running, 65536 completed
number of jobs: 65536
submit time: 227.151s (288.5 job/s)
script runtime: 1942.600s
job runtime: 1937.242s
throughput: 33.8 job/s (script: 33.7 job/s)
total 3.9G
-rw-r----- 1 chu11 chu11 2.0G Aug 9 00:15 content.sqlite
-rw-r----- 1 chu11 chu11 1.9G Aug 9 00:15 content.sqlite-wal
Edit 2: at 32K jobs, performance is identical on corona.
>rm -rf /tmp/achu/statedir/* ; src/cmd/flux start -o,-Sstatedir=/tmp/achu/statedir/ "ls -lh /tmp/achu/statedir; src/test/throughput.py -sn 32768; ls -lh /tmp/achu/statedir"
total 132K
-rw-r--r-- 1 achu achu 4.0K Aug 8 21:11 content.sqlite
-rw-r--r-- 1 achu achu 113K Aug 8 21:11 content.sqlite-wal
32768 jobs: 33113 submitted, 0 running, 32768 completed
number of jobs: 32768
submit time: 26.111s (1255.0 job/s)
script runtime: 1516.388s
job runtime: 1514.940s
throughput: 21.6 job/s (script: 21.6 job/s)
total 357M
-rw-r--r-- 1 achu achu 353M Aug 8 21:36 content.sqlite
-rw-r--r-- 1 achu achu 4.0M Aug 8 21:36 content.sqlite-wal
>rm -rf /tmp/achu/statedir/* ; src/cmd/flux start -o,-Sstatedir=/tmp/achu/statedir/ "ls -lh /tmp/achu/statedir; src/test/throughput.py -sn 32768; ls -lh /tmp/achu/statedir"
total 2.0G
-rw-r--r-- 1 achu achu 954M Aug 8 22:18 content.sqlite
-rw-r--r-- 1 achu achu 961M Aug 8 22:18 content.sqlite-wal
32768 jobs: 33175 submitted, 0 running, 32768 completed
number of jobs: 32768
submit time: 29.227s (1121.2 job/s)
script runtime: 1518.172s
job runtime: 1516.772s
throughput: 21.6 job/s (script: 21.6 job/s)
total 1.9G
-rw-r--r-- 1 achu achu 954M Aug 8 22:43 content.sqlite
-rw-r--r-- 1 achu achu 961M Aug 8 22:43 content.sqlite-wal
Just posted a WIP in #6217, but wanted to bring up the following important thing for discussion.
Pre-allocate will only work if journaling is not used. The code in #6217 will disable journaling if it is currently enabled and ENOSPC is hit. Per previous comments, that's a healthy performance hit; 10-20% is a good guesstimate on lost job throughput. So we'd want to re-enable journaling once the bad actor filling up the disk is corrected and disk space is available again.
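Whatever interface we pick for that, we'd probably also want a cheap check that space has actually come back before flipping WAL on again. A sketch using statvfs(3), with an arbitrary threshold chosen by the caller:

```c
#include <stdbool.h>
#include <stdint.h>
#include <sys/statvfs.h>

/* Hypothetical check: does the filesystem holding the statedir have at least
 * 'want' bytes available again?  Could gate re-enabling WAL after it was
 * turned off due to ENOSPC. */
static bool disk_space_recovered (const char *statedir, uint64_t want)
{
    struct statvfs st;

    if (statvfs (statedir, &st) < 0)
        return false;
    return (uint64_t)st.f_bavail * st.f_frsize >= want;
}
```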
Brainstorming ...
- a flux content backend sqlite recover command that sends an RPC to content-sqlite and tells it to close/re-open the db
- a flux recovery ENOSPC command, that can tell multiple components to recover (i.e. sends the magical RPC regardless of the backing store)
- maybe there's something more clever
Maybe flux config reload could trigger a database close/open when necessary to set new pragmas OR when config says/implies that WAL should be in use but it is not (e.g. has been turned off at runtime due to ENOSPC)?
> Maybe flux config reload could trigger a database close/open when necessary to set new pragmas OR when config says/implies that WAL should be in use but it is not (e.g. has been turned off at runtime due to ENOSPC)?
Ahhh, that's a good idea. I didn't think of doing it that way. I guess in my mind the question is how to inform the admins that something in flux is running in a bit of a "degraded mode" and needs to be improved, thus my thinking of a flux fix-me/recover-me kinda command. But perhaps just saying "hey, do flux config reload" is good enough?
Edit: hmmm, I guess this solution would require admins to configure journal_mode and synchronous mode in a content-sqlite.toml file. Which is not super-wonderful, but is not horrible, especially if they are pre-allocating space anyways.
So, wanting to test how fast preallocate was going to be once it's used in production, I preallocated about 50g and stuck some timings in rc1 to get some estimates. This is in my fluxorama image on my laptop.
[chu11@fluxorama src]$ rm -rf statedir/*; src/cmd/flux start -o,-Sstatedir=/usr/src/statedir flux run hostname; cat /tmp/timing; ls -lh /usr/src/statedir
fluxorama
Tue Aug 20 22:01:25 UTC 2024
Tue Aug 20 22:09:20 UTC 2024
total 47G
-rw-r----- 1 chu11 chu11 47G Aug 20 22:09 content.sqlite
[chu11@fluxorama src]$ rm -rf statedir/*; src/cmd/flux start -o,-Sstatedir=/usr/src/statedir flux run hostname; cat /tmp/timing; ls -lh /usr/src/statedir
fluxorama
Tue Aug 20 22:15:24 UTC 2024
Tue Aug 20 22:22:43 UTC 2024
total 47G
-rw-r----- 1 chu11 chu11 47G Aug 20 22:22 content.sqlite
[chu11@fluxorama src]$ rm -rf statedir/*; src/cmd/flux start -o,-Sstatedir=/usr/src/statedir flux run hostname; cat /tmp/timing; ls -lh /usr/src/statedir
fluxorama
Tue Aug 20 22:32:06 UTC 2024
Tue Aug 20 22:38:53 UTC 2024
total 47G
-rw-r----- 1 chu11 chu11 47G Aug 20 22:38 content.sqlite
Eek, that's sorta bad. On average we're looking at around 6-8 minutes.
On corona, I got 6-11 minutes.
Sooooo not great.
I'm not sure if this is a deal breaker or not. I was going to try smaller "block sizes" to see if they have any effect.
But perhaps we need to look at alternate ideas. One could be to pre-allocate while flux is running vs on initialization.
Update: doh! 64 meg blocks are even worse, generally speaking 10+ minutes on my laptop.
> I haven't really looked at the code closely so you might already be doing this but:
> using zeroblob to create the data https://stackoverflow.com/questions/827010/sqlite-pre-allocating-database-size
yup (well, `sqlite3_bind_zeroblob()`, which is hopefully identical)
> disable the journal since WAL would double the amount of data landing on disk
yup
> set synchronous=OFF since the database is going to be closed at the end and that should flush everything out at once
Ahhh good guess. I was doing FULL b/c that is the "safe" operation. But given we're creating junk data for pre-allocation, that shouldn't be necessary. This got the 50g preallocate to a little under 3 minutes in my container. Perhaps that's a tolerable time? Of course, longer times for 100g if we end up going that big.
Edit: on corona, times are hovering around 5-6 mins now ... a bit better.
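For reference, this is roughly the shape of the experiment if anyone wants to reproduce the timing outside of flux (a standalone sketch, not the PR code; run it under time(1); the 50g target and 32M blob size are just the values used above):

```c
/* Standalone sketch of the preallocate-then-drop step (not the PR code).
 * Build: cc prealloc.c -lsqlite3 -o prealloc, then run under time(1). */
#include <stdio.h>
#include <sqlite3.h>

#define TARGET   (50LL * 1024 * 1024 * 1024)   /* 50g target, illustrative */
#define BLOBSIZE (32LL * 1024 * 1024)          /* 32M per row, illustrative */

int main (int argc, char **argv)
{
    sqlite3 *db = NULL;
    long long remaining = TARGET;

    if (argc != 2 || sqlite3_open (argv[1], &db) != SQLITE_OK)
        return 1;
    /* journaling off + synchronous=OFF: the data is junk, so durability
     * doesn't matter and we just want the file to grow as fast as possible */
    if (sqlite3_exec (db, "PRAGMA journal_mode=OFF", NULL, NULL, NULL) != SQLITE_OK
        || sqlite3_exec (db, "PRAGMA synchronous=OFF", NULL, NULL, NULL) != SQLITE_OK
        || sqlite3_exec (db, "CREATE TABLE preallocate (data BLOB)",
                         NULL, NULL, NULL) != SQLITE_OK)
        goto error;
    while (remaining > 0) {
        if (sqlite3_exec (db,
                          "INSERT INTO preallocate (data)"
                          " values (zeroblob(32*1024*1024))",
                          NULL, NULL, NULL) != SQLITE_OK)
            goto error;
        remaining -= BLOBSIZE;
    }
    if (sqlite3_exec (db, "DROP TABLE preallocate", NULL, NULL, NULL) != SQLITE_OK)
        goto error;
    sqlite3_close (db);
    return 0;
error:
    fprintf (stderr, "sqlite error: %s\n", sqlite3_errmsg (db));
    sqlite3_close (db);
    return 1;
}
```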
So just brainstorming some preallocation ideas that would be faster:
- preallocate in the background after content-sqlite starts up? Would require us to start the database in non-journaling mode then convert to journaling mode later. So some medium trickery here.
- similar to above, preallocate but w/ WAL on. We have to write small blocks and "checkpoint" them so we don't run out of space with the WAL. The primary trickery here is that the journaling mode configured by the user is sort of ignored.
- (half wild idea) revisit "online" garbage collection of the KVS, so this preallocate effectively becomes a one-time cost.
- (medium wild idea) "vendor" sqlite so we can increase the maximum blob size from 1 billion bytes to something larger. Although I guess this wouldn't help too much. I think the giant fsync() is the real problem for performance.
New ideas:
- random idea: Instead of pre-allocating via a new database, pre-allocate to the primary database and just delete that data.
- random idea: preallocate as we go. Hypothetically, we always want 1G more space than we're currently at.
New idea2
Some extra testing:
Increasing the sqlite page size from its default 4K to 64K improved things a tad (perhaps 6 mins to 5 mins on corona).
+ if (sqlite3_exec (ctx->db,
+ "PRAGMA page_size=65536",
+ NULL,
+ NULL,
+ NULL) != SQLITE_OK) {
+ log_sqlite_error (ctx, "setting sqlite 'page_size' pragma");
+ goto error;
+ }
unfortunately 64K is the max page size for sqlite.
Collapsing all writes into a single transaction didn't seem to do much, which in hindsight makes sense, given 50g is only on the order of 50-75 writes of 800-900 megs. Not enough to make a huge impact.
diff --git a/src/modules/content-sqlite/content-sqlite.c b/src/modules/content-sqlite/content-sqlite.c
index 172fc33b1..2ccf2c18f 100644
--- a/src/modules/content-sqlite/content-sqlite.c
+++ b/src/modules/content-sqlite/content-sqlite.c
@@ -867,6 +867,15 @@ static int content_sqlite_preallocate (struct content_sqlite *ctx)
goto error;
}
+ if (sqlite3_exec (ctx->db,
+ "BEGIN TRANSACTION",
+ NULL,
+ NULL,
+ NULL) != SQLITE_OK) {
+ log_sqlite_error (ctx, "begin preallocate transaction");
+ goto error;
+ }
+
while (preallocate_count > 0) {
if (preallocate_count < PREALLOCATE_BLOBSIZE)
blobsize = preallocate_count;
@@ -885,6 +894,15 @@ static int content_sqlite_preallocate (struct content_sqlite *ctx)
preallocate_count -= blobsize;
}
+ if (sqlite3_exec (ctx->db,
+ "END TRANSACTION",
+ NULL,
+ NULL,
+ NULL) != SQLITE_OK) {
+ log_sqlite_error (ctx, "end preallocate transaction");
+ goto error;
+ }
+
if (sqlite3_exec (ctx->db,
sql_drop_table_preallocate,
NULL,
Tried increasing the cache size, thinking maybe that'll help with whatever indexing sqlite does internally. Didn't seem to do much.
diff --git a/src/modules/content-sqlite/content-sqlite.c b/src/modules/content-sqlite/content-sqlite.c
index 172fc33b1..58ba6830c 100644
--- a/src/modules/content-sqlite/content-sqlite.c
+++ b/src/modules/content-sqlite/content-sqlite.c
@@ -751,6 +751,14 @@ static int content_sqlite_opendb (struct content_sqlite *ctx,
log_sqlite_error (ctx, "setting sqlite 'page_size' pragma");
goto error;
}
+ if (sqlite3_exec (ctx->db,
+ "PRAGMA cache_size=2048",
+ NULL,
+ NULL,
+ NULL) != SQLITE_OK) {
+ log_sqlite_error (ctx, "setting sqlite 'cache_size' pragma");
+ goto error;
+ }
(side note: 2048 is the number of pages, so that's 2048*64K = 128M for the cache size)
Thought maybe using only prepared statements w/ pre-defined zeroblobs might be faster, but that didn't seem to do much.
-const char *sql_preallocate_fill = "INSERT INTO preallocate (data) values (?1)";
+const char *sql_preallocate_fill = "INSERT INTO preallocate (data) values (zeroblob(1024*1024*32))";
Smaller "blocks" of writes didn't seem to change much.
I noticed something as I was playing around, from some timings I stuck into the code for debugging. For a pre-allocation of 50g:
Aug 21 15:17:12.135993 PDT content-sqlite.crit[0]: start 1724278632
Aug 21 15:17:12.136003 PDT content-sqlite.crit[0]: loop 1724278632
Aug 21 15:17:48.876913 PDT content-sqlite.crit[0]: drop 1724278668
Aug 21 15:20:13.628756 PDT content-sqlite.crit[0]: pre end 1724278813
Aug 21 15:20:13.633270 PDT content-sqlite.crit[0]: end 1724278813
The actual time to open the db, write cruft, drop the table, and close the db takes about 3 minutes (181 seconds). The majority of the time is in the drop table command (145 seconds) vs the loop to write the data (36 seconds).
That is probably a clue. Initial guess: the writes have not yet internally completed whatever is going on (fsync-ing? indexing kruft?), so the drop command has to wait for their completion? Not sure. One experimental idea I had was to create all the kruft, then delete it, then drop the table. Dunno if that would speed things up.
But also, my call to date before and after the flux module load content-sqlite:
Wed Aug 21 15:17:12 PDT 2024
Wed Aug 21 15:22:18 PDT 2024
That's about 306 seconds.
So something else appears to be chewing up time unrelated to the pre-allocation. I haven't dug into what yet. Dunno if mountains of memory pages are taken up and there's something else going on in there.
Ahhhh ... re-opening the database a second time is a huge part of the slowdown. That adds 2 minutes on top of the pre-allocate.
Aug 21 17:02:01.670400 PDT content-sqlite.crit[0]: start 1724284921
Aug 21 17:02:01.670409 PDT content-sqlite.crit[0]: loop 1724284921
Aug 21 17:02:40.769235 PDT content-sqlite.crit[0]: drop 1724284960
Aug 21 17:05:07.919787 PDT content-sqlite.crit[0]: pre end 1724285107
Aug 21 17:05:07.924535 PDT content-sqlite.crit[0]: end 1724285107
Aug 21 17:05:07.924543 PDT content-sqlite.crit[0]: opendb again 1724285107
Aug 21 17:07:17.134969 PDT content-sqlite.crit[0]: done opendb again 1724285237
That also means the current implementation in the PR, which would close the db and re-open it if we hit ENOSPC, would be quite slow.
Hmmm, gonna take some thought.
@chu11, I think we should pause this idea. It's starting to sound like more trouble than it's worth.
> It's starting to sound like more trouble than it's worth.
Well I was thinking about this last night and this morning. Perhaps this discussion should be more about pros vs cons or tradeoffs we want to accept.
If surviving ENOSPC is something we care about, then there simply has to be a cost.
- With the current implementation I have, that cost is increased startup time.
- If we accept smaller ENOSPC survival odds, we can decrease startup time via a smaller preallocation.
- We can reduce startup time if we trade off job throughput (i.e. default from journal_mode=WAL to OFF).
- If we implement one or several of the brainstorming ideas I list above (e.g. preallocate in the background), we trade off ENOSPC survival odds (i.e. we might hit ENOSPC while we are preallocating).
- etc. etc.
In my mind, an extra 5 mins startup time doesn't seem like a deal breaker for preallocating 50g. I dunno ... is it?
(As an aside, I just thought of a new idea for my brainstorm list above: instead of preallocating a fixed amount of space ... pre-allocate an amount of space larger than the current size, i.e. "10g more" than whatever the current db size is. This could even be done in rc1 by checking the old DB size and doing some math.)
Edit: I just saw you commented in the PR, didn't see that till after I wrote this.
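To make the "10g more than the current db size" math concrete, a hypothetical sketch of the calculation (nothing flux-specific, just stat(2); names and sizes are illustrative):

```c
#include <stdint.h>
#include <sys/stat.h>

/* Hypothetical sketch of "the math": stat() the existing content.sqlite (if
 * any) and return the desired total file size, i.e. current size plus the
 * requested headroom (e.g. ~10g).  The preallocate step would then insert
 * (target - current size) bytes of zeroblob data and drop the table. */
static uint64_t preallocate_target (const char *dbpath, uint64_t headroom)
{
    struct stat sb;

    if (stat (dbpath, &sb) < 0)   /* no db yet: just preallocate the headroom */
        return headroom;
    return (uint64_t)sb.st_size + headroom;
}
```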
Problem: when the disk containing the sqlite database fills up on rank 0, flux does not handle it well, e.g.
#6155
#6154
#6100
#6099
#5978 (edit via @chu11)
@trws suggested providing an option to pre-allocate space in the database.
One approach is to write a huge blob to the db and remove it: https://stackoverflow.com/questions/827010/sqlite-pre-allocating-database-size
Some steps may also need to be taken to ensure writes to content.sqlite-wal do not fail. More study required on that one: https://www.sqlite.org/wal.html
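One knob that may help bound how big content.sqlite-wal gets between checkpoints is an explicit truncating checkpoint plus journal_size_limit; a sketch of how that could be poked from C (whether it's sufficient here is exactly the "more study required" part):

```c
#include <sqlite3.h>

/* Sketch: cap how large a journal sqlite leaves around after checkpoints
 * (16M here is arbitrary), then checkpoint the WAL into the main db file,
 * truncating content.sqlite-wal to zero bytes if successful. */
static int shrink_wal (sqlite3 *db)
{
    int log_frames = 0, ckpt_frames = 0;

    if (sqlite3_exec (db, "PRAGMA journal_size_limit=16777216",
                      NULL, NULL, NULL) != SQLITE_OK)
        return -1;
    if (sqlite3_wal_checkpoint_v2 (db, NULL, SQLITE_CHECKPOINT_TRUNCATE,
                                   &log_frames, &ckpt_frames) != SQLITE_OK)
        return -1;
    return 0;
}
```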