G-Research / fasttrackml

Experiment tracking server focused on speed and scalability
https://fasttrackml.io/
Apache License 2.0

WAL file keeps growing until server is restarted #446

Open jgiannuzzi opened 10 months ago

jgiannuzzi commented 10 months ago

We are experiencing checkpoint starvation, as described in https://www.sqlite.org/wal.html#avoiding_excessively_large_wal_files. In combination with #445, this means that a FastTrackML server running on Kubernetes with the SQLite backend will have an ever-growing WAL file and will quickly run into disk space issues!
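One way to confirm that checkpoints are not completing is to run a passive checkpoint and compare the frame counts SQLite reports. The sketch below assumes a `*sql.DB` already opened on the SQLite file with whatever driver the server uses. The type and function names are hypothetical and are not part of FastTrackML.

```go
package main

import (
	"context"
	"database/sql"
)

// checkpointResult mirrors the three columns returned by
// PRAGMA wal_checkpoint. The struct itself is a hypothetical helper.
type checkpointResult struct {
	Busy         int // 1 if a blocking checkpoint mode could not finish
	WALFrames    int // frames written to the WAL (-1 if there is no WAL)
	Checkpointed int // frames moved back into the database file (-1 if no WAL)
}

// incomplete reports whether frames were left behind in the WAL,
// which is what a starved checkpoint looks like when it repeats.
func (r checkpointResult) incomplete() bool {
	return r.Busy != 0 || r.Checkpointed < r.WALFrames
}

// passiveCheckpoint runs a non-blocking checkpoint and returns its counters.
func passiveCheckpoint(ctx context.Context, db *sql.DB) (checkpointResult, error) {
	var r checkpointResult
	err := db.QueryRowContext(ctx, "PRAGMA wal_checkpoint(PASSIVE)").
		Scan(&r.Busy, &r.WALFrames, &r.Checkpointed)
	return r, err
}
```

If `incomplete()` keeps returning true across successive calls while `WALFrames` grows, readers are probably holding the checkpoint back.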

suprjinx commented 10 months ago

perhaps starting the server goroutine with a timeout context, which makes it periodically exit -- then do the truncate and restart (in a loop)?

jgiannuzzi commented 10 months ago

I got a POC running that simply tunes the connection pools so that we don't keep idle connections constantly open, and this works fine. However, I want to make sure I understand the potential performance implications of that before going further.
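For illustration, pool tuning of this kind could look like the following with the standard `database/sql` setters. This is not the POC itself: the `poolConfig` type, the function names, and the numbers are all assumptions chosen only to show the shape.

```go
package main

import (
	"database/sql"
	"time"
)

// poolConfig groups the pool limits. The type is hypothetical.
type poolConfig struct {
	MaxOpenConns    int
	MaxIdleConns    int
	ConnMaxIdleTime time.Duration
	ConnMaxLifetime time.Duration
}

// shortLivedPool returns illustrative values (assumptions, not measured):
// a couple of idle connections are kept for bursts, but none stay open long.
func shortLivedPool() poolConfig {
	return poolConfig{
		MaxOpenConns:    8,
		MaxIdleConns:    2,
		ConnMaxIdleTime: 5 * time.Second,
		ConnMaxLifetime: time.Minute,
	}
}

// applyPool pushes the limits onto an already opened *sql.DB.
func applyPool(db *sql.DB, cfg poolConfig) {
	db.SetMaxOpenConns(cfg.MaxOpenConns)
	db.SetMaxIdleConns(cfg.MaxIdleConns)       // n <= 0 retains no idle connections
	db.SetConnMaxIdleTime(cfg.ConnMaxIdleTime) // d <= 0 disables the idle timeout
	db.SetConnMaxLifetime(cfg.ConnMaxLifetime) // d <= 0 lets connections live forever
}
```

The trade-off is that shorter idle times mean more connection opens under bursty load, which is the performance implication that still needs measuring.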

dsuhinin commented 6 months ago

> perhaps starting the server goroutine with a timeout context, which makes it periodically exit -- then do the truncate and restart (in a loop)?

ohh, it sounds like a NodeJS application :)

suprjinx commented 2 months ago

@jgiannuzzi should we PR your branch?