sirupsen opened this issue 2 years ago
hey @sirupsen! 👋 It's been a long time indeed! Still at Shopify?
Restore performance is something that needs improvement, but I'm surprised it's 10s for such a small workload. You can change the `retention` and `snapshot-interval` configuration fields to create a snapshot more frequently. If you don't care about keeping historic data, you can just set `retention` to `1h` and it'll just keep one snapshot that's an hour old at most.
If you do want to retain data longer, then you could set `retention` to `24h` (which is the default anyway) and the `snapshot-interval` to `1h`. That'll make a new snapshot every hour but only keep them for a rolling 24 hours. Here's the docs for those settings: https://litestream.io/reference/config/#replica-settings
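For concreteness, here's a minimal sketch of those replica settings in a `litestream.yml` (the database path and bucket URL are hypothetical; the field names come from the config reference linked above):

```yaml
dbs:
  - path: /data/app.db                # hypothetical database path
    replicas:
      - url: s3://my-bucket/app.db    # hypothetical replica URL
        retention: 24h                # keep snapshots/WAL for a rolling 24 hours
        snapshot-interval: 1h         # take a fresh snapshot every hour
```

With a shorter `snapshot-interval`, a restore only needs to replay up to an hour's worth of WAL on top of the latest snapshot instead of a whole day's.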
Another option you could try is setting the `-parallel N` flag on the `litestream restore` command. If you set that to something high like `64` then it should speed up the downloads at least.
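As a sketch of the full invocation (output path and bucket URL are hypothetical; in released builds the flag is spelled `-parallelism`, which is what worked in practice here):

```shell
# Restore the database from its replica, downloading WAL segments
# with up to 64 concurrent workers. Paths/URL are placeholders.
litestream restore -parallelism 64 -o /var/lib/app.db s3://my-bucket/app.db
```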
Finally, there are plans for maintaining a hot backup so you can restore instantly but that's still a few months away. I'm also working on a version that works in a serverless environment but that's probably going to be ready later in the year.
I stopped working at Shopify mid last year, doing infra consulting now :)
Thank youuuu! `snapshot-interval` is exactly what I needed... I somehow missed that in the docs. I can add a section to the docs in the Tips & Caveats section if you'd accept it? That said, `-parallelism` did help a lot to speed up getting ~24h worth of WALs.
FWIW I'm using Cloud Run, so for me, it's already working in serverless 😉
I'm stoked for hot standbys, and maybe one day the ability to 'merge' instances would be cool too.
> I can add a section to the docs in the Tips & Caveats section if you'd accept it?
Yes! That'd be awesome. Thanks, Simon.
> FWIW I'm using Cloud Run, so for me, it's already working in serverless
Cool. I saw some folks talking about getting Litestream running on Cloud Run but I haven't had a chance to give it a go yet.
The idea of "serverless SQLite" that I'm thinking of is paging in data on-demand in a way that's transactionally safe. That way it'd give you zero startup time but also low-latency queries once data is hot on a serverless instance. I'm still toying around with the idea but I think it might have some legs.
@sirupsen I saw your comment in https://github.com/benbjohnson/litestream/discussions/223#discussioncomment-1977888 but I'm moving the discussion back over to this ticket.
I am seeing it being stuck on restore too. Is there a good debugging step I can take? 👀
Can you hit `CTRL-\` to issue a `SIGQUIT` when it gets stuck for a bit? That should dump out a stack trace that'll tell us what it's stuck on.
@benbjohnson Sorry, I might have misused the word 'stuck': it doesn't get stuck in a loop inside Litestream, just stuck in a loop trying to boot the container by restoring with Litestream. `litestream restore` just exits after a few seconds with the same error as @pfw:

```
cannot find max wal index for restore: missing initial wal segment: generation=4f2abd0f421cf473 index=00001c73 offset=1080
```
This is the stacktrace I get from sending `SIGQUIT` just before it exits. I got a few stacktraces by sending `SIGQUIT` before it terminates, and they all look like that.
I have nothing sensitive in this database, so I've DM'ed you a zip of the `generations` directory on the Litestream Slack 👍🏻
@sirupsen The `missing initial wal segment` issue from @pfw turned out to be two applications replicating into the same bucket, and their retention enforcement was deleting each other's WAL segments: https://github.com/benbjohnson/litestream/issues/224#issuecomment-881794809
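One way to rule out that failure mode is to give every application its own replica path so their retention enforcement can never touch each other's segments. A hypothetical sketch (bucket and paths are placeholders):

```yaml
# app-a's litestream.yml — note the unique prefix inside the shared bucket
dbs:
  - path: /data/app.db
    replicas:
      - url: s3://my-bucket/app-a/db   # app-b would use s3://my-bucket/app-b/db
```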
I think the issue might be that GCR doesn't enforce a single instance at a time and there could be overlap—especially when deploying—that's causing issues. I think GCR isn't going to work well until I can get better support for serverless in Litestream.
I'm not sure if you're committed to GCR but another good alternative is fly.io. If you attach a persistent disk on their instances then they enforce a single instance at a time.
Fair enough... I will consider migrating to fly.io. 👍🏻
How do I fix this error though even when nothing is running on GCP? To recover my dear database
> How do I fix this error though even when nothing is running on GCP? To recover my dear database
Unfortunately, with the missing initial WAL segment the best you can do is recover from the last snapshot:

```sh
# Copy out the last snapshot.
cp generations/f6d6d1e96d38dafb/snapshots/00000093.snapshot.lz4 db.lz4

# Uncompress it (note the -d flag to decompress).
lz4 -d db.lz4 db

# Verify the database integrity.
sqlite3 db
sqlite> PRAGMA integrity_check;
ok
```
I don't deserve you, thank you :)
Thanks for going on this debugging journey with me, @sirupsen! The doc updates are incredibly helpful. 🎉
@sirupsen do you still have a test case you could throw at #416? I'm curious if you just hit an ordering bottleneck when retrieving WAL segments.
@hifi it's very fast for me these days despite the database being far larger. With improper snapshot intervals and retention you could probably still make it slow, though!
Can this be closed?
Hey @benbjohnson, long time no see!! Thank you for working on Litestream! 🙏🏻
I, too, love SQLite. I wanted to track a few events on my website, e.g. what people search for, and saw this as an opportunity to use Litestream. Loved the idea of tracking events in SQLite and just doing analysis on a local copy.
However, even though my db is only ~100kb on disk and ~1000 rows over a few days, it takes ~10 seconds to restore with `litestream restore`, and this is going up fast. Is there a plan for a `litestream compress` or similar to avoid replaying the WAL from early on, similar to what databases do when the WAL gets big enough? Or am I doing something wrong? Unfortunately this will be a bit of a deal-breaker for me using this in production :(