sirupsen opened this issue 2 years ago
hey @sirupsen! 👋 It's been a long time indeed! Still at Shopify?
Restore performance is something that needs improvement, but I'm surprised it's 10s for such a small workload. You can change the `retention` and `snapshot-interval` configuration fields to create a snapshot more frequently. If you don't care about keeping historic data, you can just set `retention` to `1h` and it'll just keep one snapshot that's an hour old at most.
If you do want to retain data longer, then you could set `retention` to `24h` (which is the default anyway) and the `snapshot-interval` to `1h`. That'll make a new snapshot every hour but only keep them for a rolling 24 hours. Here's the docs for those settings: https://litestream.io/reference/config/#replica-settings
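For concreteness, here's a minimal sketch of those replica settings in a `litestream.yml` (the database path and bucket URL are hypothetical; the field names come from the config reference linked above):

```yaml
dbs:
  - path: /data/app.db                # hypothetical database path
    replicas:
      - url: s3://my-bucket/app.db    # hypothetical replica URL
        retention: 24h                # keep snapshots/WAL for a rolling 24 hours
        snapshot-interval: 1h         # take a fresh snapshot every hour
```

With a shorter `snapshot-interval`, a restore only needs to replay up to an hour's worth of WAL on top of the latest snapshot instead of a whole day's.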
Another option you could try is setting the `-parallel N` flag on the `litestream restore` command. If you set that to something high like `64` then it should speed up the downloads at least.
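As a sketch of the full invocation (output path and bucket URL are hypothetical; in released builds the flag is spelled `-parallelism`, which is what worked in practice here):

```shell
# Restore the database from its replica, downloading WAL segments
# with up to 64 concurrent workers. Paths/URL are placeholders.
litestream restore -parallelism 64 -o /var/lib/app.db s3://my-bucket/app.db
```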
Finally, there are plans for maintaining a hot backup so you can restore instantly but that's still a few months away. I'm also working on a version that works in a serverless environment but that's probably going to be ready later in the year.
I stopped working at Shopify mid last year, doing infra consulting now :)
Thank youuuu! `snapshot-interval` is exactly what I needed... I somehow missed that in the docs. I can add a section to the docs in the Tips & Caveats section if you'd accept it? That said, `-parallelism` did help a lot to speed up getting ~24h worth of WALs.
FWIW I'm using Cloud Run, so for me, it's already working in serverless 😉
I'm stoked for hot standbys, and maybe one day the ability to 'merge' instances would be cool too.
> I can add a section to the docs in the Tips & Caveats section if you'd accept it?
Yes! That'd be awesome. Thanks, Simon.
> FWIW I'm using Cloud Run, so for me, it's already working in serverless
Cool. I saw some folks talking about getting Litestream running on Cloud Run but I haven't had a chance to give it a go yet.
The idea of "serverless SQLite" that I'm thinking of is paging in data on-demand in a way that's transactionally safe. That way it'd give you zero startup time but also low-latency queries once data is hot on a serverless instance. I'm still toying around with the idea but I think it might have some legs.
@sirupsen I saw your comment in https://github.com/benbjohnson/litestream/discussions/223#discussioncomment-1977888 but I'm moving the discussion back over to this ticket.
I am seeing it being stuck on restore too. Is there a good debugging step I can take? 👀
Can you hit `CTRL-\` to issue a `SIGQUIT` when it gets stuck for a bit? That should dump out a stack trace that'll tell us what it's stuck on.
@benbjohnson Sorry, I might have misused the word 'stuck': it doesn't get stuck in a loop inside Litestream, just stuck in a loop trying to boot the container by restoring with Litestream. `litestream restore` just exits after a few seconds with the same error as @pfw:

```
cannot find max wal index for restore: missing initial wal segment: generation=4f2abd0f421cf473 index=00001c73 offset=1080
```
This is the stacktrace I get from sending `SIGQUIT` just before it exits. I got a few stacktraces by sending `SIGQUIT` before it terminates, and they all look like that.
I have nothing sensitive in this database, so I've DM'ed you a zip of the `generations` directory on the Litestream Slack 👍🏻
@sirupsen The `missing initial wal segment` issue from @pfw turned out to be two applications replicating into the same bucket, and their retention enforcement was deleting each other's WAL segments: https://github.com/benbjohnson/litestream/issues/224#issuecomment-881794809
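One way to rule out that failure mode is to give every application its own replica path so their retention enforcement can never touch each other's segments. A hypothetical sketch (bucket and paths are placeholders):

```yaml
# app-a's litestream.yml — note the unique prefix inside the shared bucket
dbs:
  - path: /data/app.db
    replicas:
      - url: s3://my-bucket/app-a/db   # app-b would use s3://my-bucket/app-b/db
```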
I think the issue might be that GCR doesn't enforce a single instance at a time and there could be overlap—especially when deploying—that's causing issues. I think GCR isn't going to work well until I can get better support for serverless in Litestream.
I'm not sure if you're committed to GCR but another good alternative is fly.io. If you attach a persistent disk on their instances then they enforce a single instance at a time.
Fair enough... I will consider migrating to fly.io. 👍🏻
How do I fix this error though even when nothing is running on GCP? To recover my dear database
> How do I fix this error though even when nothing is running on GCP? To recover my dear database
Unfortunately, with the missing initial WAL segment the best you can do is recover from the last snapshot:

```sh
# Copy out the last snapshot.
cp generations/f6d6d1e96d38dafb/snapshots/00000093.snapshot.lz4 db.lz4

# Uncompress it (note the -d flag to decompress).
lz4 -d db.lz4 db

# Verify the database integrity.
sqlite3 db
sqlite> PRAGMA integrity_check;
ok
```
I don't deserve you, thank you :)
Thanks for going on this debugging journey with me, @sirupsen! The doc updates are incredibly helpful. 🎉
@sirupsen do you still have a test case you could throw at #416? I'm curious if you just hit an ordering bottleneck when retrieving WAL segments.
@hifi it's very fast for me these days despite the database being far larger. With improper snapshot intervals and retention you could probably still make it slow, though!
Can this be closed?
Hey @benbjohnson, long time no see!! Thank you for working on Litestream! 🙏🏻
I, too, love SQLite. I wanted to track a few events on my website, e.g. what people search for, and saw this as an opportunity to use Litestream. Loved the idea of tracking events in SQLite and just doing analysis on a local copy.
However, even though my db is only ~100kb on disk and ~1000 rows over a few days, it takes ~10 seconds to restore with `litestream restore`, and this is going up fast. Is there a plan for a `litestream compress` or similar to avoid replaying the WAL from early on, similar to what databases do when the WAL gets big enough? Or am I doing something wrong? Unfortunately this will be a bit of a deal-breaker for me using this in production :(