benbjohnson / litestream

Streaming replication for SQLite.
https://litestream.io

Restored database fails integrity checks #582

kalafut opened this issue 7 months ago (Open)

kalafut commented 7 months ago

I'm using version v0.3.13 in a mostly vanilla fashion: single server replicating to S3. The only modifications have been to the retention setup:

retention: 168h
snapshot-interval: 24h
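
For reference, those two settings are replica-level overrides in litestream.yml; a minimal config along those lines would look roughly like this (the database path and bucket name here are placeholders, not my actual values):

dbs:
  - path: /var/lib/app/scorekeeper.db
    replicas:
      - url: s3://my-bucket/sync_server
        retention: 168h
        snapshot-interval: 24h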

Things had been working fine, including test restores. Even now, replication is proceeding as normal. But about a week ago, I did a test restore and saw some database integrity errors. When I ran pragma integrity_check, I saw output like:

*** in database main ***
Tree 88 page 88 cell 46: Rowid 5452 out of order
Tree 88 page 35: btreeInitPage() returns error code 11
Tree 88 page 88 cell 43: Rowid 5436 out of order
Tree 88 page 88 cell 39: Rowid 5402 out of order
Tree 88 page 88 cell 38: Rowid 5393 out of order
Tree 88 page 88 cell 37: Rowid 5384 out of order
...
Tree 88 page 88 cell 7: Rowid 5233 out of order
Tree 88 page 91 cell 1: 2nd reference to page 35
row 22 missing from index idx_objects_user_id_modified_at
row 34 missing from index sqlite_autoindex_objects_1
row 35 missing from index sqlite_autoindex_objects_1
row 36 missing from index sqlite_autoindex_objects_1
...
row 49 missing from index idx_objects_user_id_modified_at
row 49 missing from index sqlite_autoindex_objects_1
row 50 missing from index idx_objects_user_id_modified_at
row 50 missing from index sqlite_autoindex_objects_1
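
For reference, the output above is from opening the restored copy directly and running the check, e.g.:

sqlite3 scorekeeper.db "PRAGMA integrity_check;"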

I restarted the Litestream service and just let things run, thinking maybe it was a glitchy WAL that would get flushed out (still not great, but...). A week later, a freshly restored file still exhibits the same issues. When I SSH into the server and run pragma integrity_check on the actual database, it always returns OK.

I'm sure if I just wiped/vacuumed/restarted everything, things would probably run fine again, but that wouldn't answer any questions. Are there any diagnostics you'd like me to run on this bad state?

benbjohnson commented 7 months ago

Hmm, that's not good. Debugging failing integrity checks can be complicated, and there's no built-in debugging tooling in Litestream. Do you know if you're using any extensions with your SQLite database? It shouldn't matter, but I figured I'd ask to double-check.

kalafut commented 7 months ago

No extensions are used.

What is interesting is that this occurs with an hours-old snapshot and just a few WALs:

➜  /tmp litestream restore -v -o scorekeeper.db s3://xxxxxxxxxxxxxxxx/sync_server
2024/04/22 13:54:48.459675 s3: restoring snapshot 0734a376671bf473/00000003 to scorekeeper.db.tmp
2024/04/22 13:54:48.550590 s3: restoring wal files: generation=0734a376671bf473 index=[00000003,00000004]
2024/04/22 13:54:48.655964 s3: downloaded wal 0734a376671bf473/00000004 elapsed=104.373292ms
2024/04/22 13:54:50.779728 s3: downloaded wal 0734a376671bf473/00000003 elapsed=2.228415666s
2024/04/22 13:54:50.792547 s3: applied wal 0734a376671bf473/00000003 elapsed=12.437458ms
2024/04/22 13:54:50.793711 s3: applied wal 0734a376671bf473/00000004 elapsed=1.15025ms
2024/04/22 13:54:50.793721 s3: renaming database from temporary location
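
To see exactly which snapshot and WAL segments the replica holds (assuming the snapshots/wal subcommands accept a replica URL the same way restore does), I can also list them with:

litestream snapshots s3://xxxxxxxxxxxxxxxx/sync_server
litestream wal s3://xxxxxxxxxxxxxxxx/sync_server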

I downloaded the snapshot from S3 and decompressed it (🎉 stock lz4 worked). Integrity checks passed fine. I don't know how to do anything with the WAL files, though. I'm happy to ship any of these files to a team member for dissection.
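
Since the snapshot itself checks out, one thing I can try (assuming restore's -generation and -index flags in v0.3.13 behave as documented) is restoring up to each WAL index in turn and checking where the corruption first appears:

# restore through WAL index 00000003 only, then check it
litestream restore -v -o check3.db -generation 0734a376671bf473 -index 3 s3://xxxxxxxxxxxxxxxx/sync_server
sqlite3 check3.db "PRAGMA integrity_check;"

# then the same through index 00000004; whichever step first fails points at the bad WAL segment
litestream restore -v -o check4.db -generation 0734a376671bf473 -index 4 s3://xxxxxxxxxxxxxxxx/sync_server
sqlite3 check4.db "PRAGMA integrity_check;"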