benbjohnson / litestream

Streaming replication for SQLite.
https://litestream.io
Apache License 2.0
10.69k stars 246 forks

Read-only replicas from cold storage #357

Open benbjohnson opened 2 years ago

benbjohnson commented 2 years ago

Litestream v0.4.0 adds low-latency, live read replication via an HTTP endpoint. However, it could also be useful for users to have higher-latency replication via cold storage (e.g. S3) when a direct HTTP endpoint is not feasible. This approach would require replicas to create a polling litestream.StreamClient that periodically checks both for new WAL files in the current generation and for new generations.
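
The polling loop could be sketched roughly as below. This is only an illustration: `litestream.StreamClient` is mentioned above, but the `WALSource` interface, the method names, and the in-memory backend here are all hypothetical, not the real Litestream API.

```go
package main

import "sort"

// WALSource abstracts a cold-storage backend such as an S3 bucket.
// Hypothetical interface; the real litestream.StreamClient API may differ.
type WALSource interface {
	Generations() ([]string, error)               // LIST generations
	WALFiles(generation string) ([]string, error) // LIST WAL files in one generation
}

// Poller remembers which WAL files the replica has already applied so
// that each poll returns only files it has not seen before.
type Poller struct {
	src  WALSource
	seen map[string]bool
}

func NewPoller(src WALSource) *Poller {
	return &Poller{src: src, seen: map[string]bool{}}
}

// Poll lists all generations and returns newly discovered WAL files.
func (p *Poller) Poll() ([]string, error) {
	gens, err := p.src.Generations()
	if err != nil {
		return nil, err
	}
	var fresh []string
	for _, g := range gens {
		files, err := p.src.WALFiles(g)
		if err != nil {
			return nil, err
		}
		for _, f := range files {
			key := g + "/" + f
			if !p.seen[key] {
				p.seen[key] = true
				fresh = append(fresh, key)
			}
		}
	}
	return fresh, nil
}

// memSource is a tiny in-memory stand-in for a real bucket, used here
// only so the sketch is self-contained.
type memSource struct{ gens map[string][]string }

func (m memSource) Generations() ([]string, error) {
	var out []string
	for g := range m.gens {
		out = append(out, g)
	}
	sort.Strings(out)
	return out, nil
}

func (m memSource) WALFiles(g string) ([]string, error) { return m.gens[g], nil }
```

A replica would run `Poll` on a timer and apply whatever comes back; each call issues the LIST requests whose cost is discussed below.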

This approach can be more expensive as it requires frequent API calls. S3, for example, charges $0.000005 per LIST request and $0.000004 per GET request. It also charges $0.09 per GB of egress. If a replica polls every 10 seconds with a LIST request, this comes out to $1.30 per replica per month, plus whatever GET and bandwidth costs are incurred.
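
As a quick sanity check on the $1.30 figure (assuming a 30-day month; GET and egress charges are workload-dependent and excluded):

```go
package main

// monthlyListCost estimates the monthly cost of polling cold storage
// with LIST requests, using the per-request price quoted above.
func monthlyListCost(pollIntervalSec, pricePerList float64) float64 {
	const secondsPerMonth = 30 * 24 * 60 * 60 // 30-day month
	requests := secondsPerMonth / pollIntervalSec
	return requests * pricePerList
}
```

Polling every 10 seconds is 259,200 LIST requests per month; at $0.000005 each that is about $1.30.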

anacrolix commented 2 years ago

My use case is a very large database (~40-60 GB) that is fine to sync as often as is reasonable, but it doesn't need to be real time. Being behind by 5 minutes to 2 hours would be fine. Are there constraints around having all the WAL files written since the last sync still be around?

benbjohnson commented 2 years ago

> Are there constraints around having all the WAL files written since the last sync still be around?

If you don't have all the WAL files since the last sync then you can't perform a recovery. Litestream could compact WAL files to remove duplicate pages if you're concerned about limiting the amount of storage needed.
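
The compaction idea boils down to: walk the frames in order and keep only the newest copy of each page. A simplified sketch follows; real SQLite WAL frames also carry headers, checksums, and commit markers that an actual compactor would have to preserve.

```go
package main

// frame is a simplified stand-in for a WAL frame: a page number plus
// its page data. The types here are illustrative only.
type frame struct {
	pgno uint32
	data []byte
}

// compact removes duplicate pages, keeping only the newest frame for
// each page while preserving the relative order of surviving frames.
func compact(frames []frame) []frame {
	last := make(map[uint32]int) // pgno -> index of the newest frame
	for i, f := range frames {
		last[f.pgno] = i
	}
	var out []frame
	for i, f := range frames {
		if last[f.pgno] == i {
			out = append(out, f)
		}
	}
	return out
}
```

For example, if page 1 is written twice, only its later frame survives, so a replica replaying the compacted WAL still ends up with the same final page contents.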

anacrolix commented 2 years ago

No, I think I'd be fine with just keeping WAL files alive for the longest expected sync lag.

benbjohnson commented 2 years ago

> No, I think I'd be fine with just keeping WAL files alive for the longest expected sync lag.

It does create an awkward failure case, though: a replica that is disconnected for longer than the max retention won't be able to re-sync until the database snapshots again. I'll have to think about how to structure that.

gedw99 commented 2 years ago

I'm also interested in this topology.

Would we also need an L4 proxy like Caddy to send the reads to the nearest read DB and the writes to the primary DB?

Also, I don't know if the Caddy L4 proxy supports clustering, but that's a different can of worms.

Lastly, I know the huge selling point of Litestream is only needing one server, but that proxy could also run a Go message broker so that S3 (MinIO) changes can be pushed to all Litestream read-only instances. I can also see that this adds an extra layer that is not in keeping with the "less is more" philosophy of Litestream.

Would be great to get feedback on this idea.
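
For what it's worth, the read/write split itself can be sketched in a few lines of Go's standard library rather than a dedicated L4 proxy. This is only an illustration of the routing rule (reads to a replica, mutations to the primary); the backend URLs are placeholders, and it isn't a recommendation over Caddy:

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// backendFor decides which backend serves a request: reads go to a
// nearby replica, anything that can mutate state goes to the primary.
func backendFor(method string) string {
	switch method {
	case http.MethodGet, http.MethodHead:
		return "replica"
	default:
		return "primary"
	}
}

// newSplitter wires that decision into two reverse proxies. The URLs
// are placeholders for wherever the primary and a replica live.
func newSplitter(primary, replica *url.URL) http.Handler {
	toPrimary := httputil.NewSingleHostReverseProxy(primary)
	toReplica := httputil.NewSingleHostReverseProxy(replica)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if backendFor(r.Method) == "replica" {
			toReplica.ServeHTTP(w, r)
		} else {
			toPrimary.ServeHTTP(w, r)
		}
	})
}
```

It could be served with something like `http.ListenAndServe(":8000", newSplitter(primaryURL, replicaURL))`. Note this only covers routing; the clustering and change-notification questions above are separate problems.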

ak2k commented 2 years ago

One caveat on the AWS cost figures above: to the extent Litestream is running on EC2 in the same region as the S3 bucket, no data transfer fees should apply.

> ... There is no Data Transfer charge for data transferred between Amazon EC2 (or any AWS service) and Amazon S3 within the same region ... (https://aws.amazon.com/s3/faqs/#Billing)

markuswustenberg commented 1 year ago

As many of you are probably already aware, adding Cloudflare R2 to the mix may change some of the cost calculations. For those who aren't, see https://www.cloudflare.com/en-gb/products/r2/ .

rupurt commented 1 year ago

This topology would be super useful for big data publishing and then serving it at the edge. i.e.