benbjohnson / litestream

Streaming replication for SQLite.
https://litestream.io
Apache License 2.0
11.03k stars, 252 forks

Google Cloud Run: cannot apply wal: disk I/O error: invalid argument #183

Closed magnuswahlstrand closed 2 years ago

magnuswahlstrand commented 3 years ago

Hi!

First of all, thank you for a great piece of software.

It might be a terrible idea to try to run Litestream and SQLite on Cloud Run, so feel free to close this issue 👍

Problem

I'm trying to get the litestream+s6 example working on Google Cloud Run (from https://github.com/benbjohnson/litestream-s6-example). If I run the container on Cloud Run, I get the error below:

s3: restoring snapshot 9e3296261a575b63/00000000 to /tmp/db.tmp
s3: restoring wal files: generation=9e3296261a575b63 index=[00000000,00000001]
s3: downloaded wal 9e3296261a575b63/00000001 elapsed=278.371883ms
s3: downloaded wal 9e3296261a575b63/00000000 elapsed=355.99923ms
cannot apply wal: disk I/O error: invalid argument
[cont-init.d] 00-litestream: exited 1.

If I run the container locally, it works well

...
s3: downloaded wal 9e3296261a575b63/00000000 elapsed=832.108ms
s3: applied wal 9e3296261a575b63/00000000 elapsed=11.0846ms
s3: applied wal 9e3296261a575b63/00000001 elapsed=10.6193ms
s3: renaming database from temporary location
[cont-init.d] 00-litestream: exited 0.

Any idea what might be the problem?

I'm guessing this might be due to Cloud Run's in-memory file system, but I don't know how to fix it (https://cloud.google.com/appengine/docs/standard/go/using-temp-files).

benbjohnson commented 3 years ago

I agree that running SQLite & Litestream on Cloud Run probably isn't going to be a good idea but I'd like to figure out the issue! :)

How large is your database file and each of the WAL files on S3? Can you unzip them and check the size too? Also, how much memory does your Cloud Run instance have? Does the issue happen if you use an instance with more memory?

magnuswahlstrand commented 3 years ago

FYI, I'm using Google Cloud Storage for storage. Though I guess it isn't the problem here, since Litestream is able to find and download the files and start the restore, just not finish it.

File size is very small. I'm using your test application (plus logging and some minor modifications for troubleshooting 😺). It is just one table with fewer than 100 rows.

DB file

> ls -lah pageviews.db
-rw-------  1 test  staff    16K May 10 08:34 pageviews.db

Litestream files (downloaded from gcs)

> du -h .
124K    ./generations/7787158ff4c84919/wal
4.0K    ./generations/7787158ff4c84919/snapshots
128K    ./generations/7787158ff4c84919
128K    ./generations
128K    .

I had 512 MB RAM, increased it to 1GB to test. Would be surprised if that is the problem here!

ngalaiko commented 3 years ago

Hi!

I tried to use the s6 setup to run Litestream on DigitalOcean App Platform and ran into the same issue.

After digging into it for a while, I think the core reason is that SQLite's WAL mode doesn't work with the filesystems that services like Cloud Run and App Platform use.

https://github.com/CGATOxford/CGATPipelines/issues/39 https://www.sqlite.org/faq.html#q5

SQLite uses reader/writer locks to control access to the database. (Under Win95/98/ME which lacks support for reader/writer locks, a probabilistic simulation is used instead.) But use caution: this locking mechanism might not work correctly if the database file is kept on an NFS filesystem. This is because fcntl() file locking is broken on many NFS implementations. You should avoid putting SQLite database files on NFS if multiple processes might try to access the file at the same time.

My guess is that embedding Litestream into the Go application and using PRAGMA locking_mode=EXCLUSIVE is a way to make it run on such filesystems, but I have yet to try that.
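
For anyone who wants to experiment with the locking_mode idea above, here is a minimal sketch in Python (chosen only for brevity; the thread's applications are in Go). It sets exclusive locking before switching to WAL, all within a single process. The function name and file name are made up for illustration, and whether this actually avoids the error on these platforms is untested here, same as the guess above. Note that with an exclusive lock held, a separate Litestream process would not be able to read the database, which is presumably why embedding was suggested.

```python
import os
import sqlite3
import tempfile


def open_exclusive_wal(path):
    """Open a SQLite database in WAL mode with exclusive locking.

    Hypothetical helper for illustration; not part of Litestream.
    """
    # isolation_level=None keeps the sqlite3 module from opening implicit
    # transactions, so the PRAGMAs below run outside any transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    # Request exclusive locking before the first WAL access.
    locking = conn.execute("PRAGMA locking_mode=EXCLUSIVE").fetchone()[0]
    journal = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    return conn, locking, journal


if __name__ == "__main__":
    db_path = os.path.join(tempfile.mkdtemp(), "pageviews.db")
    conn, locking, journal = open_exclusive_wal(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pageviews (id INTEGER PRIMARY KEY, path TEXT)"
    )
    conn.execute("INSERT INTO pageviews (path) VALUES (?)", ("/",))
    # Report the modes SQLite actually ended up in.
    print("locking_mode:", locking, "journal_mode:", journal)
    conn.close()
```

Running this inside the affected container and checking the reported modes (and whether the insert succeeds) would be one way to test the guess.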

jonfriesen commented 3 years ago

Adding a bit of context with regard to DigitalOcean App Platform support. App Platform runs apps on top of gVisor with a virtual filesystem (VFS). Version 1 of gVisor's VFS does not support the F_GETLK fcntl command, which SQLite uses. The next version (VFS2) adds support for it, and App Platform hopes to upgrade soon, once some additional functionality is in place and some bugs are resolved.

reference gVisor issue: fcntl errors when trying to use F_GETLK #5113
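
A quick way to check whether a given filesystem accepts F_GETLK is to issue the call directly. The sketch below is a rough probe, not something from Litestream or gVisor: the function name is made up, and the struct flock layout is an assumption that holds for 64-bit Linux only.

```python
import fcntl
import os
import struct
import tempfile

# Assumed layout of `struct flock` on 64-bit Linux:
# l_type, l_whence, l_start, l_len, l_pid. Other platforms differ.
FLOCK_FORMAT = "hhqqi"


def supports_getlk(path):
    """Return True if fcntl(F_GETLK) succeeds on a file at `path`.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask whether a write lock on the whole file would be blocked.
        query = struct.pack(FLOCK_FORMAT, fcntl.F_WRLCK, os.SEEK_SET, 0, 0, 0)
        try:
            fcntl.fcntl(fd, fcntl.F_GETLK, query)
        except OSError:
            # The filesystem rejected the call (for example with EINVAL).
            return False
        return True
    finally:
        os.close(fd)


if __name__ == "__main__":
    probe_path = os.path.join(tempfile.gettempdir(), "getlk-probe.lock")
    print("F_GETLK supported:", supports_getlk(probe_path))
```

Pointing the probe at the directory where the database lives (here /tmp, as in the restore log earlier in the thread) should show whether the runtime rejects the call.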

tmc commented 3 years ago

For what it's worth I ran into this problem while trying to set up litestream + GCS + Cloud Run here: https://github.com/tmc/moderncrud/tree/litestream

ngalaiko commented 3 years ago

@jonfriesen cool! any way I can know that update has happened?

jonfriesen commented 3 years ago

Hi @ngalaiko

I'm keeping an eye on this change, so once it's available I'll post in this thread, but in case I get hit by a bus, you can also get updates here:

I'm really excited to get litestream running, I'd love to get automatic cloud native buildpack support for it on App Platform.

matti commented 2 years ago

same, google cloud run errors

matti commented 2 years ago

update: using the second gen execution environment (in preview) helped and it works in google cloud run!

magnuswahlstrand commented 2 years ago

@matti does indeed work with the gen2 environment for Cloud Run! Thanks for the heads up.

Cold starts are 5-6s, which is a bit nasty, but Google has promised it will be better by the end of the pre-GA period :-) 50ms for warm starts is pretty awesome though. My little app seems to be chugging along just fine: https://litestream-demo-quays3hgzq-ew.a.run.app !

I used the following commands to deploy my Cloud Run service (--execution-environment gen2 is the new addition).

PROJECT=$(gcloud config get-value project)
NAME=litestream-demo
TAG=gcr.io/$PROJECT/$NAME
gcloud builds submit --tag $TAG
gcloud beta run deploy $NAME --image $TAG \
            --platform=managed \
            --region=europe-west1 \
            --execution-environment gen2

@benbjohnson should I close this issue?

benbjohnson commented 2 years ago

Thanks to everyone for digging into this issue. Sounds like the second gen environment is working for folks so I'll close this out. 🎉

leighmcculloch commented 1 year ago

@jonfriesen Do you know if this is fixed in DigitalOcean's Apps now, or where to track progress on that? I'm seeing the same disk I/O error: invalid argument when using the WAL as well.

jonfriesen commented 1 year ago

Hi @leighmcculloch, unfortunately this is still not supported. I'm not sure when it will be, though I am pushing for it internally. It could be a while :(

AvidDabbler commented 9 months ago

just wanted to push this again @jonfriesen. I ran into this with the DigitalOcean App Platform

Edit: I'd recommend commenting on this issue if you still have problems https://www.digitalocean.com/community/questions/can-i-use-litestream-sqlite-replication-with-app-platform

jonfriesen commented 9 months ago

my sincerest apologies @AvidDabbler. I have since left DO and this is one of the pushes I couldn't get into production. Hopefully one day the team will be able to accomplish it. 😞

jonfriesen commented 4 months ago

@AvidDabbler @leighmcculloch I have great news, App Platform introduced a new runtime and Litestream is now supported. I tested it earlier today with the litestream-docker-example repo and it worked wonderfully.