benbjohnson / litestream

Streaming replication for SQLite.
https://litestream.io
Apache License 2.0

generation creation failed due to S3 upload multipart failed #447

Open kakarukeys opened 1 year ago

kakarukeys commented 1 year ago

When starting litestream, I saw the following messages in the log:

litestream v0.3.8
initialized db: /data/db.sqlite3
replicating to: name="s3" type="s3" bucket="xxx" path="lb-pipeline-prod/db.sqlite3" region="fra1" endpoint="https://fra1.digitaloceanspaces.com" sync-interval=1s
/data/db.sqlite3: init: cannot determine last wal position, clearing generation; primary wal header: EOF
/data/db.sqlite3: sync: new generation "c51b0ab65d5a9c1f", no generation exists

/data/db.sqlite3(s3): monitor error: MultipartUpload: upload multipart failed
        upload id: 2~AVDf9oLvoUjwWcYb5So7CZmoZnpUguF
caused by: TotalPartsExceeded: exceeded total allowed configured MaxUploadParts (10000). Adjust PartSize to fit in this limit
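
For scale: assuming litestream uses the aws-sdk-go defaults (5 MiB part size and the 10,000-part cap that the error above reports), the largest object a single multipart upload can carry is roughly

        5 MiB × 10,000 parts ≈ 48.8 GiB

so a snapshot larger than that will always hit TotalPartsExceeded unless PartSize is raised to at least fileSize / 10,000.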

litestream snapshots / litestream generations do not reveal anything under the new generation. Apparently creation of the new generation has failed.

Is there any config I could set to tune the multipart upload?

my config is:

access-key-id: xxx
secret-access-key: xxx

dbs:
  - path: /data/db.sqlite3
    replicas:
      - url: s3://xxx.fra1.digitaloceanspaces.com/lb-pipeline-prod/db.sqlite3
        retention: 1h
        retention-check-interval: 20m
kakarukeys commented 1 year ago

Maybe related: caused by: InvalidArgument: Part number must be an integer between 1 and 1000

benbjohnson commented 1 year ago

@kakarukeys There's currently no config option for this. @anacrolix created a PR for it a while back, but the change really should be exposed as a configuration option. I'm open to a PR if you want to add the config fields.
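
For context, a rough sketch of what such a field would end up controlling, assuming the S3 replica builds its uploads with aws-sdk-go's s3manager.Uploader (the names below are illustrative, not the actual replica code):

    package s3cfg

    import (
        "github.com/aws/aws-sdk-go/aws/session"
        "github.com/aws/aws-sdk-go/service/s3/s3manager"
    )

    // NewUploader builds an uploader whose part size comes from configuration
    // instead of the SDK default (5 MiB), so very large snapshots stay under
    // the 10,000-part multipart limit.
    func NewUploader(sess *session.Session, partSize int64) *s3manager.Uploader {
        return s3manager.NewUploader(sess, func(u *s3manager.Uploader) {
            if partSize >= s3manager.MinUploadPartSize {
                u.PartSize = partSize // bytes
            }
        })
    }

A config field would only need to thread the byte count through to that option.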

anacrolix commented 1 year ago

@kakarukeys https://github.com/benbjohnson/litestream/pull/284

kakarukeys commented 1 year ago

I'd love to. Let me see if I can follow the code and the previous PR. My Go skills have gotten very rusty.

FYI, another note: the above failure (OP) does not crash the container and does not raise any alarm. This, together with the advice here to set PRAGMA wal_autocheckpoint to 0, caused the WAL file to grow huge on my production server.
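
As a crude stopgap, something like the following watchdog would at least raise an alarm when the -wal file balloons (a sketch only; the path and threshold are made up):

    package main

    import (
        "log"
        "os"
    )

    // Exit non-zero if the SQLite -wal file exceeds a size threshold, since
    // litestream keeps running silently when replication fails.
    func main() {
        const walPath = "/data/db.sqlite3-wal" // hypothetical path
        const maxWALBytes = 10 << 30           // 10 GiB, arbitrary threshold

        fi, err := os.Stat(walPath)
        if err != nil {
            log.Fatalf("stat wal: %v", err)
        }
        if fi.Size() > maxWALBytes {
            log.Fatalf("wal is %d bytes (> %d); replication may be stuck", fi.Size(), maxWALBytes)
        }
    }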

hifi commented 1 year ago

@kakarukeys We have a downstream patch that prevents the WAL growing in some cases: https://github.com/beeper/litestream/commit/cb44be6d5418a227e88b26501f3d1e485ed7b317

Does that work for you? I've only seen the growth under some rare error conditions, and indeed got WALs that were gigabytes in size. We haven't upstreamed it yet, as we're running a patched 0.3.9 that conflicts with the current git head.

kakarukeys commented 1 year ago

It might work, but I won't bet on that, because I am operating SQLite at crazy scale: a 350GB+ file, with several heavy writers and frequent readers. Even after turning off litestream and re-enabling the default checkpointing, I sometimes see a 200GB WAL file.

I read somewhere that if there is never a moment when the db is not locked for reads or writes, SQLite gets no chance to checkpoint. I'm placing my hope on the upcoming wal2 changes in SQLite (though I think they might break litestream).
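
For reference, a minimal sketch of forcing a truncating checkpoint from a maintenance job (assuming the mattn/go-sqlite3 driver); it still needs a moment where no reader is pinning the WAL, which is exactly the hard part at this scale:

    package main

    import (
        "database/sql"
        "log"

        _ "github.com/mattn/go-sqlite3"
    )

    // Force a full checkpoint so the -wal file is reset to zero length.
    // This only succeeds if no connection holds a read transaction that
    // still references the WAL.
    func main() {
        db, err := sql.Open("sqlite3", "/data/db.sqlite3")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        var busy, walFrames, checkpointed int
        row := db.QueryRow("PRAGMA wal_checkpoint(TRUNCATE)")
        if err := row.Scan(&busy, &walFrames, &checkpointed); err != nil {
            log.Fatal(err)
        }
        log.Printf("busy=%d wal_frames=%d checkpointed=%d", busy, walFrames, checkpointed)
    }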