deis / postgres

A PostgreSQL database used by Deis Workflow.
https://deis.com
MIT License

crash looping on GKE: wal-e pipe capacity issue with newer kernels #154

Open tmc opened 7 years ago

tmc commented 7 years ago

This is apparently due to https://github.com/wal-e/wal-e/issues/270

example spew:

wal_e.retries WARNING  MSG: retrying after encountering exception
        DETAIL: Exception information dump:
        Traceback (most recent call last):
          File "/usr/local/lib/python2.7/dist-packages/wal_e/retries.py", line 62, in shim
            return f(*args, **kwargs)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/worker/gs/gs_worker.py", line 76, in fetch_partition
            with get_download_pipeline(PIPE, PIPE, self.decrypt) as pl:
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipeline.py", line 92, in __enter__
            self.stdin = pipebuf.NonBlockBufferedWriter(stdin)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 225, in __init__
            _setup_fd(self._fd)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 62, in _setup_fd
            set_buf_size(fd)
          File "/usr/local/lib/python2.7/dist-packages/wal_e/pipebuf.py", line 53, in set_buf_size
            fcntl.fcntl(fd, fcntl.F_SETPIPE_SZ, OS_PIPE_SZ)
        IOError: [Errno 1] Operation not permitted

        HINT: A better error message should be written to handle this exception.  Please report this output and, if possible, the situation under which it arises.
        STRUCTURED: time=2016-10-22T20:14:41.882253-00 pid=221

bacongobbler commented 7 years ago

Thanks for the report @tmc! What provider and k8s version did you use to deploy Workflow? That can help us nail down the issue so we can figure out a fix we could propose upstream and then bump wal-e to a release with that fix.

Did the fix in https://github.com/wal-e/wal-e/issues/270#issuecomment-240811659 work for you?

bacongobbler commented 7 years ago

It might be a neat experiment to try a slightly older kernel version with kubernetes and see if this issue still persists.

tmc commented 7 years ago

Yes, that fix worked. This is on GKE with k8s 1.4.

tmc commented 7 years ago

This is still present on k8s 1.4.6 on GKE.

bacongobbler commented 7 years ago

Unfortunately there is nothing we can do on our end to fix this other than to use the provided workaround or to fix it in wal-e and bump the installed version. If you can provide a patch that fixes this issue for you, please make a PR upstream and we can bump wal-e forwards to the fix once it's merged.

I haven't seen this issue in the wild on GKE with k8s 1.4+ so I don't have a reliable test case (or even the slightest idea how this issue crops up) to test a fix against. Until then I cannot help you.

fdr commented 7 years ago

Hey all, WAL-E maintainer here.

I will accept a patch that either uses a lower pipe size (one that works with the default limits and doesn't tank performance) or adds some adaptive code to deal with this new limit. I suspect the adaptive approaches may be more trouble than they're worth, but if someone can surprise me, that'd be great.
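
For illustration, the "lower pipe size" idea could be sketched roughly as follows. This is not WAL-E's code: the function name and the candidate sizes are hypothetical, and it is written for Python 3 rather than the Python 2.7 shown in the traceback. It assumes Linux, where F_SETPIPE_SZ and F_GETPIPE_SZ are the fcntl commands 1031 and 1032.

```python
import errno
import fcntl
import os

# Linux-only fcntl commands. Older Pythons do not export these names,
# so fall back to the raw values from <linux/fcntl.h>.
F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)

# Hypothetical candidate sizes, largest first; not taken from WAL-E.
CANDIDATE_SIZES = (1024 * 1024, 256 * 1024, 64 * 1024)


def set_pipe_size_with_fallback(fd, candidates=CANDIDATE_SIZES):
    """Try each candidate capacity in turn; return the capacity in effect.

    If the kernel refuses every candidate with EPERM, the pipe simply
    keeps whatever capacity it was created with.
    """
    for size in candidates:
        try:
            fcntl.fcntl(fd, F_SETPIPE_SZ, size)
            break
        except OSError as exc:
            if exc.errno != errno.EPERM:
                raise  # unrelated failure: do not hide it
    return fcntl.fcntl(fd, F_GETPIPE_SZ)


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    try:
        print("pipe capacity:", set_pipe_size_with_fallback(write_fd))
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

The point of the sketch is only that EPERM from F_SETPIPE_SZ is treated as "keep a smaller buffer" instead of a fatal error. Whether the smaller sizes hurt throughput would still need measuring.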

lgastako commented 7 years ago

I'm on k8s 1.5.2 on GKE. When I try the workaround from https://github.com/wal-e/wal-e/issues/270#issuecomment-240811659 I get:

root@deis-database-540367895-x6r6d:/# echo 0 > /proc/sys/fs/pipe-user-pages-soft
bash: /proc/sys/fs/pipe-user-pages-soft: Read-only file system

Am I doing something wrong?

lgastako commented 7 years ago

For anyone else that runs into the same problem I had, I was able to solve it by downloading the workflow chart, unpacking it and editing database-deployment.yml to add the annotation security.alpha.kubernetes.io/sysctls: fs.pipe-user-pages-soft=0 to the Deployment.

tmc commented 7 years ago

I ran into this again on an upgrade.

Bregor commented 7 years ago

The security.alpha.kubernetes.io annotation will not work on GKE, because all alpha features are disabled there.

fdr commented 7 years ago

I think one could write a patch with some arithmetic to solve this (by reading the value, generously estimating how many pipes will be created, and dividing).
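
A rough sketch of that arithmetic, for illustration only. The function names, the ceiling, and the pipe-count estimate are all hypothetical and not part of WAL-E. It assumes the limit is read from /proc/sys/fs/pipe-user-pages-soft (a page count, where 0 means no soft limit) and that the budget is rounded down to a power-of-two number of pages, since the kernel may round a requested capacity up.

```python
import os

SOFT_LIMIT_PATH = "/proc/sys/fs/pipe-user-pages-soft"


def read_soft_limit_pages(path=SOFT_LIMIT_PATH):
    """Return the per-user soft limit in pages, or None if unreadable."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def budget_pipe_size(soft_limit_pages, page_size, expected_pipes,
                     ceiling=1024 * 1024):
    """Divide the page budget across the expected pipes; return bytes."""
    if expected_pipes < 1:
        raise ValueError("expected_pipes must be at least 1")
    if soft_limit_pages == 0:
        return ceiling  # a value of 0 means no soft limit is applied
    pages = max(soft_limit_pages // expected_pipes, 1)
    # Round down to a power of two so a kernel round-up stays in budget.
    pages = 1 << (pages.bit_length() - 1)
    return min(pages * page_size, ceiling)


if __name__ == "__main__":
    limit = read_soft_limit_pages()
    if limit is not None:
        page_size = os.sysconf("SC_PAGE_SIZE")
        # 64 is a made-up, deliberately generous pipe-count estimate.
        print(budget_pipe_size(limit, page_size, expected_pipes=64))
```

With a limit of 16384 pages, 4096-byte pages, and an estimate of 1000 pipes, this yields 16 pages (65536 bytes) per pipe. The limit is accounted per user, not per process, so the estimate has to cover every process running as that user.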

On Wed, May 10, 2017 at 12:29 PM Travis Cline <notifications@github.com> wrote:

I ran into this again on an upgrade.
