GoogleCloudPlatform / cloud-sql-proxy

A utility for connecting securely to your Cloud SQL instances
Apache License 2.0

Feature Request: Perform a graceful shutdown upon SIGTERM #128

Closed purplexa closed 5 years ago

purplexa commented 6 years ago

The method documented in Connecting from Google Container Engine for using Cloud SQL Proxy from GKE is to run the proxy as a sidecar container in your pod alongside your application. For normal use, this works well, but as this StackOverflow question describes, there are problems when trying to do the same in a Kubernetes Job.

The main issue is that for a Job to finish, all the containers in its Pod need to finish and return 0, so when using Cloud SQL Proxy as a sidecar container, even after the application container finishes, the proxy container stays around and Kubernetes believes the Job is not finished.

What would be helpful is for there to be some way to tell Cloud SQL Proxy to shut itself down cleanly after the application has finished. Perhaps something like a configurable timeout where the proxy shuts down after having no active connections for a period of time, or even some way for the application to explicitly tell the proxy it has finished before the application itself shuts down.

AthenaShi commented 6 years ago

This can be done by sending the proxy a SIGTERM on shutdown.

For more information (not directly related): https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/86

purplexa commented 6 years ago

Sending a SIGTERM won't work because it causes Cloud SQL Proxy to exit 2, not exit 0, which Kubernetes will treat as a Job failure, not a success.
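One low-tech way around the exit-code mismatch is a small wrapper that remaps the proxy's SIGTERM exit status to 0. This is only a sketch, not part of the proxy: the function name `run_with_remap` is made up here, and the status value 2 is taken from the behavior described above.

```shell
#!/bin/sh
# Hypothetical wrapper: run the given command (standing in for the real
# /cloud_sql_proxy invocation) and remap the SIGTERM exit status (2) to 0
# so Kubernetes records the Job container as succeeded.
run_with_remap() {
  "$@"
  code=$?
  if [ "$code" -eq 2 ]; then
    code=0
  fi
  return "$code"
}
```

The obvious downside is that a genuine failure that also happens to exit with status 2 would be masked as success.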

EamonKeane commented 6 years ago

@thrnio having the same issue, did you find a workaround for this?

EamonKeane commented 6 years ago

possible solution here? https://stackoverflow.com/questions/41679364/kubernetes-stop-cloudsql-proxy-sidecar-container-in-multi-container-pod-job

One possible solution would be a separate cloudsql-proxy deployment with a matching service. You would then only need your migration container inside the job that connects to your proxy service.

This comes with some downsides:

- higher network latency, since pod-local MySQL communication is no longer possible
- a potential security issue if you expose the SQL port to your whole Kubernetes cluster

If you want to open cloudsql-proxy to the whole cluster, you have to replace tcp:3306 with tcp:0.0.0.0:3306 in the -instances parameter on the cloudsql-proxy.
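A minimal sketch of that standalone-proxy setup, assuming a v1 proxy image; the image tag, instance placeholder, and resource names are illustrative:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cloudsql-proxy
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cloudsql-proxy
  template:
    metadata:
      labels:
        app: cloudsql-proxy
    spec:
      containers:
        - name: cloudsql-proxy
          image: gcr.io/cloudsql-docker/gce-proxy:1.33.2  # illustrative tag
          command:
            - /cloud_sql_proxy
            # 0.0.0.0 makes the port reachable from other pods (see caveat above)
            - -instances=<PROJECT>:<REGION>:<INSTANCE>=tcp:0.0.0.0:3306
---
apiVersion: v1
kind: Service
metadata:
  name: cloudsql-proxy
spec:
  selector:
    app: cloudsql-proxy
  ports:
    - port: 3306
      targetPort: 3306
```

The Job's container would then connect to `cloudsql-proxy:3306` instead of `localhost`.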

purplexa commented 6 years ago

@EamonKeane not really. Right now I'm using a script that polls the individual container statuses, waits for the other container to succeed or fail, and then deletes the Job. That works well enough for most of my use cases, but not all, and it's far from ideal.
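That polling idea can be sketched as a small helper. `poll_until` is a hypothetical name; in the real script the probe would be a `kubectl` jsonpath query for the other container's terminated state, followed by a `kubectl delete job`.

```shell
#!/bin/sh
# Hypothetical helper: run a probe command every $1 seconds until it
# prints non-empty output, then echo that output. In the real workaround
# the probe would be something like:
#   kubectl get pod "$POD" -o jsonpath='{.status.containerStatuses[?(@.name=="app")].state.terminated.exitCode}'
# and a non-empty result would be followed by `kubectl delete job "$JOB"`.
poll_until() {
  interval=$1
  shift
  while :; do
    out=$("$@") || true
    if [ -n "$out" ]; then
      echo "$out"
      return 0
    fi
    sleep "$interval"
  done
}
```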

EamonKeane commented 6 years ago

Yeah, not great. What is Google up to here? Because egress IP is not controllable (I understand), sql-proxy seems to be the blessed way to connect, so it might be expected to be robust to all GKE use cases. The other SIGTERM issue mentioned above (https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/86) has been open for months. Databases tend to be an important part of an application...

AthenaShi commented 6 years ago

Thanks for the explanations of the use case here, I've created an internal enhancement ticket for this and it will be prioritized.

The new feature would be:

  1. Upon receiving SIGTERM (or some exit signal), the Cloud SQL Proxy would stop accepting new connections.
  2. The Cloud SQL Proxy would exit (returning 0) once there are no active connections.
  3. After 30s (by default), the Cloud SQL Proxy would close all active connections, exit, and return 0.

==============================================================================

Meanwhile, here's a workaround before we have this feature:

You can easily trap the sigterm signal the following way in your deployment:

command: ["/bin/bash", "-c", "trap 'sleep 30; exit 0' SIGTERM; /cloud_sql_proxy -dir=/cloudsql -instances=..."]

You can skip the "sleep 30" part if an immediate exit is needed. This was proposed in #86, which is why I linked it here.

AthenaShi commented 6 years ago

And the reason why #86 is still open is because we don't want the Cloud SQL Proxy to continue accepting new connections after SIGTERM.

lunemec commented 6 years ago

@AthenaShi the command you posted:

command: ["/bin/bash", "-c", "trap 'sleep 30; exit 0' SIGTERM; /cloud_sql_proxy -dir=/cloudsql -instances=..."]

is incorrect. There is no /bin/bash inside the container. You need to use /bin/sh:

command: ["/bin/sh", "-c", "/cloud_sql_proxy [options...]"]

eCeleritas commented 6 years ago

It is mentioned here that we can simply send SIGTERM to the proxy and have the proxy handle it gracefully. How can we actually trigger the SIGTERM on the proxy from the container that is doing the actual work for the Job?

eCeleritas commented 6 years ago

I was able to handle this case by implementing the following:

  1. Create a new docker image based on gcr.io/cloudsql-docker/gce-proxy
  2. Add a script in the image, called watch_job.sh, which has the ability to interrogate the k8s api and determine if the pod in which the custom container is running has at least one completed container (in my scenario, I only have two containers - the proxy and the one doing the actual work). This script continues to run until the other container is complete.
  3. In the k8s job configuration, I pass the command:

     command: ["/bin/sh"]
     args: ["-c", "/cloud_sql_proxy -instances=$(SQL_INSTANCE) -credential_file=/secrets/cloudsql/credentials.json & /watch_job.sh"]

     This way, the k8s job is actually waiting on the watch_job.sh script to exit rather than waiting on cloud_sql_proxy. No need to trap the SIGTERM or anything else.

Notes:

bpiselli commented 6 years ago

@eCeleritas Thanks for the ideas, seems great! Could you open-source your code?

kurtisvg commented 6 years ago

@AthenaShi Can you update this issue with the current status of the work?

hfwang commented 6 years ago

#178 adds a configurable wait after receiving SIGTERM. However, it's not incredibly clever: it will wait for the full configured period, since it doesn't know whether there are any open connections, and it doesn't start rejecting new connections either.

kilianc commented 6 years ago

since it doesn't know if there are any open connections, and it doesn't start rejecting new connections either

@hfwang is anyone working on this?

kilianc commented 6 years ago

The fact that this is marked as a feature request and not a bug is alarming. This is de facto a show-stopper for production use. The documentation should actually mention that this is not production ready. I haven't seen any workaround that supports a clean shutdown and guarantees consistent, reliable behavior.

Carrotman42 commented 6 years ago

In general I don't think this feature is necessary for generic use of the Cloud SQL Proxy - I believe this issue is mostly geared toward integration with Kubernetes.

The Proxy process is stateless; it can be killed (and restarted) if necessary and well-acting applications should handle it (e.g., retry when connections to the backend database break). It seems reasonable to me that the Proxy exits quickly with a non-zero status when it is sent a SIGTERM; any automation around the process (e.g. Kubernetes, systemd, upstart) could be programmed to handle this functionality; in this issue there are numerous workarounds for Kubernetes, each potentially tunable to every individual setup's requirements.

But maybe there should be some more common handling logic in the Proxy. Unfortunately, I can't tell from this thread that there is even a consensus about how this feature should work exactly, and how it is better than the current world in most cases.

@kilianc: you seem very interested in having this issue solved. Could you summarize exactly the interface and functionality you would like to see? Anyone else who has a concrete example and suggestion is welcome to chime in too. Once we reach a threshold of usefulness without over-fitting the known usecases we can move forward with a PR.

kilianc commented 6 years ago

@Carrotman42 I do understand your point of view, but a database connection layer that doesn't drain its connection pool is unheard of. I may be missing something; that's why I am so polarized on this.

Let me regroup my thoughts.

Current behavior

I guess I don't really see how this is the recommended way to deploy your app on k8s considering that some teams deploy multiple times a day and the goal is always 0 downtime.

I am open to any workaround that is SRE friendly and doesn't impact the end-user UX.

Expected behavior

Even always waiting for the max grace time is bad, because it makes all the pods in all my Postgres apps wait for the max grace time for no reason; it's just not optimal.

The experience provided by the proxy should be comparable to what you would have connecting straight to Postgres.

kilianc commented 6 years ago

I found a solid workaround 🚀💯🔥🦄

App Container:

command:
- /bin/sh
- -c
- node server.js & pid="$!"; trap "kill -s SIGTERM $pid; wait $pid; touch /opt/exit-signals/SIGTERM;" SIGTERM; wait $pid;
volumeMounts:
- mountPath: /opt/exit-signals
  name: exit-signals

Sidecar Container

lifecycle:
  preStop:
    exec:
      command:
      - /bin/sh
      - -c
      - until [ -f /opt/exit-signals/SIGTERM ]; do sleep 1; done;
volumeMounts:
- mountPath: /opt/exit-signals
  name: exit-signals

Deployment

volumes:
- name: exit-signals
  emptyDir: {}

Works perfectly.
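The handshake above can be dry-run locally in plain sh (the paths and messages here are illustrative): one side plays the sidecar's preStop loop, and a `touch` stands in for the app container's trap action.

```shell
#!/bin/sh
# Local dry run of the file-signal handshake.
# The background subshell plays the sidecar's preStop loop; the touch
# stands in for what the app's trap does after its process exits.
dir=$(mktemp -d)
( until [ -f "$dir/SIGTERM" ]; do sleep 1; done
  echo "sidecar unblocked" > "$dir/out" ) &
waiter=$!
touch "$dir/SIGTERM"   # app side: signal that the workload has finished
wait "$waiter"         # sidecar side has now seen the marker
cat "$dir/out"
```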

Resources

N.B. This is not a final fix but just a workaround. If you expose the sidecar in a service and something else connects to it, those transactions will be dropped at every re-deployment. Classic use case is ETL of your db somewhere else.

N.B.B. This approach also works with CronJobs with a slight modification.

josephtyler commented 5 years ago

@kilianc Is this workaround still working well for you?

laszlocph commented 5 years ago

We implemented it a week ago. The basic tests showed that the containers shut down in the right order. Have it running in production on a single app. No issues so far, but of course it's early to tell for sure.

kilianc commented 5 years ago

@josephtyler running in prod since Aug 15th. Works for us. We also use it for cron jobs.

retpolanne commented 5 years ago

https://github.com/GoogleCloudPlatform/cloudsql-proxy/pull/206 worked right for me; my only worry is what happens if the cloudsql-proxy receives a connection while it's dying.

acasademont commented 5 years ago

I encourage everyone to try the new Private IP connections (still in beta) and ditch the proxy altogether. Works great!


carlosfrancia commented 5 years ago

I tried to implement this https://github.com/GoogleCloudPlatform/cloudsql-proxy/issues/128#issuecomment-413444029 in a CronJob but it didn't work. It seems that "touch /opt/exit-signals/SIGTERM;" was never executed.

I used the following configuration:

App container

- name: app-container
  command:
    - /bin/sh
    - -c
    - node index.js & pid="$!"; trap "kill $pid; wait $pid; touch /opt/exit-signals/SIGTERM; echo CREATED" SIGTERM; wait $pid;
  volumeMounts:
    - mountPath: /opt/exit-signals
      name: exit-signals

Sidecar container

containers:
  - name: sidecar-container
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - until [ -f /opt/exit-signals/SIGTERM ]; do sleep 1; done;
    volumeMounts:
      - mountPath: /opt/exit-signals
        name: exit-signals

Cronjob volume configuration

jobTemplate:
  spec:
    template:
      spec:
        volumes:
          - name: exit-signals
            emptyDir: {}

Am I missing something? What slight modification do you mean?

hfwang commented 5 years ago

As an alternative: A flag was added in #206 so the proxy will wait up to some configurable period after receiving a sigint or sigterm for all connections to close before it terminates (and if so, can exit cleanly).

kilianc commented 5 years ago

@carlosfrancia it's probably a shell version mismatch. I am running on alpine-node

kilianc commented 5 years ago

@carlosfrancia make sure you're sending the correct signal kill -s SIGTERM $pid

giladsh1 commented 5 years ago

checkout this blog - How to terminate a side-car container in Kubernetes Job

taylorchu commented 1 year ago

We use an in-pod supervisor (https://github.com/taylorchu/wait-for) to solve the sidecar container issue. It is simple and can be used even if you don't use k8s.

huwcbjones commented 1 year ago

FYI for anyone coming here from google, --term_timeout has been replaced with --max-sigterm-delay in v2 https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/1742#issuecomment-1568886109

enocom commented 1 year ago

For a full list of flag changes see the migration guide.

same-id commented 9 months ago

--max-sigterm-delay does not solve this with current behavior.

  1. The container exits when the last connection is closed; however, a new connection might still be needed while the pod shuts down, so it should just sleep instead.
  2. quitquitquit could then be used to signal the process to actually shut down while it sleeps (since, if 1 is changed, you need to know when to stop sleeping).

Instead, we will be adding a preStop command that sleeps, and have the actual workload call the quitquitquit endpoint via httpGet (since there is no "postStop" hook).
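For anyone adopting this pattern, here is a hedged sketch of such a preStop arrangement with the v2 proxy. The --quitquitquit flag, the default admin port, and the sleep duration are assumptions; check the v2 flag reference before relying on them.

```yaml
# Sketch: the proxy sidecar sleeps in preStop while the app drains;
# the app (or an operator) POSTs to /quitquitquit on the admin port.
- name: cloud-sql-proxy
  image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
  args:
    - --quitquitquit            # assumed flag: enables the /quitquitquit endpoint
    - <INSTANCE_CONNECTION_NAME>
  lifecycle:
    preStop:
      exec:
        # assumed duration; needs a shell/sleep binary in the image
        command: ["sh", "-c", "sleep 10"]
```

The app would then end its own shutdown with something like `curl -X POST localhost:9091/quitquitquit` (9091 being the assumed default admin port).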

enocom commented 9 months ago

@same-id if there's a feature request in there, feel free to open a new issue. We're still working on smoothing out the k8s use cases.

same-id commented 9 months ago

@enocom, to be honest, I'm not sure; just saying that this feature does not work 100% with k8s.

What we did, is simply switch to the alpine flavor of the image so we can have sleep binary in the image and sleep in the preStop hook while the actual service is gracefully shutting down.

Then we make our service call quitquitquit.

This also is not too great.

I believe that "smoothing out the k8s use cases" will soon be fixed in k8s instead (KEP-753):

https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/753-sidecar-containers/README.md

Which should already be available in most major cloud providers (Kubernetes 1.28) behind a feature flag.

Fun fact, also sleep hook will soon be introduced, so "goodbye" distro-full images. (https://github.com/kubernetes/enhancements/issues/3960)

enocom commented 9 months ago

Really hoping that proper sidecar container support in GKE makes all this go away.

This also is not too great.

Is it because it's more manual setup toil? What makes this not great? Asking to see if there's something we can do better here. Maybe a stop command to match the start command here https://github.com/GoogleCloudPlatform/cloud-sql-proxy/issues/1117?

same-id commented 9 months ago

Since "postStop" hooks do not exist in k8s:

  1. All our services need to be coded with quitquitquit on shutdown; if you forget, rollouts will be slower, since they will wait for the cloud-sql-proxy graceful shutdown to expire.
  2. It does not make sense that if the "main" service restarts for any reason, cloud-sql-proxy will restart as well.
  3. Our code also waits for cloud-sql-proxy readiness before it starts, but that's another thing.

I just don't think there's any reason to pursue any hacky solution as long as KEP-753 hits the road in December (even if it's feature-gated).

This was mainly me saying that --max-sigterm-delay is less useful than a plain-ol' sleep for all purposes, and there's no need to add it to cloud-sql-proxy since it can just be used from the container image itself.

enocom commented 9 months ago

Thanks, @same-id. A lot of this stuff is hacking around the lack of KEP-753.

same-id commented 3 months ago

With Kubernetes v1.29.4, sidecar containers are now supported (the 1.29.3 minor release is required for a Jobs bugfix).

All hail

devauxbr commented 1 month ago

Thanks @same-id! I had the same problem and just discovered this GitHub issue 👀 For those who arrive here in the future: I can confirm that transitioning the Cloud SQL Proxy to a native sidecar container (i.e. putting it in the initContainers block and adding restartPolicy: Always) did the trick! ✅

The documentation does state that kube will wait for the "main" container to stop before sending the SIGTERM to the sidecar containers 🎉

devauxbr commented 1 month ago

I don't know if the GCP documentation is versioned in this repository, but I think this subtlety deserves to be documented there :)

enocom commented 1 month ago

Good point @devauxbr -- the Cloud SQL docs for Kubernetes could probably use a fresh revision.

cc @jackwotherspoon as FYI

enocom commented 1 month ago

Meanwhile, we could update our example here to make the new sidecar feature more obvious for folks. @devauxbr would you be interested in writing up an issue for what you ran into and how we might make life easier?

devauxbr commented 1 month ago

Thanks for your feedback @enocom ! About the problems I ran into, I am not sure if it deserves a new issue, as I had exactly the same use cases as both @purplexa and @kilianc :

The "fix" was to transition both my cronjobs and deployments from :

[...]
spec:
  containers:
    - name: main
      image: [my_app_image]
      [...]
    - name: cloud-sql-proxy
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]

To :

spec:
  containers:
    - name: main
      image: [my_app_image]
  initContainers:
    - name: cloud-sql-proxy
      restartPolicy: Always
      image: gcr.io/cloud-sql-connectors/cloud-sql-proxy:2.11.0
      args:
        - --private-ip
        - --auto-iam-authn
        - --exit-zero-on-sigterm
        - $(INSTANCE_NAME)
      env:
        - name: INSTANCE_NAME
          value: [my_cloudsql_instance_name]
        - name: CSQL_PROXY_HEALTH_CHECK
          value: "true"
        - name: CSQL_PROXY_HTTP_PORT
          value: "9801"
        - name: CSQL_PROXY_HTTP_ADDRESS
          value: 0.0.0.0
      startupProbe:
        failureThreshold: 60
        httpGet:
          path: /startup
          port: 9801
        periodSeconds: 1
        timeoutSeconds: 10

EDIT: the important change is moving the cloud-sql-proxy container into an initContainers block and adding restartPolicy: Always to make it a "native K8s sidecar" (c.f. the K8s announcement blog post).

Note that I also needed to add the startupProbe config block (and the related env variables) to make sure K8s would wait for the cloudsql-proxy to be ready to receive connections before starting my main app container.

Do you want me to create an issue for that? I think updating the docs with these pointers would be sufficient 🙏

enocom commented 4 weeks ago

Thanks @devauxbr for the super clear write-up. We'll get this added to the docs and I'll just open an issue with these details to track it (and make it easy for other folks to find).