GoogleCloudPlatform / reliable-task-scheduling-compute-engine-sample

Apache License 2.0
Listeners keep crashing, creating massive numbers of subscriptions on restarts #10

Open kir-titievsky opened 7 years ago

kir-titievsky commented 7 years ago

A GCP customer running this code in production reports that its listeners crash continuously. On every restart the code, specifically Executor.get_subscription(), deletes and re-creates subscriptions, so this happens at a very high rate. Could the authors of this code please offer any advice on determining the root cause of the listener crashes? Is there a way to discard old tasks without re-creating subscriptions (e.g. keep a T0 timestamp and discard any messages older than that)?
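To make the T0 idea concrete, here is a minimal sketch in Python. It is not taken from this repository: the `Message` class, `is_stale`, and `handle` are hypothetical names, and the sketch assumes each received message exposes a timezone-aware publish time. The listener would record T0 once at startup, keep its existing subscription, and skip anything published before T0.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Message:
    """Hypothetical stand-in for a received Pub/Sub message."""
    data: bytes
    publish_time: datetime  # assumed to be timezone-aware (UTC)


def is_stale(message: Message, t0: datetime) -> bool:
    """Return True if the message was published before the listener started."""
    return message.publish_time < t0


def handle(messages, t0, run_task):
    """Run tasks for fresh messages and skip stale ones.

    Returns a (ran, skipped) tuple of counts. Acknowledging messages is left
    to the caller; stale messages would normally be acknowledged too, so that
    they are not redelivered.
    """
    ran = skipped = 0
    for message in messages:
        if is_stale(message, t0):
            skipped += 1
        else:
            run_task(message)
            ran += 1
    return ran, skipped
```

In a real listener, T0 could be captured with `datetime.now(timezone.utc)` when the process starts. Whether this is a safe replacement for re-creating the subscription depends on how the sample is expected to treat tasks that were scheduled while the listener was down.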

Here's what I know of the environment: a single GCE instance, with the code run under supervisor.

More detail from the customer:

We originally wanted to create a single listener that could listen to all the topics and execute the corresponding Python executables. However, we were not able to do that: we noticed that if we create multiple executors in one listener class (a Python file), only the first Executor instance watches its topic.

For this reason, we ended up creating a separate listener (Python file) for each topic. Each listener is run via gunicorn on its own port using supervisor. Essentially, there are multiple processes running on different ports, one per topic.

We have noticed that these processes get killed and restarted multiple times a day. Because of this, our background jobs often do not run at the scheduled time. Following is a log snippet from supervisord.log [subscription names redacted]:
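One possible explanation for the single-listener behaviour, not confirmed anywhere in this thread, is that each executor's watch call blocks forever, so a second executor created after it in the same file never starts. If that is the cause, running one watch loop per topic in its own thread would avoid it. The sketch below is purely illustrative: `watch_topic`, `start_watchers`, `pull`, and `run_task` are hypothetical names and are not this sample's API.

```python
import threading


def watch_topic(topic, pull, run_task, stop):
    """Hypothetical blocking watch loop for a single topic.

    `pull(topic)` is assumed to return one message, or None if nothing
    arrived within its own timeout.
    """
    while not stop.is_set():
        message = pull(topic)
        if message is not None:
            run_task(topic, message)


def start_watchers(topics, pull, run_task):
    """Start one daemon thread per topic so no loop blocks the others."""
    stop = threading.Event()
    threads = [
        threading.Thread(
            target=watch_topic, args=(topic, pull, run_task, stop), daemon=True
        )
        for topic in topics
    ]
    for thread in threads:
        thread.start()
    return stop, threads
```

Calling `stop.set()` asks every loop to exit after its current `pull` returns. This keeps all topics in one process, which trades the isolation of separate supervisor programs for fewer processes to manage.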

2016-09-30 18:08:50,898 INFO stopped: xx-yy (terminated by SIGKILL)
2016-09-30 18:08:51,900 INFO waiting for xx-xx, xx-xx to die
2016-09-30 18:08:54,905 INFO waiting for xx-xx, xx-xx to die
2016-09-30 18:08:57,909 INFO waiting for xx-xx, yy-xx to die
2016-09-30 18:09:00,922 WARN killing 'yy-xx' (22136) with SIGKILL
2016-09-30 18:09:00,922 INFO waiting for xx-xx, yy-xx to die
2016-09-30 18:09:00,924 INFO stopped:xx (terminated by SIGKILL)
2016-09-30 18:09:03,928 INFO waiting for xxx to die
2016-09-30 18:09:06,930 INFO waiting for ck-xxx to die
2016-09-30 18:09:09,934 INFO waiting for ck-scheduleconnectiondeltaimport to die
2016-09-30 18:09:10,935 WARN killing 'xxx' (831) with SIGKILL
2016-09-30 18:09:10,937 INFO stopped: xxx (terminated by SIGKILL)

lucidsushi commented 5 years ago

@kir-titievsky Just curious, was this resolved for your client?