alihoseiny opened this issue 1 year ago
For extra context, we also have the following warning logs:
[2023-06-15 08:38:54,210] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.297 seconds
[2023-06-15 08:38:54,462] [1] [INFO] Timer Monitor.sampler woke up too late, with a drift of +0.336312010884285 runtime=4.301220178604126e-05 sleeptime=1.336312010884285
[2023-06-15 08:38:54,838] [1] [WARNING] Executing <Task pending name='Task-190' coro=<Agent._execute_actor() running at /usr/local/lib/python3.11/site-packages/faust/agents/agent.py:674> cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/site-packages/faust/agents/agent.py:664> took 0.102 seconds
[2023-06-15 08:38:55,500] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.281 seconds
[2023-06-15 08:38:55,768] [1] [INFO] Timer Monitor.sampler woke up too late, with a drift of +0.30657482892274857 runtime=4.564225673675537e-05 sleeptime=1.3065748289227486
[2023-06-15 08:38:56,789] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.280 seconds
[2023-06-15 08:38:58,079] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.283 seconds
[2023-06-15 08:38:59,384] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.296 seconds
[2023-06-15 08:38:59,629] [1] [INFO] Timer Monitor.sampler woke up too late, with a drift of +0.3187522292137146 runtime=4.420429468154907e-05 sleeptime=1.3187522292137146
[2023-06-15 08:39:00,682] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.290 seconds
[2023-06-15 08:39:00,933] [1] [INFO] Timer Monitor.sampler woke up too late, with a drift of +0.30384987592697144 runtime=5.805492401123047e-05 sleeptime=1.3038498759269714
[2023-06-15 08:39:01,969] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.277 seconds
[2023-06-15 08:39:03,264] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.289 seconds
[2023-06-15 08:39:04,560] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.287 seconds
[2023-06-15 08:39:05,851] [1] [WARNING] Executing <Task pending name='<coroutine object MethodQueueWorker._method_queue_do_work at 0x7f30346f8160>' coro=<Service._execute_task() running at /usr/local/lib/python3.11/site-packages/mode/services.py:843> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[Service._on_future_done()] created at /usr/local/lib/python3.11/asyncio/tasks.py:670> took 0.283 seconds
[2023-06-15 08:39:06,858] [1] [WARNING] [^--Consumer]: wait_empty: Waiting for tasks
[2023-06-15 08:55:59,973] [1] [INFO] Timer Monitor.sampler woke up too late, with a drift of +0.32391348481178284 runtime=4.1179358959198e-05 sleeptime=1.3239134848117828
[2023-06-15 08:56:01,013] [1] [WARNING] Executing <Task pending name='Task-763' coro=<AIOKafkaConnection._read() running at /usr/local/lib/python3.11/site-packages/aiokafka/conn.py:525> wait_for=<Future pending cb=[Task.task_wakeup()] created at /usr/local/lib/python3.11/asyncio/base_events.py:427> cb=[AIOKafkaConnection._on_read_task_error(<weakref at 0...x7f30230ac330>)()] created at /usr/local/lib/python3.11/site-packages/aiokafka/util.py:26> took 0.291 seconds
Likely a duplicate of https://github.com/faust-streaming/faust/issues/175 😦. If I figure out a solution, I'll immediately have a fix deployed.
@wbarnha I am not sure this is related, but we had an issue where our broker was unavailable for some time. The heartbeat then failed and the coordinator was marked as dead. After some time, when the broker came back to life, aiokafka started a new coordinator and should rejoin the group. Then all TopicPartitions are seeked to the last committed offset (although the Faust leader has no last committed offset, so it is seeked to 0). After that we see "Adding fetch request" logs and no errors, but nothing is consumed and the offset never advances. Our consumer is only about 60 messages behind, though. Do you have any idea?
Could the issue be that our group member is marked as dead, comes back to life, and is then never unmarked as dead, so Kafka does not rebalance and assign partitions to the consumer? We only have one consumer and the group is marked as empty for us, so a new join should do the trick, I guess...
Edit: It looks like at some point all topics except the
Edit 2: maybe we pause a partition somewhere and never resume it. I just can't find why, where, or when.
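A rough way to check this is a standalone diagnostic against the underlying aiokafka consumer (not part of Faust); the topic, broker, and group names below are placeholders, and joining the same group from a second process will itself trigger a rebalance, so treat it as a sketch only:

```python
# Standalone diagnostic: dump paused / committed / position / highwater
# state for every assigned partition. All names are placeholders.
import asyncio
from aiokafka import AIOKafkaConsumer

async def dump_consumer_state() -> None:
    consumer = AIOKafkaConsumer(
        "my-topic",                      # placeholder topic
        bootstrap_servers="kafka:9092",  # placeholder broker
        group_id="my-faust-app",         # placeholder group id
        enable_auto_commit=False,
    )
    await consumer.start()
    try:
        # Note: group assignment happens in the background, so it may take
        # a moment after start() before assignment() is non-empty, and
        # highwater() is only known after at least one fetch.
        paused = consumer.paused()  # partitions paused and not yet resumed
        for tp in consumer.assignment():
            committed = await consumer.committed(tp)
            position = await consumer.position(tp)
            highwater = consumer.highwater(tp)
            print(tp, "PAUSED" if tp in paused else "active",
                  "committed:", committed,
                  "position:", position,
                  "highwater:", highwater)
    finally:
        await consumer.stop()

asyncio.run(dump_consumer_state())
```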
Thank you for your thorough notes; I'll need to come back and review this. I wish there were more unit tests ensuring all of Faust's coordinators behave properly, so I could understand which parts of them are shaky.
Yes, sorry, I couldn't wrap my head around this part so far. Thanks for your work ☺️
What's also strange is that I have logs in the following order:
a fetch request for the topic with highwater and last_stable_offset 694, and after that a log saying records were added for this topic with offset 693.
Can this inconsistency stop consumption, since we are waiting for 694 but only get 693? After those logs there are no more fetch requests for this specific topic... no other logs except metadata requests after the above lines.
Found some discussions about this in the underlying aiokafka lib that might be related:
https://github.com/aio-libs/aiokafka/issues/625 https://github.com/aio-libs/aiokafka/issues/575 https://github.com/aio-libs/aiokafka/issues/727
Maybe we should check our on_partitions_revoked callback.
Within the aiokafka documentation there is this part:
_It is quite critical to provide ConsumerRebalanceListener if you need to control rebalance start and end moments. In that case set the rebalance_timeout_ms to the maximum time your application can spend waiting in the callback. If your callback waits for the last getmany() result to be processed, it is safe to set this value to max_poll_interval_ms, same as in Java client._
The corresponding Faust settings are broker_rebalance_timeout and broker_session_timeout, both 60 seconds by default.
On rebalance, Faust by default waits for the stream to be empty. If this takes longer than the broker rebalance timeout or session timeout, our consumer is considered dead (the stream buffer holds up to 4096 unprocessed items by default, so make sure you can process those within 60 seconds).
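For reference, a minimal sketch of where these knobs live in the app configuration (the app id and broker URL are placeholders, and the values shown are just the defaults, not recommendations):

```python
import faust

app = faust.App(
    "my-app",                       # placeholder app id
    broker="kafka://kafka:9092",    # placeholder broker URL
    broker_rebalance_timeout=60.0,  # default: 60 seconds
    broker_session_timeout=60.0,    # default: 60 seconds
    stream_buffer_maxsize=4096,     # default: 4096 buffered events
)
```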
So, anyone having this issue: could you check whether this is the source of the failure for you?
Actual behavior
After running the application for some days, the agents slowly die, and eventually no events are consumed from the topics. Unless I restart the Docker container of the Faust app, all agents stop consuming within a few days.
Full traceback
Sample code
Sample code of one of the consumers:
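(The original snippet is not reproduced here; as a hypothetical stand-in, a typical consumer in this setup is a plain Faust agent along these lines, with placeholder app, topic, and handler names:)

```python
import faust

app = faust.App("my-app", broker="kafka://kafka:9092")  # placeholder names

events_topic = app.topic("events")  # placeholder topic


@app.agent(events_topic)
async def process_events(stream):
    async for event in stream:
        # application-specific processing happens here
        print("processed", event)
```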
Versions