azkaban / azkaban

Azkaban workflow manager.
https://azkaban.github.io
Apache License 2.0
4.45k stars 1.59k forks source link

Scheduled jobs stuck in preparing state #2124

Open trentgerman opened 5 years ago

trentgerman commented 5 years ago

Hi All -

I'm seeing an issue with our production azkaban instance where all scheduled jobs get stuck in the preparing state:

image

I can't seem to find any reproducible pattern, it happens about once a week.

Bouncing the webserver solves the problem. This is what I can see from the logs. The first flow that was stuck had execution id 7091. I've included the entire log for context, its around 1672 lines. What I see again and again is this ->

2019-02-12 18:45:15 INFO  ExecutorManager:256 - Successfully refreshed executor: ip-10-99-47-200:12321 (id: 65) with executor info : ExecutorInfo{remainingMemoryPercent=76.61209040896966, remainingMemoryInMB=6116, remainingFlowCapacity=30, numberOfAssignedFlows=0, lastDispatchedTime=1549995435117, cpuUsage=0.0}
2019-02-12 18:45:15 INFO  ExecutorManager:1799 - Using dispatcher for execution id :7091
2019-02-12 18:45:15 INFO  ExecutorManager:1831 - Reached handleNoExecutorSelectedCase stage for exec 7091 with error count 0
2019-02-12 18:45:15 INFO  ExecutorManager:1799 - Using dispatcher for execution id :7091
2019-02-12 18:45:15 INFO  ExecutorManager:1831 - Reached handleNoExecutorSelectedCase stage for exec 7091 with error count 0

As the number of backed up jobs grow, it reaches handleNoExecutorSelectedCase more and more, once per executor.

We're using Azkaban version 3.39.0-12-gb44c688

azklogs-broken.txt

trentgerman commented 5 years ago

Oh! The error just happened again, starting with execution 7102. I see this in the logs -

331 2019-02-12 20:15:39 INFO  ExecuteFlowAction:230 - Invoked flow sqlserver.product_dataset_int
332 2019-02-12 20:15:39 INFO  Condition:122 - Done resetting checkers. The next check time will be 2019-02-13T14:15:00.000-06:00
333 2019-02-12 20:15:39 INFO  JdbcTriggerImpl:138 - Updating trigger 47 into db.
334 2019-02-12 20:15:39 INFO  ExecutorManager:256 - Successfully refreshed executor: ip-10-99-47-200:12321 (id: 65) with executor info : ExecutorInfo{remainingMemoryPercent=75.98592472703717, remainingMemoryInMB=6066, remainingFlowCapacity=28, numberOfAss    ignedFlows=2, lastDispatchedTime=1550001639500, cpuUsage=0.0}
335 2019-02-12 20:15:39 INFO  ExecutorManager:1799 - Using dispatcher for execution id :7102
336 2019-02-12 20:15:39 INFO  ExecutorManager:1831 - Reached handleNoExecutorSelectedCase stage for exec 7102 with error count 0
337 2019-02-12 20:15:39 INFO  ExecutorManager:1799 - Using dispatcher for execution id :7102
338 2019-02-12 20:15:39 INFO  ExecutorManager:1831 - Reached handleNoExecutorSelectedCase stage for exec 7102 with error count 0

Same error. Any ideas?

lcaaaat commented 5 years ago

It seems there is not enough resource(e.g. memory) to dispatch flow. Could show me the value of "azkaban.executorselector.filters" and "azkaban.executorselector.comparator.xxx" in conf/azkaban.properties?

trentgerman commented 5 years ago

Yes! They are here ->


$ find . -wholename '*conf/azkaban.properties' -exec grep -EHn "azkaban\.executorselector\.filters|azkaban\.executorselector\.comparator" {} \;
./azkaban-exec-server/build/install/azkaban-exec-server/conf/azkaban.properties:59:azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
./azkaban-exec-server/build/install/azkaban-exec-server/conf/azkaban.properties:60:azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
./azkaban-exec-server/build/install/azkaban-exec-server/conf/azkaban.properties:61:azkaban.executorselector.comparator.Memory=1
./azkaban-exec-server/build/install/azkaban-exec-server/conf/azkaban.properties:62:azkaban.executorselector.comparator.LastDispatched=1
./azkaban-exec-server/build/install/azkaban-exec-server/conf/azkaban.properties:63:azkaban.executorselector.comparator.CpuUsage=1
./azkaban-web-server/build/install/azkaban-web-server/conf/azkaban.properties:59:azkaban.executorselector.filters=StaticRemainingFlowSize,MinimumFreeMemory,CpuStatus
./azkaban-web-server/build/install/azkaban-web-server/conf/azkaban.properties:60:azkaban.executorselector.comparator.NumberOfAssignedFlowComparator=1
./azkaban-web-server/build/install/azkaban-web-server/conf/azkaban.properties:61:azkaban.executorselector.comparator.Memory=1
./azkaban-web-server/build/install/azkaban-web-server/conf/azkaban.properties:62:azkaban.executorselector.comparator.LastDispatched=1
./azkaban-web-server/build/install/azkaban-web-server/conf/azkaban.properties:63:azkaban.executorselector.comparator.CpuUsage=1```
lcaaaat commented 5 years ago

The free memory of azkaban executor server must be more than 6G if you enable "MinimumFreeMemory". If not, web server will not dispatch flow. So, you can solve it by removing "MinimumFreeMemory" in "azkaban.executorselector.filters" or using another server which has more free memory to deploy azkaban executor server.

trentgerman commented 5 years ago

excellent! I will try that.

Any idea why the webserver will dispatch the flow after a bounce?