Netflix / genie

Distributed Big Data Orchestration Service
https://netflix.github.io/genie
Apache License 2.0

Cluster check task incorrectly marking running jobs failed in version 3.3.x #1065

Closed: ramakrishnayes closed this issue 3 years ago

ramakrishnayes commented 3 years ago

We recently upgraded our Genie cluster from 3.1.x to 3.3.20.

Issue: In our multi-node Genie cluster with Zookeeper enabled, the cluster checker task that verifies the health of nodes marks all jobs FAILED too quickly if a node's health endpoint is unavailable for even a couple of seconds. This effectively bypasses the 'genie.tasks.clusterChecker.rate' setting, which we set to 5 minutes, and the 'genie.tasks.clusterChecker.lostThreshold' threshold, which is set to 3.
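
For reference, a minimal sketch of the settings described above, using the property names as written in this report (the rate is assumed here to be in milliseconds):

```properties
# Hypothetical application.properties snippet matching the setup described above.
genie.tasks.clusterChecker.rate=300000
genie.tasks.clusterChecker.lostThreshold=3
```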

Scenario: Let's say the leader node 'L' is verifying the health of all other nodes 'N1', 'N2', 'N3' in the cluster. Suppose the Genie app on one node, say 'N2', is restarted at about the same time this check runs, and assume there are 4 active jobs on 'N2' at that moment. The cluster checker task on node 'L' gets all hosts with active jobs as a list; in this case it gets ['N2', 'N2', 'N2', 'N2', ...], with four entries for 'N2' because there are four active jobs running on it. The health check task then repeatedly tries to reach 'N2' (which has not finished starting up) via the health check endpoint and fails 4 times, which exceeds the 'lostThreshold' of 3.
It then marks all the jobs on 'N2' as FAILED.
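
To make the failure mode concrete, here is a small illustrative sketch (not Genie's actual implementation) of how a host list with duplicate entries exhausts the threshold in a single check cycle:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only: a single unreachable node appears once per active job,
// so it blows past lostThreshold in one pass instead of over several check cycles.
public class ClusterCheckSketch {

    private static final int LOST_THRESHOLD = 3;
    private final Map<String, Integer> errorCounts = new HashMap<>();

    // Pretend N2 is restarting, so its health endpoint is unreachable.
    private boolean isHealthy(String host) {
        return !"N2".equals(host);
    }

    public void runCheck(List<String> hostsWithActiveJobs) {
        for (String host : hostsWithActiveJobs) {
            if (!isHealthy(host)) {
                int failures = errorCounts.merge(host, 1, Integer::sum);
                if (failures >= LOST_THRESHOLD) {
                    System.out.println("Marking all jobs on " + host + " FAILED");
                }
            } else {
                errorCounts.remove(host);
            }
        }
    }

    public static void main(String[] args) {
        // With duplicates (one entry per active job), N2 hits the threshold
        // within a single check cycle.
        new ClusterCheckSketch().runCheck(Arrays.asList("N1", "N2", "N2", "N2", "N2", "N3"));
    }
}
```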

Root cause: It seems 'JpaJobSearchServiceImpl.getAllHostsWithActiveJobs()' no longer returns a unique list of hosts in the 3.3.x versions (at least in 3.3.20 and 3.3.21).
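
A sketch of the kind of fix this implies, assuming the method builds its result from a JPA query over job entities (the service and repository shapes below are hypothetical, not Genie's actual classes):

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of deduplicating the host list before health checks run.
class JobSearchServiceSketch {

    interface JobRepositorySketch {
        // Hypothetical query returning one host name per active job (may contain duplicates).
        List<String> findHostNamesOfActiveJobs();
    }

    private final JobRepositorySketch jobRepository;

    JobSearchServiceSketch(final JobRepositorySketch jobRepository) {
        this.jobRepository = jobRepository;
    }

    // Returning a Set (or using SELECT DISTINCT in the underlying query) guarantees each
    // host is health-checked once per cycle, regardless of how many jobs it is running.
    Set<String> getAllHostsWithActiveJobs() {
        return this.jobRepository.findHostNamesOfActiveJobs()
            .stream()
            .collect(Collectors.toSet());
    }
}
```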

Impact: Although this seems like a very rare case, it shows up more often the more frequently the Genie service is restarted, whether on an individual node or on all nodes. Any long-running jobs are left orphaned until we kill them explicitly.

tgianos commented 3 years ago

@ramakrishnayes Thanks for the detailed report and sorry for the delayed response.

We're not actively developing against 3.3.x as internally we've been running the 4.0.0 release candidates in production for a couple of years now. They're really only release candidates in OSS due to the need to polish up the documentation and finish a few more outstanding issues.

Is there a particular reason you're restarting the Genie server constantly on hosts? We never really had to do that, even when we were on 3.3.x, but we would bring up entirely new clusters periodically, so maybe we just never had to restart in place.

ramakrishnayes commented 3 years ago

Thank you @tgianos. I am not sure we can upgrade to 4.x since it is not officially released yet. This issue was raised to get the API method (getAllHostsWithActiveJobs) fixed so that it returns a unique list of hosts, at least in future releases.

We are planning to stop restarting the Genie service in the near future, once proper configuration is in place and we have done adequate testing.