azkaban / azkaban

Azkaban workflow manager.
https://azkaban.github.io
Apache License 2.0

Implement HA solution for Azkaban Web Server #952

Open mukund-thakur opened 7 years ago

mukund-thakur commented 7 years ago

Currently the Azkaban web server is a single point of failure (SPOF), which is a very serious problem. We should invest in using ZooKeeper to make the Azkaban web server work in active/standby mode.

li-afaris commented 7 years ago

Hi, and thank you for the feedback. Could you provide more information about your proposal? Is this to replace the backend relational database with a key-value store (ZooKeeper), or to use ZooKeeper as a distributed lock? If possible, please provide examples of where and how this has been implemented. Thanks.

devangorder commented 7 years ago

+1 on the HA web server setup. Currently I have our Azkaban implementation set up on 2 separate servers, with the MySQL database replicating from one to the other. I have installed the web server on both, but can only have the JVM up on the primary: if it is up on the secondary as well, it will corrupt the database. It would be great if we could have multiple web server instances online with a load balancer in front.

HappyRay commented 7 years ago

Yes, we would like to scale out web servers too.

We are actively researching ways to do that, e.g. using a message queue such as Kafka to hold the runnable flow queue instead of keeping this state in the web server.

Ideas are welcome.

mukund-thakur commented 7 years ago

@HappyRay Is there a complete design document for Azkaban? It would be really helpful for us in building a good design for making Azkaban HA.

ameyamk commented 7 years ago

Hi Mukund - it is a good idea to have a design doc. As part of our upcoming work we are going to work on two things that should directly answer this question:

  1. As Ray mentioned, we are working on scaling out the web server. As an initial step we will just move state out of the web server and provide a distributed scheduler. This way you can run multiple web servers, effectively removing the web server as a SPOF.
  2. We will also be working on improved documentation. Both should be ready and available in open source by mid July or so.

mukund-thakur commented 7 years ago

I have two ideas to solve this.

IDEA 1: We put all the in-memory state of the Azkaban web server (such as runnableFlows) in some data store (DS). When the active node goes down, an event is sent so that the passive node loads everything into memory from the DS. We can use the leader election algorithm provided by the ZooKeeper recipes to send the event to the passive node.
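A rough sketch of the active/standby idea, for illustration only. It does not talk to ZooKeeper: `ElectionRegistry` is a hypothetical in-memory stand-in that imitates the election recipe's rule (each participant holds a sequence number, the lowest live one leads), and `WebServer`, `on_election_event` and the `data_store` dict are likewise assumed names, not Azkaban code.

```python
import itertools


class ElectionRegistry:
    """In-memory stand-in for a ZooKeeper election node (hypothetical helper).

    Each participant gets an increasing sequence number, similar to an
    ephemeral sequential znode; the lowest live number is the leader.
    """

    def __init__(self):
        self._next_seq = itertools.count()
        self._members = {}  # sequence number -> server name

    def join(self, server):
        seq = next(self._next_seq)
        self._members[seq] = server
        return seq

    def leave(self, seq):
        # Models the ephemeral node vanishing when the server's session dies.
        self._members.pop(seq, None)

    def leader(self):
        if not self._members:
            return None
        return self._members[min(self._members)]


class WebServer:
    def __init__(self, name, registry, data_store):
        self.name = name
        self.registry = registry
        self.data_store = data_store
        self.seq = registry.join(name)
        self.runnable_flows = None  # populated only while this node is active

    def on_election_event(self):
        # Called whenever membership changes; a newly elected leader
        # loads the persisted state from the data store into memory.
        if self.registry.leader() == self.name and self.runnable_flows is None:
            self.runnable_flows = list(self.data_store["runnableFlows"])


data_store = {"runnableFlows": ["flow_a", "flow_b"]}  # stand-in for the DS
registry = ElectionRegistry()
active = WebServer("web-1", registry, data_store)
standby = WebServer("web-2", registry, data_store)
active.on_election_event()
standby.on_election_event()  # no-op: web-2 is not the leader

registry.leave(active.seq)   # web-1 crashes
standby.on_election_event()  # web-2 takes over and loads state from the DS
```

In a real deployment the membership change would arrive as a watch notification from ZooKeeper rather than a direct method call.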

IDEA 2: We put all the in-memory state of the Azkaban web server (such as runnableFlows) in some data store (DS). There will be multiple web servers, which will read and write directly from this DS rather than from memory. Clients will connect to a load balancer that sits in front of all the web servers.

Choice of data store (DS): We can evaluate MySQL, Couchbase and Kafka (as suggested by @HappyRay) and decide which one to use.
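To make Idea 2 concrete, here is a minimal sketch in which the web servers keep no flow state of their own and go to a shared store for every read and write. Everything in it is an assumption for illustration: the `runnable_flows` table, the `FlowStore` and `WebServer` classes, and the use of an in-memory `sqlite3` database as a stand-in for MySQL (or whichever DS is chosen).

```python
import sqlite3


class FlowStore:
    """Hypothetical shared store for runnable flows (sqlite3 stands in for the DS)."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS runnable_flows ("
            " exec_id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " flow TEXT NOT NULL,"
            " status TEXT NOT NULL)"
        )

    def submit(self, flow):
        cur = self.conn.execute(
            "INSERT INTO runnable_flows (flow, status) VALUES (?, 'QUEUED')",
            (flow,),
        )
        self.conn.commit()
        return cur.lastrowid

    def list_flows(self):
        return self.conn.execute(
            "SELECT exec_id, flow, status FROM runnable_flows ORDER BY exec_id"
        ).fetchall()


class WebServer:
    """Stateless front end: every request goes straight to the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def handle_submit(self, flow):
        return self.store.submit(flow)

    def handle_list(self):
        return self.store.list_flows()


store = FlowStore(sqlite3.connect(":memory:"))
web_a = WebServer("web-a", store)
web_b = WebServer("web-b", store)

exec_id = web_a.handle_submit("daily_report")  # request lands on web-a
listing = web_b.handle_list()                  # next request lands on web-b
```

Because neither server caches the queue, a load balancer can route each request to either instance and both return the same listing.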

hreview commented 7 years ago

Idea 2 is what we are implementing, with MySQL continuing to be the data store. We can replace it with something else if it proves to be a performance bottleneck.

This should be ready and in open source by mid July.

Does that sound good?

mukund-thakur commented 7 years ago

This sounds great.

yxydde commented 7 years ago

When can this be released?

ameyamk commented 7 years ago

Likely around August 2017. Roadmap here: https://github.com/azkaban/azkaban/wiki/Azkaban-4.x--Roadmap

goelrajat commented 7 years ago

Hi. I have a question about the HA implementation. Do you plan to restart jobs in 'Running' status once HA is in place? Currently, running jobs move to the Failed state when the server is manually restarted after a crash. Also, suppose I have a periodic job scheduled to run every minute and the current time is 7:00. My Azkaban server crashes and restarts at 7:10. Job instances between 7:00 and 7:10 will be missed. Do you plan to relaunch these instances as well once HA is in place?
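For the scheduling question, the set of missed instances can be worked out from the last successful fire time, the restart time and the period. The sketch below is only that arithmetic; `missed_fire_times` is a hypothetical helper, the date is arbitrary, and it assumes the 7:00 instance ran before the crash. Whether Azkaban would actually relaunch these instances is the open question above.

```python
from datetime import datetime, timedelta


def missed_fire_times(last_fired, restarted_at, period):
    """Return scheduled times after last_fired, up to and including restarted_at."""
    missed = []
    t = last_fired + period
    while t <= restarted_at:
        missed.append(t)
        t += period
    return missed


last_fired = datetime(2017, 3, 28, 7, 0)     # assumed: the 7:00 run completed
restarted_at = datetime(2017, 3, 28, 7, 10)  # server back up at 7:10
missed = missed_fire_times(last_fired, restarted_at, timedelta(minutes=1))
```

Under these assumptions `missed` holds the ten instances from 7:01 through 7:10.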

sellers commented 6 years ago

Would stickiness be a factor that needs to be taken into account as well? I have a user here who is deployed behind an AWS ELB and as a result loses sessions (the client IP changes in the headers). X-Forwarded-For may be a workaround?
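As an illustration of the `X-Forwarded-For` workaround: a load balancer typically appends to this header, so the leftmost entry is the address the client connected from. The helper below is a hypothetical sketch, not Azkaban's session code. It assumes header names arrive with this exact capitalization and that the header was set by a trusted proxy, since clients can otherwise forge it.

```python
def client_ip(headers, remote_addr):
    """Pick the original client address, falling back to the socket peer.

    headers: dict of request headers (assumed key casing: "X-Forwarded-For").
    remote_addr: address of the direct peer, e.g. the load balancer.
    """
    forwarded = headers.get("X-Forwarded-For", "")
    first_hop = forwarded.split(",")[0].strip()
    return first_hop or remote_addr


via_elb = client_ip({"X-Forwarded-For": "203.0.113.7, 10.0.0.5"}, "10.0.0.5")
direct = client_ip({}, "198.51.100.20")
```

Keying the session on this value rather than on the peer address would keep it stable when requests arrive through different load balancer nodes.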

steverding commented 6 years ago

Any news here? The roadmap says Azkaban 4.0 should have HA web servers and should have been released in Q2 2017, but only 3.xx versions are available so far.

gao634209276 commented 5 years ago

I am looking forward to HA being released.

jamiesjc commented 5 years ago

We'll definitely prioritize the web server HA work. The first step, removing the cache on the web server, is already implemented but not enabled yet. Once the web server becomes stateless, we can proceed with the next step of bringing up multiple web servers. This needs to be carefully designed and tested, though. Thanks for your patience; please expect this to take more time on our side.

avi-0107 commented 5 years ago

Any update here? What is the expected time frame for the Azkaban HA release?

@jamiesjc @hreview

praxnet commented 4 years ago

@ameyamk Any update? Please let us know.

oonashvili commented 4 years ago

Any news, guys?

rafilkmp3 commented 4 years ago

Azkaban web is now our single point of failure, and HA is more than desirable. We tried to run 2 instances behind a VIP, and all jobs get duplicated :/
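One general way to avoid this kind of duplication when several instances share a database is to have each instance claim a scheduled run atomically before launching it, so only one claim can succeed. The sketch below shows that technique only; it is not how Azkaban works today. The `trigger_claims` table, the `try_claim` helper and the in-memory `sqlite3` database (standing in for MySQL) are all assumptions.

```python
import sqlite3


def make_store():
    conn = sqlite3.connect(":memory:")
    # The primary key makes a second claim for the same run fail.
    conn.execute(
        "CREATE TABLE trigger_claims ("
        " schedule_id INTEGER NOT NULL,"
        " fire_time TEXT NOT NULL,"
        " owner TEXT NOT NULL,"
        " PRIMARY KEY (schedule_id, fire_time))"
    )
    return conn


def try_claim(conn, schedule_id, fire_time, owner):
    """Return True if this instance won the right to launch the run."""
    try:
        conn.execute(
            "INSERT INTO trigger_claims (schedule_id, fire_time, owner)"
            " VALUES (?, ?, ?)",
            (schedule_id, fire_time, owner),
        )
        conn.commit()
        return True
    except sqlite3.IntegrityError:
        conn.rollback()
        return False


conn = make_store()
first = try_claim(conn, 7, "2017-03-28T07:00", "web-a")   # web-a launches the run
second = try_claim(conn, 7, "2017-03-28T07:00", "web-b")  # web-b backs off
```

The instance whose insert fails simply skips the launch, so each scheduled run is started once regardless of how many web servers are up.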

lchqlchq commented 3 years ago

Any news?

sansna commented 3 years ago

Seems there is no news.