Open mukund-thakur opened 7 years ago
Hi and thank you for the feedback. Could you provide more information about your proposal? Is this to replace the backend relational database with a key-value store (ZooKeeper), or to use ZooKeeper as a distributed lock? If possible, please provide examples of where and how this has been implemented. Thanks.
+1 on the HA web server setup. Currently I have our Azkaban implementation set up on 2 separate servers, with the MySQL database replicating from one to the other. I have installed the web server on both, but can only have the JVM up on the primary; if it is up on the secondary as well, it will corrupt the database. It would be great if we could have multiple web server instances online with a load balancer in front.
Yes, we would like to scale out web servers too.
We are actively researching ways to do that, e.g. using a message queue such as Kafka to hold the runnable flow queue instead of keeping this state in the web server.
Ideas are welcome.
@HappyRay Is there a complete design document for Azkaban? It would really help us build a good design for making Azkaban HA.
Hi Mukund - it is a good idea to have a design doc. As part of our next work we are going to work on two things that should directly answer these questions.
I have two ideas to solve this.
IDEA 1: We put all the in-memory state of the Azkaban web server (runnableFlows, etc.) in some data store (DS). When the active server goes down, an event is sent so that the passive one loads everything into memory from the DS. We can use the leader election recipe provided by ZooKeeper to send the event to the passive node.
IDEA 2: We put all the in-memory state of the Azkaban web server (runnableFlows, etc.) in some data store (DS). There will be multiple web servers that read from and write to this DS directly rather than keeping state in memory. Clients will connect to a load balancer that sits in front of all the web servers.
Choice of data store (DS): We can evaluate MySQL, Couchbase, and Kafka (as suggested by @HappyRay) and decide which one to use.
Idea 2 is what we are implementing, with the MySQL DB remaining the data store. We can replace it with something else if it proves to be a performance bottleneck.
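To make IDEA 2 concrete, here is a minimal sketch, not Azkaban's actual code. It assumes a hypothetical `runnable_flows` table and a hypothetical `WebServer` class, and uses an in-memory SQLite database purely as a stand-in for the shared data store. The point is only that each server keeps no flow state of its own, so either one can answer any request.

```python
import sqlite3


def make_store():
    """Create the shared store (SQLite here as a stand-in for the real DS)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE runnable_flows (flow_id TEXT PRIMARY KEY, submitted_by TEXT)"
    )
    return conn


class WebServer:
    """Hypothetical stateless web server: it caches no flow state in memory."""

    def __init__(self, name, conn):
        self.name = name
        self.conn = conn

    def submit_flow(self, flow_id):
        # Write straight to the shared store instead of an in-memory queue.
        with self.conn:
            self.conn.execute(
                "INSERT INTO runnable_flows (flow_id, submitted_by) VALUES (?, ?)",
                (flow_id, self.name),
            )

    def list_flows(self):
        # Every read goes to the store, so all servers see the same state.
        rows = self.conn.execute(
            "SELECT flow_id FROM runnable_flows ORDER BY flow_id"
        ).fetchall()
        return [row[0] for row in rows]
```

With two `WebServer` objects built on the same store, a flow submitted through one is visible through the other, which is what lets a load balancer send a client to any instance.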
This should be ready and in open source by mid July.
Does that sound good?
This sounds great.
When can we expect the release?
Likely around August 2017. Roadmap here: https://github.com/azkaban/azkaban/wiki/Azkaban-4.x--Roadmap
Hi. I have a question about the HA implementation. Do you plan to restart jobs in 'Running' status once HA is done? Currently, running jobs move to the Failed state when the server is manually restarted after a crash. Also, suppose I have a periodic scheduled job running every 1 minute and the current time is 7:00. My Azkaban server crashes and restarts at 7:10. Job instances between 7:00 and 7:10 will be missed. Do you plan to relaunch these instances as well once HA is in place?
Would stickiness be a factor that needs to be taken into account as well? I have a user who is deployed behind an AWS ELB and, as a result, loses sessions (the client IP changes in the headers). Could X-Forwarded-For be a workaround?
Any news here? The roadmap says Azkaban 4.0 should have HA web servers and should have been released in Q2 2017, but only 3.xx versions are available so far.
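For the missed-instance question, the arithmetic can be sketched as follows. This is only an illustration of which fire times fall in the outage window; `missed_fire_times` is a hypothetical helper, not an Azkaban function, and whether the instance due at the restart time itself counts as missed is an assumption made here.

```python
from datetime import datetime, timedelta


def missed_fire_times(last_fired, restarted_at, period):
    """Return the scheduled times in (last_fired, restarted_at] that never ran."""
    if period <= timedelta(0):
        raise ValueError("period must be positive")
    missed = []
    next_time = last_fired + period
    # Walk forward one period at a time until we pass the restart time.
    while next_time <= restarted_at:
        missed.append(next_time)
        next_time += period
    return missed
```

For a job that last fired at 7:00 with a 1-minute period and a restart at 7:10, the function returns the fire times from 7:01 through 7:10.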
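As a rough sketch of the X-Forwarded-For idea: if a session is tied to the client address, the server could take that address from the header the load balancer adds rather than from the TCP peer. `client_ip` is a hypothetical helper, the header lookup assumes a plain dict with that exact key, and this is not how Azkaban itself resolves sessions.

```python
def client_ip(headers, peer_ip):
    """Pick the original client address, falling back to the TCP peer."""
    forwarded = headers.get("X-Forwarded-For", "")
    # The header is a comma-separated list; the left-most entry is the
    # client address as reported by the first proxy in the chain.
    first = forwarded.split(",")[0].strip()
    return first or peer_ip
```

Note that a client can set this header itself, so it should only be trusted when requests can reach the web server solely through the load balancer.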
I am looking forward to the HA to be released
We'll definitely prioritize the web server HA work. The first step, removing the cache on the web server, is already implemented but not enabled yet. Once the web server becomes stateless, we can proceed with the next step of bringing up multiple web servers. This needs to be carefully designed and tested though. Thanks for your patience; please expect this to take more time on our side.
Any update here? What is the expected time for the Azkaban HA release?
@jamiesjc @hreview
@ameyamk Any update please let us know.
Any news guys ?
Azkaban web is now our single point of failure and HA is more than desirable. We tried to run 2 instances behind a VIP, and all jobs get duplicated :/
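One common way to avoid this kind of duplication is to have each instance atomically claim a trigger in the shared database before launching it. The sketch below assumes a hypothetical `triggers` table with a `claimed_by` column and uses in-memory SQLite as a stand-in for MySQL; it is not Azkaban's schema or scheduler code.

```python
import sqlite3


def make_trigger_store():
    """Create a stand-in store with one row per scheduled trigger."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triggers (trigger_id TEXT PRIMARY KEY, claimed_by TEXT)"
    )
    return conn


def add_trigger(conn, trigger_id):
    with conn:
        conn.execute(
            "INSERT INTO triggers (trigger_id, claimed_by) VALUES (?, NULL)",
            (trigger_id,),
        )


def try_claim(conn, trigger_id, server):
    """Return True only for the server whose UPDATE hit the unclaimed row."""
    with conn:
        cur = conn.execute(
            "UPDATE triggers SET claimed_by = ? "
            "WHERE trigger_id = ? AND claimed_by IS NULL",
            (server, trigger_id),
        )
    # rowcount is 1 for the winner and 0 for anyone who arrives later.
    return cur.rowcount == 1
```

Only the instance for which `try_claim` returns True would launch the job; the other sees the row already claimed and skips it.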
Any news?
seems no news
Currently the Azkaban web server is a SPOF, which is a very serious problem. We should invest in using ZooKeeper to make the Azkaban web server work in active/standby mode.
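To illustrate the active/standby idea, here is a small lease-based sketch. It does not use ZooKeeper: `LeaseStore` is a hypothetical in-memory stand-in for a coordination service, and a time-limited lease is swapped in for ZooKeeper's leader election recipe (which relies on ephemeral sequential znodes). Times are passed in explicitly so the behavior is easy to follow.

```python
class LeaseStore:
    """In-memory stand-in for a coordination service such as ZooKeeper."""

    def __init__(self):
        self.holder = None
        self.expires_at = 0.0

    def try_acquire(self, node, now, ttl):
        """Grant or renew the lease if it is free, expired, or held by node."""
        free = self.holder is None
        expired = now >= self.expires_at
        if free or expired or self.holder == node:
            self.holder = node
            self.expires_at = now + ttl
            return True
        # Another node holds a live lease, so this node stays on standby.
        return False
```

The active server keeps renewing its lease; if it crashes and stops renewing, the standby's next `try_acquire` after the expiry succeeds and it takes over.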