azkaban / azkaban

Azkaban workflow manager.
https://azkaban.github.io
Apache License 2.0

Grouping of executors in azkaban. #670

Open mukund-thakur opened 8 years ago

mukund-thakur commented 8 years ago

Hi all, we at https://www.flipkart.com/ use Azkaban as our workflow manager for submitting Hive jobs to our Hadoop cluster. We have been planning to implement grouping of executors so that users can submit flows to specific executors depending on a group name. This feature will be backward compatible. Here is our thinking:

The executors table will contain an extra column called "group". Users can assign a group to each executor host. When submitting a flow, the user can specify the group name as a parameter; this parameter is optional. Executors are filtered by the group name, and the flow is submitted to one of the filtered executors using load balancing. If the user doesn't specify a group name, submission falls back to the current behavior. If the user specifies a group name that is not present in the executors table, the API throws an exception with an appropriate message.
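
A minimal Java sketch of the dispatch rule described above, assuming an in-memory list of executors. The class GroupAwareDispatcher, the Executor record, and the select method are hypothetical names, not Azkaban's API, and round-robin only stands in for whatever load balancing Azkaban actually applies among the filtered executors.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

// Hypothetical class for illustration; not part of Azkaban.
public class GroupAwareDispatcher {
    // Hypothetical stand-in for a row of the executors table plus the proposed group column.
    public record Executor(String host, int port, String group) {}

    private final List<Executor> executors;
    // One rotation counter per requested group ("" means no group was given).
    private final Map<String, AtomicInteger> counters = new ConcurrentHashMap<>();

    public GroupAwareDispatcher(List<Executor> executors) {
        this.executors = List.copyOf(executors);
    }

    /** Chooses an executor for a flow; groupName is the optional submit parameter and may be null. */
    public Executor select(String groupName) {
        boolean noGroup = groupName == null || groupName.isEmpty();
        List<Executor> candidates;
        if (noGroup) {
            // No group: keep the existing behavior and consider every executor.
            candidates = executors;
        } else {
            candidates = executors.stream()
                    .filter(e -> groupName.equals(e.group()))
                    .collect(Collectors.toList());
            if (candidates.isEmpty()) {
                // Unknown group name: reject the submission with a clear message.
                throw new IllegalArgumentException("Group name is not valid: " + groupName);
            }
        }
        if (candidates.isEmpty()) {
            throw new IllegalStateException("No executors available");
        }
        // Round-robin is only a placeholder for Azkaban's own load-balancing logic.
        AtomicInteger counter = counters.computeIfAbsent(noGroup ? "" : groupName, k -> new AtomicInteger());
        return candidates.get(Math.floorMod(counter.getAndIncrement(), candidates.size()));
    }
}
```

A null or empty group keeps the existing behavior over all executors, while an unknown group fails fast instead of leaving the flow queued with nowhere to run.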

Vimos commented 8 years ago

Upvote for the idea. I think jobs of the same flow should be assignable to different executor groups: the group becomes a job property, executor servers belong to different groups, and ExecutorFilter can find the right group for each job in the flow.
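
If the group becomes a job property as suggested, the group for each job could be resolved roughly as follows. This is a sketch under assumptions: the property key executor.group and the JobGroupResolver class are hypothetical, and a job without its own setting simply inherits a flow-level group (which may be none).

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical class for illustration; not part of Azkaban.
public final class JobGroupResolver {
    // Hypothetical job property name; not an existing Azkaban key.
    public static final String GROUP_KEY = "executor.group";

    private JobGroupResolver() {}

    /**
     * Maps each job name to the executor group it should run in: the job's own
     * property if set, otherwise the flow-level group (which may itself be null).
     */
    public static Map<String, String> resolve(Map<String, Properties> jobProps, String flowGroup) {
        Map<String, String> groups = new LinkedHashMap<>();
        for (Map.Entry<String, Properties> job : jobProps.entrySet()) {
            groups.put(job.getKey(), job.getValue().getProperty(GROUP_KEY, flowGroup));
        }
        return groups;
    }
}
```

An executor filter could then match each job against executors of its resolved group instead of using one group for the whole flow.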

xfaris commented 7 years ago

Upvote

HappyRay commented 7 years ago

Thanks for the suggestion.

I am interested in hearing more specific use cases, e.g. why this feature is useful for you.

xfaris commented 7 years ago

The whole system consists of many services with different environment configurations, applications, and infrastructure. We have to build several Azkaban systems to support this.

mukund-thakur commented 7 years ago

@HappyRay The real motivation behind this idea is that we have different classes of Linux machines with different memory, disk, and other configurations. At the same time, we have different classes of jobs, such as low/high-priority jobs and test/production jobs. Since Azkaban 3.0 has multi-executor support, we can use executor grouping to submit different classes of jobs to different executors, isolating job execution by business use case. We can even control the number of executors assigned to each class of jobs.

We have already implemented this in a fork of the latest Azkaban 3.0 codebase and have been using it successfully in production for the last 6 months.

HappyRay commented 7 years ago

@mukund-thakur Thanks. It makes sense. May I ask what your design is? We are thinking about implementing this feature in the future.

mukund-thakur commented 7 years ago

@HappyRay Here is the design I have implemented in our version:

The executors table will contain an extra column called "group". Users can assign a group name to each executor host. When submitting a flow, the user can specify the group name as a parameter; this parameter is optional. Executors are filtered by the group name, and the flow is submitted to one of the filtered executors using load balancing. If the user doesn't specify a group name, submission falls back to the current behavior. If the user specifies a group name that is not present in the executors table, the API throws an exception saying the group name is not valid.

Any suggestions would be appreciated.

HappyRay commented 7 years ago

Thanks.

mukund-thakur commented 7 years ago

Hi @HappyRay, what do you think of my design approach? Would you like me to start working on this and create a pull request against the latest Azkaban codebase?

ameyamk commented 7 years ago

One thing I'd add: we should make sure that a single executor JVM can support multiple groups/pools, so this is a many-to-many relationship.

Azkaban can have multiple executors, and all of them share the executor groups/pools. We should also make sure there is a default executor group/pool, so that any flow submitted without a group/pool lands in the default pool.
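
The many-to-many relationship and the default pool could be modelled along these lines. The ExecutorPoolRegistry class and its methods are hypothetical, shown only to illustrate that one executor can appear in several pools and that a flow submitted without a pool is routed to the default one.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical class for illustration; not part of Azkaban.
public class ExecutorPoolRegistry {
    private final String defaultPool;
    // executor id -> pools it serves; one executor JVM may serve several pools (many-to-many).
    private final Map<String, Set<String>> poolsByExecutor = new HashMap<>();

    public ExecutorPoolRegistry(String defaultPool) {
        this.defaultPool = defaultPool;
    }

    /** Adds the executor to each of the given pools. */
    public void assign(String executorId, String... pools) {
        poolsByExecutor.computeIfAbsent(executorId, k -> new TreeSet<>()).addAll(Arrays.asList(pools));
    }

    /** Executors eligible for a flow; a flow submitted without a pool uses the default pool. */
    public SortedSet<String> executorsFor(String pool) {
        String target = (pool == null || pool.isEmpty()) ? defaultPool : pool;
        SortedSet<String> eligible = new TreeSet<>();
        for (Map.Entry<String, Set<String>> entry : poolsByExecutor.entrySet()) {
            if (entry.getValue().contains(target)) {
                eligible.add(entry.getKey());
            }
        }
        return eligible;
    }
}
```

Keeping the mapping per executor, rather than a single pool column, is what lets one JVM serve both "default" and "adhoc" flows in this sketch.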

HappyRay commented 7 years ago

@mukund-thakur Your design should work. However, we are thinking about scaling out the web server. One option we are considering is to leverage a distributed queue to distribute flows/jobs. This may affect the design of the executor pool and flow priority support.

mukund-thakur commented 7 years ago

@HappyRay Thanks. Do let me know if I can contribute to any other enhancements.

burgerkingeater commented 7 years ago

@mukund-thakur Thanks for your suggestion. Can you share more details about your use case or production experience with this change? For example, your executor host/group setup, the number of jobs running on them daily, etc. Our flow/job configuration might change in the future, which could affect the implementation details, but we are very interested in learning about the user experience.

mukund-thakur commented 7 years ago

@chengren311
We have an in-house framework that creates a DAG of Hadoop and Hive jobs with dependencies and submits it to Azkaban. Our Azkaban 3.0 multi-executor setup has one web server and 16 executors in four pools (scheduled, adhoc, test, tagged), and each pool has a different configuration. We run about 6,000 jobs daily in total. This change went in around July 2016 and has been running fine ever since.

burgerkingeater commented 7 years ago

@mukund-thakur Thanks for sharing those details. We are interested in taking this feature into the main branch. My plan is we set up a development branch which you and I can both contribute code to and review each other's code so that we can iterate faster. We will merge it into master once the PR is approved by other azkaban team members. What do you think?

ameyamk commented 7 years ago

We are super interested in this. @mukund-thakur, can we move forward on this one?

mukund-thakur commented 7 years ago

Yeah, let's do this. I have already added this feature to our production Azkaban code. Let's create a new branch with master as the base branch. I will cherry-pick my commits and then we can proceed. @chengren311 can review my initial changes.

burgerkingeater commented 7 years ago

@mukund-thakur thanks. I created a branch in my azkaban fork repo called executorpool_dev, and just sent you an invitation as collaborator.

burgerkingeater commented 7 years ago

We had a discussion in the team and here is the outcome of it:

  1. We should have a consistent name for the feature. Internally we initially used "executor pool", while @mukund-thakur's pull request uses "executor group". I don't have a strong preference between the two, but a minor issue with "group" is that it's a MySQL keyword, which adds a bit of complexity when writing SQL queries. So we think it's better to call it "pool".
  2. Each executor's pool name should be configurable in the executor's property file.
  3. We should have a default pool that an execution goes to if its pool name is not specified. The default pool's name should also be configurable in the web server's property file. If the default pool name is not specified in the config file, we will use a hardcoded name. If an executor doesn't have a pool name, its pool name is the default pool's name.
  4. In the execution history page of the Azkaban UI (WEB_SERVER_URL:WEB_SERVER_PORT/executor), we should add a column showing the pool name of each execution.
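
Points 2 and 3 amount to a small resolution rule, sketched below in Java under assumptions: the property keys and the fallback name "default" are made up for illustration and are not real Azkaban configuration keys. In a real deployment the executor and the web server read different files; passing both here only keeps the sketch self-contained.

```java
import java.util.Properties;

// Hypothetical class for illustration; not part of Azkaban.
public final class PoolConfig {
    // Hypothetical property keys and fallback name, for illustration only.
    public static final String EXECUTOR_POOL_KEY = "executor.pool.name";
    public static final String DEFAULT_POOL_KEY = "executor.default.pool.name";
    public static final String HARDCODED_DEFAULT_POOL = "default";

    private PoolConfig() {}

    /** Point 3: default pool from the web server's properties, else a hardcoded name. */
    public static String defaultPoolName(Properties webServerProps) {
        return nonBlankOr(webServerProps.getProperty(DEFAULT_POOL_KEY), HARDCODED_DEFAULT_POOL);
    }

    /** Points 2 and 3: the executor's own pool if configured, else the default pool. */
    public static String executorPoolName(Properties executorProps, Properties webServerProps) {
        return nonBlankOr(executorProps.getProperty(EXECUTOR_POOL_KEY), defaultPoolName(webServerProps));
    }

    // Treats missing or whitespace-only values as "not specified".
    private static String nonBlankOr(String value, String fallback) {
        return (value == null || value.isBlank()) ? fallback : value.trim();
    }
}
```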

mukund-thakur commented 7 years ago

For point 1: That makes sense. I will change the name to pool.

For point 2: What benefit do we get by putting the pool name in the executor property file? It would just be a redundant value. The reason I say this is that it's the web server that initializes the ExecutorManager, which holds all the information about executors. The main thing I'm trying to understand is how you plan to use the pool value from the properties file.

For point 3: We can keep the default pool in the web server's property file. But if no executor in the database has the default pool name, flows with no pool specified will starve during submission. Also, by doing this we force other Azkaban users to use the executor pool feature even if they don't want to.
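
The starvation risk in point 3 can be made visible with a startup check. This is a hypothetical safeguard for illustration, not part of either proposal in the thread; DefaultPoolCheck and its input map of active executor ids to pool names are assumptions.

```java
import java.util.Map;
import java.util.Objects;

// Hypothetical class for illustration; not part of Azkaban.
public final class DefaultPoolCheck {
    private DefaultPoolCheck() {}

    /**
     * Throws if no active executor belongs to the default pool; without such an
     * executor, flows submitted with no pool would have nowhere to run.
     */
    public static void requireNonEmptyDefaultPool(Map<String, String> poolByActiveExecutor, String defaultPool) {
        boolean served = poolByActiveExecutor.values().stream()
                .anyMatch(pool -> Objects.equals(pool, defaultPool));
        if (!served) {
            throw new IllegalStateException("No active executor in default pool '" + defaultPool + "'");
        }
    }
}
```

Failing at startup turns a silent queueing problem into an explicit configuration error.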

For point 4: I need to check how the history page works. I will make any AJAX API changes that are required. I have no experience with UI work, so I will need some help with that.

mukund-thakur commented 7 years ago

Edit for point 2: Actually, I was not aware of insertExecutorEntryIntoDB(). Now I understand how the value in the properties file can be used. But I want to point out a serious problem with this approach. Suppose some executors of an important pool (say prod_pool) go down one day because of some issue, and we want to add executors from a different pool (say stage_pool) to prod_pool. Doing that through the config file would require restarting the stage_pool executors, which would kill all running flows in stage_pool. But if we directly change just the pool name in the executors table and reload the executors, all new prod flows will go to the former stage_pool executors. This gives us graceful degradation.

I agree we shouldn't access the database directly in production environments. But to solve that, we can build APIs and a UI for editing executors that admins can use.

I would also suggest removing the executor server's dependency on property files in the future. This relates to the same issue: all running flows get killed if we have to change a property such as max concurrent jobs and restart the executor server. In my opinion, we should keep all executor properties in the executors table only (maybe in JSON format) and provide an API to reload them, just like reloading executors.

burgerkingeater commented 7 years ago

@mukund-thakur Thanks for your input; I think your concern about point 2 is valid. But changing an executor's pool name doesn't necessarily require restarting the executor. We can:

  1. change the pool name in executor db
  2. reload executors
  3. change the pool name in config file to make it consistent with what's in db.
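
The three steps can be modelled in memory to show why no restart is needed: the web server's view of executors changes only on reload, and the executor process itself is never touched. PoolReassignment and its methods are hypothetical stand-ins for the executors table and the reload-executors operation, not Azkaban code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical class for illustration; not part of Azkaban.
public class PoolReassignment {
    // Stand-in for the executors table: executor id -> pool name.
    private final Map<String, String> executorTable = new HashMap<>();
    // The web server's cached view of executors; only reload() refreshes it.
    private Map<String, String> loaded = Map.of();

    public void register(String executorId, String pool) {
        executorTable.put(executorId, pool);
    }

    // Step 1: change the pool name in the executors table (no executor restart).
    public void updatePool(String executorId, String newPool) {
        if (!executorTable.containsKey(executorId)) {
            throw new IllegalArgumentException("Unknown executor: " + executorId);
        }
        executorTable.put(executorId, newPool);
    }

    // Step 2: reload executors so that later dispatch decisions see the new pool names.
    public void reload() {
        loaded = Map.copyOf(executorTable);
    }

    // Step 3 (not modelled): update the executor's property file to match the table.

    /** Executors in the given pool, according to the last reload. */
    public SortedSet<String> executorsIn(String pool) {
        SortedSet<String> result = new TreeSet<>();
        loaded.forEach((id, p) -> {
            if (p.equals(pool)) {
                result.add(id);
            }
        });
        return result;
    }
}
```

Until reload() runs, the old assignment stays in effect, which mirrors the ordering of steps 1 and 2 above.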

mukund-thakur commented 7 years ago

Yes, that is what we currently do in our prod cluster, but it still requires accessing the database directly. Anyway, putting the pool in the executor server properties makes sense to keep the insertExecutorEntryIntoDB() call compatible.

burgerkingeater commented 7 years ago

@mukund-thakur Regarding your thoughts on point 3: as we discussed on Google Hangouts, if an executor has no pool name specified, its pool name should be the default pool name.

luoguohao commented 6 years ago

How are things progressing? We also have some use cases for this.

ameyamk commented 6 years ago

The current line of thinking is that this will be done by decoupling Azkaban and Hadoop. This way we can launch jobs into multiple Hadoop clusters from the same executor.

This is going to take some time, though. The above patch is (was) fully working, so you can pull it into your own build if this is urgent for you.

luoguohao commented 6 years ago

Thanks a lot for your proposal. I will try it if I need to.

mudit-97 commented 3 years ago

Hi all, what is the current progress on this? Is this merged? If not, would like to be part of it to contribute