ChrisMcKenzie / styx

A Powerful Workflow-Oriented CI/CD Platform.

Job Scheduler #1

Open ChrisMcKenzie opened 8 years ago

ChrisMcKenzie commented 8 years ago

In order to execute Workflows at a large scale, we will need to design some sort of scheduler that will be responsible for finding the ideal "worker" node to run on. Off the top of my head, it will need to take into account the following metrics: CPU, memory, and disk.

I also think it will be important to have an affinity towards nodes that already have the required Docker images; this will allow quicker time-to-build.
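A rough sketch of how those metrics plus image affinity could rank nodes — the struct fields, weights, and the size of the affinity bonus are all hypothetical illustration, not decided behavior:

```go
package main

import "fmt"

// NodeStats is a hypothetical snapshot of the metrics mentioned above.
type NodeStats struct {
	Name     string
	CPUFree  float64 // fraction of CPU free, 0..1
	MemFree  float64 // fraction of memory free, 0..1
	DiskFree float64 // fraction of disk free, 0..1
	HasImage bool    // node already has the required Docker image
}

// score ranks a node: higher is better. The affinity bonus for already
// having the image is an arbitrary illustrative weight, not a tuned value.
func score(n NodeStats) float64 {
	s := n.CPUFree + n.MemFree + n.DiskFree
	if n.HasImage {
		s += 1.0 // affinity bonus: placing here skips the image pull
	}
	return s
}

// pickNode returns the highest-scoring node, or "" if none are given.
func pickNode(nodes []NodeStats) string {
	best, bestScore := "", -1.0
	for _, n := range nodes {
		if s := score(n); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best
}

func main() {
	nodes := []NodeStats{
		{Name: "worker-1", CPUFree: 0.9, MemFree: 0.8, DiskFree: 0.9, HasImage: false},
		{Name: "worker-2", CPUFree: 0.5, MemFree: 0.6, DiskFree: 0.7, HasImage: true},
	}
	// worker-2 wins despite being busier, because it already has the image.
	fmt.Println(pickNode(nodes)) // prints "worker-2"
}
```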

@IanSchweer I would love to have you help with this if you are interested.

Ch0ronomato commented 8 years ago

@ChrisMcKenzie

Yeah, I'm willing to take a look at this. We should probably schedule a call or something so we can have an open discussion about this stuff.

ChrisMcKenzie commented 8 years ago

We can do that; however, I would like any important information regarding the scheduler to stay here so we have a historical record.

With that being said, let's schedule something on Slack to discuss this.

Ch0ronomato commented 8 years ago

Updating this so it doesn't seem like a dead ticket.

I've been thinking about this, and I have a few questions and concerns. @ChrisMcKenzie, these are probably only because I lack the full specs.

  1. How is process creation managed?
  2. What type of architecture are we imposing? Centralized or decentralized?
  3. Are we sure a few metrics are enough?
  4. How are we recording these metrics, and how are we applying them?
  5. How is our scheduler different from the node scheduler?
  6. Do we really need a sophisticated scheduler?

Explanation

  1. I think it's important to know this basic step. I'm assuming we're forking and exec'ing the tasks. It's possible that one of these pipelines in a workflow is a resource hog, and (presumably) that is why we need to assign the next task to another node. Is this assumption correct? Or are we assigning pipelines?
  2. Are we able to "ping" one node (a central node) to find the current laziest node? Or do we need to search through all the nodes to find the best one?
  3. There are a lot of cases I can think of where just CPU, memory, and disk usage aren't enough (I'm grouping disk reads and writes as one thing there). What about network traffic? What about GPU utilization? Or more than just CPU utilization time, like CPU wait time? It's very possible that some task could require significant network time (say, a big curl request), or that some task is using the GPU to do part of its build (a lot of people in the academic world do this, like CarlSIM at UCI. It's pretty neat what they do to build, but enough pontificating).
  4. How are we recording these metrics? Do we just go and perform top (not actually top, but you know what I mean) on each machine? After we retrieve these metrics, are we going to store them somewhere?
  5. Are the nodes going to be just basic *nix boxes with every other process nice'd into the dirt? How can we guarantee that our submitted task is actually being run by the OS, and that the metrics we record aren't reflective of some other task being run on the machine?
  6. Do we really need something sophisticated? Even if we can assure that these metrics are indicative of the tasks we're running, are we really seeing a benefit? Say we have one monster machine and two little baby machines, and we put these metrics through some function such that 0 <= f(metrics) <= 1, where 0 means "I'm completely free" and 1 means "I'm completely busy". The big machine may take a long time to accumulate enough work (say 100 tasks) to be doing as much as the baby machines. Then we're at the whim of the OS scheduler, which could take a while given the Linux Completely Fair Scheduler, meaning the baby machines could get more work done in that time, making them seem better than they are (so essentially we're invalidating our own metrics, if that makes sense). It might honestly be better to just assign the jobs in a queue-like fashion, and have the nodes report when they're done to get a new task (or even pipeline).

We should put these answers in the wiki, as well as the answers to the questions in #2.
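To make point 6 concrete, here is a minimal sketch of one possible f(metrics): a weighted sum of utilization fractions, clamped to [0, 1]. The weights are made up purely for illustration:

```go
package main

import "fmt"

// busyScore is one possible f(metrics): a weighted sum of utilization
// fractions, clamped so that 0 means "completely free" and 1 means
// "completely busy". The weights are illustrative, not tuned values.
func busyScore(cpu, mem, disk float64) float64 {
	f := 0.5*cpu + 0.3*mem + 0.2*disk
	switch {
	case f < 0:
		return 0
	case f > 1:
		return 1
	}
	return f
}

func main() {
	fmt.Println(busyScore(2.0, 1.0, 1.0)) // overloaded node, clamped to 1
	fmt.Println(busyScore(0.2, 0.1, 0.0)) // mostly idle node, near 0
}
```

Note this function says nothing about absolute capacity, which is exactly the monster-vs-baby-machine problem above: the same score on two very different machines does not mean the same amount of headroom.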

Ch0ronomato commented 8 years ago

p.s. Here is a paper I've been reading about the issue: https://cseweb.ucsd.edu/classes/sp99/cse221/projects/Scheduling.pdf

ChrisMcKenzie commented 8 years ago

My apologies for not putting enough detail into this issue. Let me try to answer your questions in order.

How are processes managed?

I envision all execution taking place inside of a Docker container; this takes care of several concerns, such as isolation. I also hope to allow users to use a Dockerfile as a build pipeline.
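For illustration, a step could be launched by shelling out to the docker CLI. The image name and resource flags below are assumptions to show the idea, not decided behavior:

```go
package main

import (
	"fmt"
	"os/exec"
)

// runStep builds (but here does not execute) a `docker run` invocation for
// one pipeline step. Running inside a container gives us isolation and
// per-step resource limits; the caps below are hypothetical examples.
func runStep(image string, command ...string) *exec.Cmd {
	args := append([]string{
		"run", "--rm", // throw the container away when the step ends
		"--memory", "512m", // hypothetical per-step memory cap
		"--cpus", "1", // hypothetical per-step CPU cap
		image,
	}, command...)
	return exec.Command("docker", args...)
}

func main() {
	cmd := runStep("golang:1.7", "go", "test", "./...")
	fmt.Println(cmd.Args)
	// To actually execute the step: err := cmd.Run()
}
```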

What type of Architecture?

I think the best choice for this type of system is a more centralized approach with 1 or more coordinating machines controlling many worker nodes.

Are we sure a few metrics are enough?

I believe so; I detail my thoughts about this more under the sophisticated-scheduler question below.

How are we recording these metrics, and how are we applying them?

I have had reasonable success using datastores like InfluxDB for this purpose, but I believe that will be something we need to agree on.

How is our scheduler different from the node scheduler?

The scheduler will be outside of the worker nodes looking in, taking only a holistic view of the cluster, not an in-depth per-process examination. Also, there will not be a node scheduler.

The flow I envision goes like this:

  1. The user sends a job to our user-facing API.
  2. The API passes this to the scheduler.
  3. The scheduler then finds the first available node.

Essentially this is a bin-pack operation, as we have no way of knowing how hungry a job will be. There are certainly some flaws to this, but I feel that it will be a good start.
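A minimal sketch of that first-available-node placement, assuming each job occupies one fixed-size slot since we can't predict its appetite (the slot counts are made-up values):

```go
package main

import (
	"errors"
	"fmt"
)

// Node tracks how many concurrent job slots a worker has left. Since we
// can't know in advance how hungry a job will be, each job is assumed to
// consume one fixed-size slot (the bin-pack simplification above).
type Node struct {
	Name      string
	FreeSlots int
}

// firstFit walks the cluster in order and places the job on the first
// node with a free slot, returning an error when every node is full.
func firstFit(cluster []*Node) (string, error) {
	for _, n := range cluster {
		if n.FreeSlots > 0 {
			n.FreeSlots--
			return n.Name, nil
		}
	}
	return "", errors.New("no available node: job stays queued")
}

func main() {
	cluster := []*Node{{"worker-1", 1}, {"worker-2", 2}}
	for i := 0; i < 3; i++ {
		name, err := firstFit(cluster)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Println("job", i, "->", name)
	}
}
```

When firstFit fails, the job would simply wait in the queue until a worker reports a finished task and frees a slot — which is also roughly the queue-like fallback suggested above.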

Do we really need a sophisticated scheduler?

I don't believe so; my only concern is that we do our best to make sure jobs are not left queued for longer than they need to be.