Open ChrisMcKenzie opened 8 years ago
@ChrisMcKenzie
Yeah, I'm willing to take a look at this. We should probably schedule a call or something so we can have an open discussion about this stuff.
We can do that; however, I would like any important information regarding the scheduler to stay here so we have a historical record.
With that being said, let's schedule something on Slack to discuss this.
Updating this so it doesn't seem like a dead ticket.
I've been thinking about this, and I have a few questions and concerns about this. @ChrisMcKenzie these are probably only because I lack full specs.
Are we assigning tasks or pipelines? It's possible that one of these pipelines in a workflow is a resource hog, and (presumably) that is why we need to assign the next task to another node. Is this assumption correct, or are we assigning whole pipelines?
We should put these answers in the wiki, as well as answers to the questions in #2.
p.s. Here is a paper I've been reading about the issue: https://cseweb.ucsd.edu/classes/sp99/cse221/projects/Scheduling.pdf
My apologies for not putting enough detail into this issue. Let me try to answer your questions in order.
How are processes managed?
I envision all execution taking place inside of a Docker container; this takes care of several concerns such as isolation. I also hope to be able to allow users to use a Dockerfile as a build pipeline.
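As a rough sketch of that isolation model (the helper name, image, and resource flags below are illustrative assumptions, not a settled interface), the coordinator could shell out to the Docker CLI to run one task per container:

```python
import subprocess

def build_docker_cmd(image, task_cmd, cpu_limit="1.0", mem_limit="512m"):
    """Build a `docker run` invocation that executes one pipeline task
    in an isolated container with CPU and memory caps."""
    return [
        "docker", "run", "--rm",
        "--cpus", cpu_limit,       # cap CPU usage for the task
        "--memory", mem_limit,     # cap memory usage for the task
        image,
    ] + task_cmd

# Hypothetical task: run a build step inside an alpine container.
cmd = build_docker_cmd("alpine:3.18", ["sh", "-c", "echo build-step"])
# subprocess.run(cmd, check=True)  # would actually launch the container
```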
What type of Architecture?
I think the best choice for this type of system is a more centralized approach with 1 or more coordinating machines controlling many worker nodes.
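A minimal sketch of that shape (all names here are hypothetical): the coordinator keeps a registry of worker nodes and drops any that stop heartbeating within a timeout:

```python
import time

class Coordinator:
    """Central registry of worker nodes, keyed by node id."""

    def __init__(self, heartbeat_timeout=30.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.workers = {}  # node_id -> timestamp of last heartbeat

    def heartbeat(self, node_id, now=None):
        """Record that a worker checked in."""
        self.workers[node_id] = now if now is not None else time.time()

    def live_workers(self, now=None):
        """Workers that have heartbeated within the timeout window."""
        now = now if now is not None else time.time()
        return [n for n, last in self.workers.items()
                if now - last <= self.heartbeat_timeout]

coord = Coordinator(heartbeat_timeout=30.0)
coord.heartbeat("worker-1", now=100.0)
coord.heartbeat("worker-2", now=80.0)
print(coord.live_workers(now=115.0))  # worker-2 missed its window
```

Multiple coordinators would then just need to share (or replicate) this registry.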
Are we sure a few metrics are enough?
I believe so; I detail my thoughts about this more in the sophisticated-scheduler question below.
How are we recording these metrics, and how are we applying them?
I have had reasonable success with using datastores like influxdb for this purpose. But I believe that will be something we need to agree on.
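For instance, worker samples could be shipped to InfluxDB as plain-text line-protocol points (the measurement and field names here are made up for illustration):

```python
def to_line_protocol(node_id, cpu, mem, disk, ts_ns):
    """Format one node sample as an InfluxDB line-protocol point:
    measurement,tag_set field_set timestamp"""
    return (f"node_metrics,node={node_id} "
            f"cpu={cpu},mem={mem},disk={disk} {ts_ns}")

line = to_line_protocol("worker-1", 0.42, 0.31, 0.67, 1500000000000000000)
print(line)
```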
How is our scheduler different than the node scheduler?
The scheduler will be outside of the worker nodes looking in, taking only a holistic view of the cluster rather than an in-depth per-process examination. Also, there will not be a node scheduler.
The flow I envision goes like this:
Essentially we do a bin-pack operation, as we have no way of knowing how hungry a job will be. There are certainly some flaws to this, but I feel that it will be a good start.
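Since job appetite is unknown up front, the simplest version of that bin-pack step just places each job on the node with the most free capacity (a greedy worst-fit sketch; the node shape and the unweighted sum are assumptions):

```python
def pick_node(nodes):
    """Pick the worker with the most free capacity. Each node is a dict
    of free-resource fractions: {'cpu': ..., 'mem': ..., 'disk': ...}.
    Free capacity is scored as a simple unweighted sum."""
    def free_score(item):
        _, free = item
        return free["cpu"] + free["mem"] + free["disk"]
    node_id, _ = max(nodes.items(), key=free_score)
    return node_id

nodes = {
    "worker-1": {"cpu": 0.2, "mem": 0.3, "disk": 0.9},  # mostly busy
    "worker-2": {"cpu": 0.8, "mem": 0.7, "disk": 0.6},  # mostly idle
}
print(pick_node(nodes))  # worker-2 has more free capacity overall
```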
Do we really need a sophisticated scheduler?
I don't believe so; my only concern is that we do our best to make sure jobs are not left queued for longer than they need to be.
In order to execute Workflows on a large scale we will need to design some sort of scheduler that will be responsible for finding the ideal "worker" node to run on. Off the top of my head it will need to take into account the following metrics: CPU, Memory, and Disk.
I also think it will be important to have an affinity towards nodes that already have the required docker images; this will allow quicker time-to-build.
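The image-affinity idea can bolt onto the same capacity scoring: add a fixed bonus (the weight here is an arbitrary assumption) for any node that already holds the job's image, so near-ties resolve toward nodes that can skip the pull:

```python
AFFINITY_BONUS = 0.5  # arbitrary weight for having the image cached

def score(free, cached_images, required_image):
    """Score = free CPU + free memory + free disk, plus a bonus when
    the required docker image is already present on the node."""
    s = free["cpu"] + free["mem"] + free["disk"]
    if required_image in cached_images:
        s += AFFINITY_BONUS
    return s

# A node with the image cached beats one with slightly more free capacity.
a = score({"cpu": 0.5, "mem": 0.5, "disk": 0.5}, {"app:v1"}, "app:v1")
b = score({"cpu": 0.6, "mem": 0.6, "disk": 0.6}, set(), "app:v1")
print(a > b)
```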
@IanSchweer I would love to have you help with this if you are interested.