KostyaSha / yet-another-docker-plugin

Jenkins Yet Another Docker Plugin
https://plugins.jenkins.io/yet-another-docker-plugin
MIT License

Load Balanced Slave Cluster #183

Open goffinf opened 7 years ago

goffinf commented 7 years ago

Before I submit the full details, I want to ask about using YADP where the Docker URL property in the Cloud specification points to a load balancer (in my case an AWS ELB) under which there are multiple docker host instances.

My observed behaviour:

  1. If I have a SINGLE instance under the ELB, slave containers launch successfully and run whatever job you ask them to.

  2. If I attach a second host to the ELB and execute a Jenkins job, multiple containers launch on BOTH hosts (and continue to do so until you abort the job).

In the Jenkins log you see lots of exception messages like this (with differing container ids) ...

Error during callback
com.github.kostyasha.yad_docker_java.com.github.dockerjava.api.exception.NotFoundException {"message":"No such container: f859ea6e2707d21bb9a2d585713fbc262629b2926f0d86453d28d76bf48fd811"}

When you abort the job you have a bunch of containers to clean up on both hosts.

So I'm not sure whether this plugin only works with single docker hosts, or whether it can be configured against a load-balanced cluster.

Obviously we want the latter so that we have a scalable and resilient Jenkins capability.

Can anyone suggest how we could configure this for multiple slaves?

Regards

Fraser.

cpoole commented 7 years ago

What type of slave launching are you using? SSH or JNLP?

Regardless, what you really want is to launch slaves on Docker Swarm... which is tracked here: https://github.com/KostyaSha/yet-another-docker-plugin/issues/54

goffinf commented 7 years ago

Hey Connor,

I'm using JNLP.

Re: swarm mode, I agree, and I have commented on that issue with this particular use case. But as I and others before me have pointed out, the unit of deployment in a swarm is a 'service', and whilst the API calls that relate specifically to individual containers should still work, those containers won't be part of the swarm, so they would suffer the same problem I am seeing here (they would work with a single node but not with more than that). So YADP really needs to support the service-based API as well.
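
To make the distinction concrete, here's a minimal sketch (purely hypothetical: the image name, URL, secret and agent name are placeholders, and YADP does not issue this call today) of what launching a JNLP slave through the service API would look like, versus the plain container API:

    # Hypothetical sketch: a JNLP slave as a swarm-mode service rather than
    # a plain container. All values below are placeholders.
    docker service create \
      --name jenkins-slave \
      --replicas 1 \
      --env JENKINS_URL=https://jenkins.example.com/ \
      jenkinsci/jnlp-slave <secret> <agent-name>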

I have to say the mood on that issue appears to have swayed away from resolution, but I am hoping that if more people +1 it or add other use cases, the maintainers might be prepared to take another look. Certainly everyone I come across in the corporate space is using one scheduler or another, so tooling should probably play nicely in that space.

It's possible that the older stand-alone swarm would work, but I'm reluctant to implement something that is essentially deprecated, especially as it lacks all of the satellite capabilities that are already embedded in the docker engine's swarm mode.

cpoole commented 7 years ago

@goffinf I'm in the same boat as you. There is a PR we've tested that adds dumb "least used" provisioning so that you can have redundant docker hosts.

https://github.com/KostyaSha/yet-another-docker-plugin/pull/167 It works, but only if every one of your machines is the same size... which at least in my case is not true.

Before this plugin can hope to move forward, it seems upstream docker-java needs to have swarm mode enabled: https://github.com/docker-java/docker-java/pull/717

The Kubernetes plugin looks promising if you've already got a k8s cluster running: https://github.com/jenkinsci/kubernetes-plugin I'm going to give that a shot and see how it performs.

goffinf commented 7 years ago

@cpoole I hadn't noticed your PR. That looks promising, but given the discussion thread, do you think it's worthwhile using your patched version, i.e. are you going to maintain it alongside improvements in the master YADP impl (what are we on, rc38 now - what version is your patch currently using)?

I hadn't fully appreciated how the multiple-Cloud strategy works. Can you provide a simple explanation, please? For example, if I have 2 Clouds, each pointing to a host instance, and I wanted particular job types to be launched on either, would I create a template of the same name on both Clouds, and would the plugin then be able to choose either based on whatever strategy it uses? I had thought, perhaps wrongly, that all template names had to be unique and that jobs targeted them explicitly in the node name of the pipeline?

BTW I agree with your point that maintaining a broken strategy in the name of backwards compatibility seems pointless.

cpoole commented 7 years ago

@goffinf I'm not the author of that PR; that would be @samrocketman.

But your description is correct: you define two clouds, each pointing to a different docker host, and within each cloud you define a template with the same label and settings. The provisioner will then schedule containers on the "cloud" that is currently least used.
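
As a concrete sketch (the 'docker-jnlp' label is hypothetical and assumes both clouds define a template carrying it), a pipeline simply targets the shared label and lets the provisioner pick the cloud:

    // Hypothetical scripted pipeline: 'docker-jnlp' labels a template defined
    // in BOTH clouds, so the least-used cloud gets the container.
    node('docker-jnlp') {
        stage('Test') {
            sh 'make test'
        }
    }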

samrocketman commented 7 years ago

...rc38 now - what version is your patch currently using)

The patch (and built binary) is based on rc38. It is safe to swap between my version and rc38 (or swap back): it doesn't change any configuration format, it only adjusts the provisioning algorithm.

The docker hosts do not all need to be the same size. As long as a scheduled build's label expression matches the labels configured in a docker template, that template will be considered for provisioning. If you have different-sized docker hosts, I recommend adjusting the provisioning limits on each host; I've had different-sized hosts and limited one to as few as a single docker container.

cpoole commented 7 years ago

@samrocketman Sure, provisioning limits are fine if your workloads are all relatively similar. But if your workloads are highly variable - large multithreaded compilation vs simple python unit tests - then limits will only tell a small portion of the story. I could easily have one container eat up more resources than a cluster of 10 containers. Unless I'm misunderstanding what you mean by "provisioning limits".

I'm not trying to belittle your work. I love it and will probably roll it out to production this week; it's just something to be aware of.

samrocketman commented 7 years ago

I didn't feel you belittled my work. I guess I was just thinking of the container cap and not really considering size in terms of the number of CPUs or amount of RAM.

I believe you can still compensate with labels. For example, let's say you have the following docker hosts and labels.
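
(hypothetical hosts, sizes, and labels, for illustration)

    dockerhost-big:   32 CPUs / 128 GB RAM - template labels: docker large
    dockerhost-small:  4 CPUs /   8 GB RAM - template labels: docker small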

Then you can use label expressions in jobs to get the right host. For example,
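
(continuing the hypothetical labels above)

    // Heavy multithreaded compilation: only the big host matches.
    node('docker && large') {
        sh 'make -j32'
    }

    // Lightweight python unit tests: pin them to the small host.
    node('docker && small') {
        sh 'pytest'
    }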

Not sure if you've thought to diversify your infrastructure with labels like that, but I've run Jenkins instances with four different operating system types on diverse hardware, some agents even physical.

cpoole commented 7 years ago

@samrocketman That's a good call. If I were starting fresh I'd probably consider it, but as of now it would be a lot of work to submit PRs to ALL of our repos. I think the dev leads would kill me :) Thanks again for the PR; let me know if you need any help testing future tweaks.

samrocketman commented 7 years ago

I think additional production testing is a good thing. There might be another tweak I'd want to make (like parallel scheduling), but in general I would be okay with it being merged as-is once I have better tests written for it.

goffinf commented 7 years ago

@samrocketman I'm going to give this a go in our non-prod environment tomorrow. As far as testing goes, what behaviours should I particularly pay attention to?

Are there any changes to the cloud or template properties, or to the job DSL, or is everything entirely transparent?

I'll feed back any observations to this thread.

Regards

Fraser.

samrocketman commented 7 years ago

Hi @goffinf, there aren't any changes as far as the logic goes. I'm going to cut a new build because I stand corrected: my build was based on rc37, not rc38. I'm going to add a new build and tag based on rc38 now.

I'll post another comment with a link when it's available.

KostyaSha commented 7 years ago

  1. I think you can pick up the hpi from jitpack if you specify the hash of the latest commit on refs/pulls/167/merge.

  2. A strategy isn't really a strategy if it isn't selectable; let's think instead about how to make the strategy an extension point. If I remember correctly, compound-slaves adds a cloud that proxies logic to other clouds, so maybe the strategy could even be a separate Cloud class that contains a list of DockerCloud classes. The only potentially failing place I see is Slave/Node getCloud() -> Cloud, used for getting the docker connection, but that could be tuned.

  3. If the current one-by-one strategy doesn't provision on cloud B for the same label when cloud A is full, then that's a bug, maybe because of the start-up speed-up changes. Needs verifying/debugging.

samrocketman commented 7 years ago

@goffinf here's the new build -> https://github.com/samrocketman/yet-another-docker-plugin/releases/tag/0.1.0-rc38-samrocketman-rc1

You can enable a logger for com.github.kostyasha.yad.DockerProvisioningStrategy to see additional debug output for the strategy being used.
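
The usual route is a new log recorder under Manage Jenkins > System Log at FINE or ALL for that class. As a rough script-console sketch using plain java.util.logging (you'd still need a recorder or handler at that level to actually see the output):

    import java.util.logging.Level
    import java.util.logging.Logger

    // Raise the level on the provisioning strategy's logger.
    def log = Logger.getLogger('com.github.kostyasha.yad.DockerProvisioningStrategy')
    log.setLevel(Level.ALL)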

goffinf commented 7 years ago

@samrocketman Thx for the updated build. I tried it out and it works exactly as you described; it's definitely an improvement on the default strategy.

This is a step in the right direction, but I still favour using the capability of a docker scheduler such as swarm mode to handle the distribution of containers across a cluster of nodes. It is so much easier to scale, and you benefit from the resilience OOTB, rather than coupling multiple Clouds and templates to their associated nodes. It moves us back towards 'cattle' deployment with simple scaling policies.

To that end I tried out the 'legacy' docker swarm with a consul cluster (I didn't need registrator for container events), and as @cpoole said, this works as you would expect, with the swarm manager distributing the slave containers across the cluster using the default spread strategy (which you can change if you want). Although it's a more complex build (even though it's relatively straightforward to automate with Terraform, Ansible or your tool of choice), it is probably a better approach since it provides a nicer separation of concerns, i.e. YADP deals with Jenkins job management and swarm with workload distribution.
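
For anyone wanting to reproduce that topology, a rough sketch (host names, addresses and ports are placeholders); YADP's Docker URL then points at the swarm manager rather than at any single engine:

    # Hypothetical sketch of the classic-swarm setup described above.
    # Manager, backed by an existing consul cluster:
    docker run -d -p 4000:4000 swarm manage -H :4000 \
      consul://consul.example.com:8500

    # On each docker host, join the cluster:
    docker run -d swarm join --advertise=<engine-ip>:2375 \
      consul://consul.example.com:8500

    # The YADP Cloud "Docker URL" then becomes:
    #   tcp://swarm-manager.example.com:4000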

@KostyaSha the sweet spot clearly is to support the swarm mode service API to simplify the build further. Have you had an opportunity to consider this further in both docker-java and YADP? I am not at all concerned about the fact that we would be ignoring service level scaling.

@cpoole How did you get on with your testing of K8s?

Regards

Fraser.

KostyaSha commented 7 years ago

the 'legacy' docker swarm

It's not legacy. It's called 'classical docker swarm', and that is what the docker devs suggest using for build infrastructure. Swarm mode is really designed for cloud-ready applications, while Jenkins and its slaves are "static" and not HA by design.

I am not at all concerned about the fact that we would be ignoring service level scaling.

I'm not ignoring it, I just don't use it. I did research on standard jenkins builds and cases and talked with docker people, and we found that, because of docker builds, the best choice is classical swarm. But I don't reject swarm mode. I guess it shouldn't be so difficult to run a slave as a service if it's just another way of running an image and all the most annoying parts of launching are already solved. Btw, nobody has answered how they plan to build images while using swarm mode. Probably in a hacky dind way, like with k8s?

For classical swarm there are some fixes in docker-java master, and the integration tests have started failing... And for YAD I would also need some swarm mode setup to test that the plugin works. Last time, in docker-java, I spent a lot of time preparing scripts :(

KostyaSha commented 7 years ago

UPD: the swarm discussion should happen here; I'm getting confused by the parallel discussions. https://github.com/KostyaSha/yet-another-docker-plugin/issues/54