codestation / swarm-updater

Automatically update Docker services whenever their image is updated
Apache License 2.0

Has this been tested using goroutines? #11

Open surdaft opened 9 months ago

surdaft commented 9 months ago

Has this been tested with goroutines to improve speed on larger swarms?

I have used Shepherd, but it turned out to be too slow because it goes one-by-one, and it looks like this project does the same. However, implementing goroutines with a wait group would be a good way to update larger swarms more quickly.

My example is that I have a swarm with 22 services, where each service runs the same image with slightly different configuration and Traefik labels. Shepherd takes about 15 minutes to update all of those services. I'm thinking this could handle it much more quickly with goroutines?

In my testing it is possible to run updates concurrently for different services.

shizunge commented 9 months ago

A question: when you say run updates concurrently, do you mean checking whether there is an update, or downloading the image and updating the service?

If the latter, can you share some data showing that it really helps?

I am asking because I have a concern: when we run updates in parallel, CPU/network/memory may become a bottleneck. I have a service with a very large image. It usually takes a couple of minutes to update, but if the host is also running some other workload, the update time can be 5x longer.

The problem with Shepherd is that it runs both `docker manifest inspect` and `docker service update`. `docker manifest inspect` is fast, but `docker service update` is slow. The way to improve is to first check whether there is a new image and, if there is none, skip the slow update command.

I have not read the swarm-updater code, so I am not sure whether it already takes this approach.

surdaft commented 8 months ago

> A question: when you say run updates concurrently, do you mean checking whether there is an update, or downloading the image and updating the service?

Downloading and updating concurrently. In my case it is a lot of lighter-weight containers, which allows me to update many of them at the same time.

> I am asking because I have a concern: when we run updates in parallel, CPU/network/memory may become a bottleneck.

In the PR I raised (#12) I did implement the ability to specify how many threads to spawn, to mitigate this kind of issue if your containers have a larger footprint.

> However if the host is also running some other workload, the update time can be 5x longer.

I have actually been running a build of this branch on a production monitoring stack for some time (doing a daily update at 7:30am each morning) and it seems to be working quite well, with Loki, Promtail and Tempo updating this morning within a few seconds. This stack has 2 nodes and distributes Loki onto the non-manager instance to avoid it taking all the RAM.

Your point about whether it is really worthwhile definitely makes me want to add performance metrics to the PR, so that it is easier for users like me to fine-tune the thread count and monitor how long everything actually takes.

shizunge commented 8 months ago

Another thought: if multiple services share the same image, they should be updated in serial instead of in parallel.

This avoids multiple services trying to download the same image simultaneously, which saves network bandwidth and disk IO.

surdaft commented 8 months ago

> Another thought: if multiple services share the same image, they should be updated in serial instead of in parallel.
>
> This avoids multiple services trying to download the same image simultaneously, which saves network bandwidth and disk IO.

That's a fair point. I'll update it to pull all the images before we do an update, so that 5 updates to the same image don't happen at the same time. Thanks for the input!

shizunge commented 8 months ago

> Another thought: if multiple services share the same image, they should be updated in serial instead of in parallel. This avoids multiple services trying to download the same image simultaneously, which saves network bandwidth and disk IO.

> That's a fair point. I'll update it to pull all the images before we do an update, so that 5 updates to the same image don't happen at the same time. Thanks for the input!

How do you pull images on multiple nodes? I am thinking of running updates in serial because not all services run on all nodes, which means not all nodes need all images.

Rush commented 8 months ago

> Another thought: if multiple services share the same image, they should be updated in serial instead of in parallel.

I actually reported to @shizunge's Gantry project that it was updating my stack way too slowly :-)

I use the same image to spawn different services; what changes is the command. So one service would have `command: server/api-server.cjs` and another would have `command: server/queue-server.cjs`. This is efficient since 99.9% of the image is dependencies, not the code itself.

In our case, updating all at once makes the most sense, as running services on diverging versions is potentially dangerous.

Rush commented 8 months ago

> That's a fair point. I'll update it to pull all the images before we do an update, so that 5 updates to the same image don't happen at the same time. Thanks for the input!

In my use case, more parallelism is better, as we're leveraging GitLab as the registry and there are no limits.

At least within a single stack, the more atomically it can be updated, the better. `docker service inspect` shows the stack name:

            "TaskTemplate": {
                "ContainerSpec": {
                    "Labels": {
                        "com.docker.stack.namespace": "<STACK_NAME>"
                    },
shizunge commented 8 months ago

> Another thought: if multiple services share the same image, they should be updated in serial instead of in parallel. This avoids multiple services trying to download the same image simultaneously, which saves network bandwidth and disk IO.

> That's a fair point. I'll update it to pull all the images before we do an update, so that 5 updates to the same image don't happen at the same time. Thanks for the input!

I have tried running multiple `docker pull` commands for the same image on the same host. It seems the docker daemon downloads the image only once, and the multiple `docker pull` invocations share the progress.

I have not tried multiple `docker service update` calls on new images yet, but I guess it is probably the same as `docker pull`, i.e. docker will not download the same image multiple times in parallel.

So I am likely overthinking this.