Closed dhiltgen closed 8 years ago
It seems we need the bypass
mode for this and similar scenarios.
@dhiltgen this does not solve your problem, but I have been facing it too.
My workaround in the meantime is to have a local registry running on the same machine as the manager. After build, I push the image to the registry first. When pulling, it's not triggering the OOM for me.
FYI, my image is ~200M, but my cluster size is 50 nodes.
Generally speaking, we should have some sort of memory and CPU cap for the Manager process that can be controlled somehow. This is especially needed because docker-machine runs the Manager on the same machine as the Agents, and we don't want the Manager to eat up all the resources (I saw that this is also a general practice: users launch a Manager alongside an Agent).
Using the registry is a good workaround for this, even though we still want `save` and `load` to work as expected without crashing the Manager daemon. This probably means queuing `save` and `load` requests and executing them in parallel as long as we stay under the CPU and memory cap.
Also, the amount of resources used by the Manager should be deducted from the available CPU and RAM in the output of `docker info` if it is running on the same machine as an Agent (merging the Manager and the Agent seems like a good option for this). IMHO the Manager should be a special case of a local scheduler: each command that goes through the Manager should account for the resources used, and we should check that the Manager process stays light enough to not hinder the overall resources available on the machine.
Weird ... `load`/`save` are streaming without buffering to RAM.
DockerClient makes in-memory copies of the image to be uploaded. We may want to change the implementation.
@jimmyxian can you please take a look ?
@vieux Sure, trying to solve recently on this issue. @dongluochen Any good advice on this?
@jimmyxian I just added a PR https://github.com/samalba/dockerclient/pull/192 . Please take a look.
@dongluochen Good point. `Load` is not streaming in dockerclient.
Also, maybe another problem is here (https://github.com/docker/swarm/blob/master/cluster/swarm/cluster.go#L511): `client --> swarm` is fast, but `swarm --> engines` is slow, so the result may be the same as in the case above. WDYT?
@jimmyxian I think that is taken care of by `io.Copy` - it stores only a buffer and won't read until the writes are done.
@jimmyxian Under stream mode, `io.Copy` is dominated by the slowest writer. The Swarm manager doesn't use more memory; it is just as slow as the slowest connection, which is fine. The real problem with streaming mode is that a broken node would fail the `load` operation, because the `io.Pipe` is closed at `io.Copy`. We may need to see whether node refreshment is adequate for such a problem.
In #1464 I didn't have any failed nodes - just fresh cluster with 2 nodes and both were fine.
@dhiltgen @nazar-pc #1494 updates dockerclient to run `LoadImage` in stream mode. Please rebuild and test whether this fixes your problem. There should not be a memory spike from the command. The side effect is that unhealthy nodes can block the `load` command. I think the right solution is to fail nodes faster.
Is there a nightly build of the Swarm image on Docker Hub to try with Docker Machine?
@nazar-pc Yes, you can use: dockerswarm/swarm:latest
@dongluochen, I confirm that both save and load work fine now.
Fixed by #1494
I'm running a 17 node cluster, and when I do a Save/Load of an image I just built on the cluster (to distribute it throughout the cluster so I can run it on any node) the manager memory usage spikes. For example, a ~550M image pushes the resident size of the manager up to 19G. When I try to run a few of these Save/Load scenarios concurrently, I can easily get the manager big enough to trigger the kernel's OOM killer and kill the manager.