I am afraid that the check here returns false: https://github.com/docker/swarm/blob/v1.2.0/scheduler/strategy/weighted_node.go#L63
if cpuScore <= 100 && memoryScore <= 100 {
but I have no idea why.
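For context, here is a rough, self-contained sketch of that scoring check, assuming hypothetical names (a paraphrase, not the actual swarm source): the requested CPU shares and memory are added to what is already reserved on the node, expressed as a percentage of the node's totals, and the node only remains a candidate when neither score exceeds 100.

```go
package main

import "fmt"

// nodeFits paraphrases the check referenced above (names are hypothetical,
// not copied from swarm): requested resources are added to the node's current
// reservations and expressed as a percentage of its totals; the node is only
// a scheduling candidate if neither score exceeds 100.
func nodeFits(usedCPUs, totalCPUs, usedMem, totalMem, reqCPUShares, reqMem int64) bool {
	cpuScore, memScore := int64(100), int64(100)
	if reqCPUShares > 0 {
		cpuScore = (usedCPUs + reqCPUShares) * 100 / totalCPUs
	}
	if reqMem > 0 {
		memScore = (usedMem + reqMem) * 100 / totalMem
	}
	// When this is false on every node, the scheduler ends up reporting
	// "no resources available to schedule container".
	return cpuScore <= 100 && memScore <= 100
}

func main() {
	gib := int64(1 << 30)
	// A node with 6 GiB of ~7.7 GiB already reserved cannot fit a 2 GiB request.
	fmt.Println(nodeFits(0, 2, 6*gib, 7*gib+7*gib/10, 0, 2*gib))
}
```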
In the function weighNodes, the nodes parameter cannot be nil. The reason is that if no nodes were accepted, the scheduler would already have returned errNoNodeAvailable = errors.New("No nodes available in the cluster"), as shown here:
https://github.com/docker/swarm/blob/v1.2.0/scheduler/scheduler.go#L51-L53
if len(accepted) == 0 {
return nil, errNoNodeAvailable
}
Alright, so after much digging with https://github.com/CodeNow/swarm/pull/1/files I found the issue. It turns out the image we were trying to create a container from only existed on one host (there was an error pushing it to the registry). That host was full, so we received "no resources available".
However, this is very misleading, because on "image not found" swarm automatically adds an image constraint instead of returning a 404: https://github.com/docker/swarm/blob/v1.2.0/cluster/swarm/cluster.go#L147
I think the error message needs to be modified for this case, where the first container create failed and swarm added an image affinity. Option 1: add a flag to disable the automatic image lookup and return a 404 "image not found". Option 2: change the error message in this case to say something like "no resources available on any engine with the specified image".
Which do you think is best?
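For what it's worth, a minimal sketch of what option 2 could look like, assuming a hypothetical helper that receives the affinities currently in effect (the names and signature here are made up, not swarm's API):

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// noResourcesError sketches option 2 (hypothetical helper, not swarm code):
// when scheduling fails, mention any affinities in effect, including the
// image affinity swarm added automatically, so the error is not misleading.
func noResourcesError(affinities []string) error {
	msg := "no resources available to schedule container"
	if len(affinities) > 0 {
		msg = fmt.Sprintf("%s (affinities in effect: %s)", msg, strings.Join(affinities, ", "))
	}
	return errors.New(msg)
}

func main() {
	// With the auto-added image affinity, the error would now explain itself.
	fmt.Println(noResourcesError([]string{"image==special"}))
}
```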
This should have been fixed in #1796 (by picking option 2). I'll try to take a look, perhaps we missed printing some constraints there.
@nishanttotla It looks like there is no log message when swarm auto-adds a hard image affinity. Repro steps:
1. Have 2 servers set up with swarm.
2. On one server, build any image with a special tag: docker build -t special .
3. On that same server (the one with the image you just built), create a container that is allocated all the memory on the box: docker run -m 100g busybox
4. Then, via swarm, try to create a container from that special image: docker run special

Here I noticed there are no logs saying swarm can't schedule because the image was not found on another server, or that scheduling was constrained to only that specific server.
Had something similar happen to me. The container was created in the swarm but returned an error: "no resources available to schedule container". But I was able to go back and do docker start container_id
and it started without any complaints. In my case the node already had the image pulled.
Edit:
Containers: 4
Running: 4
Paused: 0
Stopped: 0
Images: 2
Server Version: swarm/1.2.3
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 1
d-rethinkdb-5c25b78b-be00-4d06-ab9d-b33f51c68b39: 54.242.92.51:2376
└ ID: 7ZU3:32RV:6EQO:HM6T:4PW2:QQTE:6TQ2:YJJJ:UVUX:OZVS:MNBR:T5Y3
└ Status: Healthy
└ Containers: 4
└ Reserved CPUs: 0 / 2
└ Reserved Memory: 6 GiB / 7.669 GiB
└ Labels: executiondriver=, kernelversion=4.2.0-18-generic, operatingsystem=Ubuntu 15.10, provider=amazonec2, storagedriver=aufs
└ UpdatedAt: 2016-07-01T22:03:16Z
└ ServerVersion: 1.11.2
Plugins:
Volume:
Network:
Kernel Version: 4.2.0-18-generic
Operating System: linux
Architecture: amd64
CPUs: 2
Total Memory: 7.669 GiB
Name: b1cd5a11d445
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support
I get the same error. Since we don't pass CPU shares, the reserved CPU is shown as 0, and the host still has a lot of memory. Why does swarm return the "no resources available" error?
@shashankmjain was the Docker image you are trying to start available on the machine that has lots of memory? Also, what version of swarm are you running?
Hi, we use Swarm 1.2.5, and yes, the image is there on the host, as other containers on the host use the same image. There is also a lot of memory on the host, and it shows a lot of free memory.
@shashankmjain Can you show docker -H swarm_manager:swarm_port info, and the command that fails with "no resources available to schedule container"?
Hi @dongluochen,
I also hit the issue after upgrading Docker to 1.13.0. I tried building the latest swarm, and it didn't help. The problem only appears when the --memory option is given a value other than 0.
docker -H :3375 run --name busybox7 --memory="128m" busybox
docker: Error response from daemon: no resources available to schedule container.
See 'docker run --help'.
$ docker version
Client:
Version: 1.13.0
API version: 1.25
Go version: go1.7.3
Git commit: 49bf474
Built: Wed Jan 18 16:20:26 2017
OS/Arch: linux/amd64

Server:
Version: 1.13.0
API version: 1.25 (minimum version 1.12)
Go version: go1.7.3
Git commit: 49bf474
Built: Wed Jan 18 16:20:26 2017
OS/Arch: linux/amd64
Experimental: false
$docker images |grep swarm
swarm4dk v20170210built 4358a506ec08 20 minutes ago 319 MB
$ docker -H swarm_manager:swarm_port info
Containers: 44
Running: 2
Paused: 0
Stopped: 42
Images: 2
Server Version: swarm/1.2.5
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 1
n2m-dev1-dk: 192.168.14.112:2375
└ ID: 7GZU:RWX5:T4HR:D3XB:VVCS:7SSL:S5BX:5QF2:7UX7:UPZD:QAF3:PAWX
└ Status: Healthy
└ Containers: 44 (2 Running, 0 Paused, 42 Stopped)
└ Reserved CPUs: 0 / 4
└ Reserved Memory: 28.12 GiB / 8.186 GiB
└ Labels: kernelversion=4.7.4-1.el7.elrepo.x86_64, label_host=192.168.14.112:2375, label_node=192.168.14.112, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
└ UpdatedAt: 2017-02-10T07:25:36Z
└ ServerVersion: 1.13.0
Plugins:
Volume:
Network:
Swarm:
NodeID:
Is Manager: false
Node Address:
Kernel Version: 3.10.0-327.36.3.el7.x86_64
Operating System: linux
Architecture: amd64
CPUs: 4
Total Memory: 8.186 GiB
Name: 59c2b3a49c34
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
WARNING: No kernel memory limit support
Experimental: false
Live Restore Enabled: false
@huahouye
It looks like you ran out of RAM on your server.
└ Reserved Memory: 28.12 GiB / 8.186 GiB
Keep in mind that containers started directly on the host (bypassing swarm) with memory constraints will affect the reserved memory counts. Also, stopped containers with reserved memory still count toward memory used.
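To make that concrete, here is a tiny sketch of how a count like "Reserved Memory: 28.12 GiB / 8.186 GiB" can come about (hypothetical types, not swarm source): reservations are summed over every container placed on the node, so stopped containers keep counting until they are removed.

```go
package main

import "fmt"

// containerInfo and reservedMemory are illustrative only (not swarm types):
// the node's reserved memory is the sum of every placed container's --memory
// reservation, whether the container is running or stopped.
type containerInfo struct {
	memoryReservation int64 // bytes, from --memory
	running           bool
}

func reservedMemory(containers []containerInfo) int64 {
	var total int64
	for _, c := range containers {
		total += c.memoryReservation // stopped containers are not excluded
	}
	return total
}

func main() {
	gib := int64(1 << 30)
	// 2 running + 42 stopped containers, each reserving some memory, can
	// easily exceed the node's physical 8 GiB even if little is actually used.
	var cs []containerInfo
	for i := 0; i < 44; i++ {
		cs = append(cs, containerInfo{memoryReservation: 640 * 1 << 20, running: i < 2})
	}
	fmt.Printf("%.2f GiB reserved\n", float64(reservedMemory(cs))/float64(gib))
}
```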
Agree with @anandkumarpatel. If you delete the stopped containers you should see used memory
decrease.
└ Containers: 44 (2 Running, 0 Paused, 42 Stopped)
This is what swarm info gives. We have 2 nodes, and for some reason, despite the spread strategy, the load is just being pushed to one node, which is running out of memory.
Swarm info
{ "ID": "", "Containers": 83, "ContainersRunning": 34, "ContainersPaused": 0, "ContainersStopped": 49, "Images": 14, "Driver": "", "DriverStatus": null, "SystemStatus": [ [ "Role", "primary" ], [ "Strategy", "spread" ], [ "Filters", "health, port, dependency, affinity, constraint" ], [ "Nodes", "2" ], [ " a887b8b4-b753-4332-bc75-b8b983dd4707", "10.11.241.1:4243" ], [ " └ ID", "2O3T:VTYD:6VDO:AU5O:EIAZ:2LPX:WBWR:AHXL:LRZA:T7DD:7CCY:MNU4" ], [ " └ Status", "Healthy" ], [ " └ Containers", "18 (18 Running, 0 Paused, 0 Stopped)" ], [ " └ Reserved CPUs", "0 / 3" ], [ " └ Reserved Memory", "2.793 GiB / 11.7 GiB" ], [ " └ Labels", "kernelversion=4.4.0-53-generic, operatingsystem=Ubuntu 14.04.5 LTS, storagedriver=aufs" ], [ " └ UpdatedAt", "2017-02-10T10:28:35Z" ], [ " └ ServerVersion", "1.12.3" ], [ " ab1285a7-6e2d-4784-b169-53043fb72748", "10.11.241.0:4243" ], [ " └ ID", "MRJB:UV2H:UIDG:SO26:HPAA:TKML:KQB2:UJM4:PP66:4GAW:ZFHT:ZKFO" ], [ " └ Status", "Healthy" ], [ " └ Containers", "65 (16 Running, 0 Paused, 49 Stopped)" ], [ " └ Reserved CPUs", "0 / 3" ], [ " └ Reserved Memory", "11.28 GiB / 11.7 GiB" ], [ " └ Labels", "kernelversion=4.4.0-53-generic, operatingsystem=Ubuntu 14.04.5 LTS, storagedriver=aufs" ], [ " └ UpdatedAt", "2017-02-10T10:28:32Z" ], [ " └ ServerVersion", "1.12.3" ] ], "Plugins": { "Volume": null, "Network": null, "Authorization": null }, "MemoryLimit": true, "SwapLimit": true, "KernelMemory": false, "CpuCfsPeriod": true, "CpuCfsQuota": true, "CPUShares": true, "CPUSet": true, "IPv4Forwarding": true, "BridgeNfIptables": true, "BridgeNfIp6tables": true, "Debug": false, "NFd": 0, "OomKillDisable": true, "NGoroutines": 0, "SystemTime": "2017-02-10T10:29:10.023354827Z", "ExecutionDriver": "", "LoggingDriver": "", "CgroupDriver": "", "NEventsListener": 0, "KernelVersion": "4.4.0-53-generic", "OperatingSystem": "linux", "OSType": "", "Architecture": "amd64", "IndexServerAddress": "", "RegistryConfig": null, "NCPU": 6, "MemTotal": 25121452032, "DockerRootDir": "", "HttpProxy": "", "HttpsProxy": "", "NoProxy": "", "Name": "b24b3b46-4ebc-4bec-87ef-57386e729fa4", "Labels": null, "ExperimentalBuild": false, "ServerVersion": "swarm/1.2.4", "ClusterStore": "", "ClusterAdvertise": "", "SecurityOptions": null }
@shashankmjain did you ensure the image + tag you are trying to run is available on both machines? Even with the spread strategy, swarm will only schedule on a machine that already has the image, even if it is almost out of RAM. Only if there are no more free resources on a machine with the image will it try to schedule on a new machine and pull the image.
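To illustrate the selection order described in this comment (hypothetical code, not swarm's scheduler): nodes that already have the image are tried first, and a node without the image is only used when no image-holding node has enough free memory; if neither pass finds a node, you get the "no resources" error.

```go
package main

import "fmt"

// node and pickNode are illustrative only; they mimic the behavior described
// above, not swarm's actual scheduler code.
type node struct {
	name     string
	hasImage bool
	freeMem  int64
}

func pickNode(nodes []node, required int64) (string, bool) {
	// First pass: prefer nodes that already have the image.
	for _, n := range nodes {
		if n.hasImage && n.freeMem >= required {
			return n.name, true
		}
	}
	// Second pass: fall back to any node with room (the image would be pulled).
	for _, n := range nodes {
		if n.freeMem >= required {
			return n.name, true
		}
	}
	return "", false // -> "no resources available to schedule container"
}

func main() {
	gib := int64(1 << 30)
	nodes := []node{
		{name: "node-with-image", hasImage: true, freeMem: 0}, // has image but is full
		{name: "node-without-image", hasImage: false, freeMem: 8 * gib},
	}
	fmt.Println(pickNode(nodes, 1*gib))
}
```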
Hi, I think the image affinity only applies when it is part of the docker run; otherwise, to my knowledge, it doesn't get applied. The code in https://github.com/docker/swarm/blob/master/scheduler/filter/affinity.go checks whether an affinity is part of the ContainerConfig. In my case we don't specify any affinity for the image.
@anandkumarpatel @dongluochen thank you very much, I deleted those stopped containers and it works now.
@shashankmjain Check out my comment here: https://github.com/docker/swarm/issues/2166#issuecomment-215198184. That's the issue I was hitting: swarm auto-adds an image constraint without you knowing it! It also has a link to the build I used to debug this, in case you are not hitting the image-not-found case.
We checked, the images exist on the nodes.
@shashankmjain interesting, I'm out of ideas. Next steps would be to start swarm in debug mode and look into the logs.
We are hitting this issue in our system: we get the "no resources available to schedule container" error, but when we run swarm info right after, there are plenty of resources.
swarm debug logs:
swarm info at that time:
^^ has plenty of space for 1gb container
we use the remote API to create/start the containers, but it is the equivalent of running:
returns error
no resources available to schedule container
Any idea how this could be happening, or is there a further way to debug?
Nodes info: