docker-archive / classicswarm

Swarm Classic: a container clustering system. Not to be confused with Docker Swarm mode, which lives at https://github.com/docker/swarmkit
Apache License 2.0

incorrect "no resources available to schedule container" error #2166

Closed anandkumarpatel closed 4 years ago

anandkumarpatel commented 8 years ago

We are hitting this issue in our system: we get a "no resources available to schedule container" error, yet when we run swarm info right afterwards there are plenty of resources.

swarm debug logs:

time="2016-04-25T23:47:45Z" level=debug msg="matching constraint: group==12341234 (soft=false)"
time="2016-04-25T23:47:45Z" level=error msg="HTTP error: no resources available to schedule container" status=500

swarm info at that time:

Containers: 6433
 Running: 1208
 Paused: 0
 Stopped: 5225
Images: 4606
Server Version: swarm/1.2.0
Role: primary
Strategy: spread
Filters: health, port, dependency, affinity, constraint
Nodes: 243
 ip-10-8-192-36: 10.8.192.36:4242
  └ Status: Healthy
  └ Containers: 12
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 6.451 GiB / 8.187 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=3.13.0-79-generic, operatingsystem=Ubuntu 14.04.4 LTS, group=12341234, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-04-25T23:53:55Z
  └ ServerVersion: 1.10.2
 ip-10-8-192-111: 10.8.192.111:4242
  └ Status: Healthy
  └ Containers: 19
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 1.788 GiB / 8.187 GiB
  └ Labels: executiondriver=native-0.2, kernelversion=3.13.0-79-generic, operatingsystem=Ubuntu 14.04.4 LTS, group=12341234, storagedriver=aufs
  └ Error: (none)
  └ UpdatedAt: 2016-04-25T23:54:36Z
  └ ServerVersion: 1.10.2
...(only showing relevant nodes)
Plugins:
 Volume:
 Network:
Kernel Version: 3.13.0-48-generic
Operating System: linux
Architecture: amd64
CPUs: 486
Total Memory: 1.943 TiB
Name: 047712e54a02
Docker Root Dir:
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support

^^ plenty of space for a 1 GB container

We use the remote API to create/start the containers, but it is the equivalent of running:

docker run -m 1g -e constraint:group==12341234 -e constraint:node==~ip-10-8-192-36 busybox

which returns the error: no resources available to schedule container

Any idea how this could be happening, or how to debug this further?

Nodes info:

$ sudo docker -D version;
Client:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64

Server:
 Version:      1.10.2
 API version:  1.22
 Go version:   go1.5.3
 Git commit:   c3959b1
 Built:        Mon Feb 22 21:37:01 2016
 OS/Arch:      linux/amd64
$ sudo docker -D info
Containers: 12
 Running: 12
 Paused: 0
 Stopped: 0
Images: 14
Server Version: 1.10.2
Storage Driver: aufs
 Root Dir: /docker/aufs
 Backing Filesystem: extfs
 Dirs: 71
 Dirperm1 Supported: false
Execution Driver: native-0.2
Logging Driver: json-file
Plugins:
 Volume: local
 Network: bridge null host
Kernel Version: 3.13.0-79-generic
Operating System: Ubuntu 14.04.4 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.797 GiB
Name: ip-10-8-192-36
ID: KMMQ:FV24:I6P5:AXTM:LRXB:6TTR:BSWM:53YP:6UGA:HR5S:6RFY:MYOO
WARNING: No swap limit support
Labels:
 group=466127
$ uname -a
Linux ip-10-8-192-36 3.13.0-79-generic #123-Ubuntu SMP Fri Feb 19 14:27:58 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
allencloud commented 8 years ago

I am afraid the code here returns false: https://github.com/docker/swarm/blob/v1.2.0/scheduler/strategy/weighted_node.go#L63

    if cpuScore <= 100 && memoryScore <= 100 {

But I have no idea why.

In the function weighNodes, the nodes parameter cannot be nil: if it were, the error would instead be errNoNodeAvailable = errors.New("No nodes available in the cluster"), as below: https://github.com/docker/swarm/blob/v1.2.0/scheduler/scheduler.go#L51-L53

    if len(accepted) == 0 {
        return nil, errNoNodeAvailable
    }
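The check quoted above can be sketched as follows (a simplified illustration, not swarm's exact code from weighted_node.go): under the spread strategy, a node is kept as a scheduling candidate only when the container's request still fits inside its reservable CPU and memory.

```go
package main

import "fmt"

// canSchedule is a simplified sketch of the weighted-node check: the
// scores are the percentage of capacity that would be reserved after
// placing the container, and the node is rejected when either exceeds 100.
func canSchedule(reservedMem, requestMem, totalMem, reservedCPU, requestCPU, totalCPU int64) bool {
	cpuScore, memoryScore := int64(100), int64(100)
	if requestCPU > 0 {
		cpuScore = (reservedCPU + requestCPU) * 100 / totalCPU
	}
	if requestMem > 0 {
		memoryScore = (reservedMem + requestMem) * 100 / totalMem
	}
	// the line quoted above: keep the node only if both scores fit
	return cpuScore <= 100 && memoryScore <= 100
}

func main() {
	const mib = int64(1) << 20
	// ip-10-8-192-36 above: ~6.451 GiB reserved of ~8.187 GiB, asking for 1 GiB
	fmt.Println(canSchedule(6606*mib, 1024*mib, 8383*mib, 0, 0, 2)) // true: it fits
	// a node like the later reports: reserved memory already past capacity
	fmt.Println(canSchedule(28795*mib, 1024*mib, 8382*mib, 0, 0, 4)) // false
}
```

So by this arithmetic the node in the report should have been accepted, which is why the rejection pointed at something other than raw capacity.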
anandkumarpatel commented 8 years ago

Alright, after much digging with https://github.com/CodeNow/swarm/pull/1/files I found the issue. It turns out the image we were trying to create a container from only existed on one host (there was an error pushing it to the registry). That host was full, so we received "no resources available".

However, this is very misleading, because when the image is not found swarm automatically adds an image constraint instead of returning a 404: https://github.com/docker/swarm/blob/v1.2.0/cluster/swarm/cluster.go#L147

I think the error message needs to be modified for the case where the first container create failed and swarm added an image affinity. Option 1: add a flag to disable automatic image lookup and return a 404 image not found. Option 2: change the error message in this case to something like "no resources available on any engine with specified image".

Which do you think is best?
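The silent retry described above can be sketched like this (a hypothetical simplification; the real logic is in cluster/swarm/cluster.go, and both failure modes below are stand-ins for illustration): the image-not-found error from the first attempt never reaches the caller, so the eventual failure surfaces as a scheduling error instead of a 404.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errImageNotFound = errors.New("image not found")
	errNoResources   = errors.New("no resources available to schedule container")
)

// createOnce stands in for a single container-create attempt. Without an
// affinity the scheduler may pick a node lacking the image (image error);
// with the affinity it is pinned to the one full node (resource error).
func createOnce(affinities []string) error {
	if len(affinities) == 0 {
		return errImageNotFound
	}
	return errNoResources
}

// createContainer mirrors the silent retry: on an image-not-found error,
// an image affinity is added automatically instead of returning a 404.
func createContainer(image string) error {
	err := createOnce(nil)
	if errors.Is(err, errImageNotFound) {
		return createOnce([]string{"image==" + image})
	}
	return err
}

func main() {
	// the caller only ever sees the misleading error this issue reports
	fmt.Println(createContainer("special"))
}
```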

nishanttotla commented 8 years ago

This should have been fixed by #1796 (which picked option 2). I'll try to take a look; perhaps we missed printing some constraints there.

anandkumarpatel commented 8 years ago

@nishanttotla It looks like there is no log output when swarm automatically adds a hard image affinity. Repro steps:

Have 2 servers set up with swarm.

On one server, build any image with a special tag:

    docker build -t special .

Next, on the same server, create a container that is allocated all the memory on the box:

    docker run -m 100g busybox

Then, via swarm, try to create a container from that special image:

    docker run special

Here I noticed there are no logs saying swarm couldn't schedule because the image was not found on the other server, or that the attempt was constrained to a specific server.

zbyte64 commented 8 years ago

Had something similar happen to me. The container was created in the swarm but returned the error "no resources available to schedule container". But I was able to go back and run docker start container_id, and it started without any complaints. In my case the node already had the image pulled.

Edit:

Containers: 4
 Running: 4
 Paused: 0
 Stopped: 0
Images: 2
Server Version: swarm/1.2.3
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 1
 d-rethinkdb-5c25b78b-be00-4d06-ab9d-b33f51c68b39: 54.242.92.51:2376
  └ ID: 7ZU3:32RV:6EQO:HM6T:4PW2:QQTE:6TQ2:YJJJ:UVUX:OZVS:MNBR:T5Y3
  └ Status: Healthy
  └ Containers: 4
  └ Reserved CPUs: 0 / 2
  └ Reserved Memory: 6 GiB / 7.669 GiB
  └ Labels: executiondriver=, kernelversion=4.2.0-18-generic, operatingsystem=Ubuntu 15.10, provider=amazonec2, storagedriver=aufs
  └ UpdatedAt: 2016-07-01T22:03:16Z
  └ ServerVersion: 1.11.2
Plugins: 
 Volume: 
 Network: 
Kernel Version: 4.2.0-18-generic
Operating System: linux
Architecture: amd64
CPUs: 2
Total Memory: 7.669 GiB
Name: b1cd5a11d445
Docker Root Dir: 
Debug mode (client): false
Debug mode (server): false
WARNING: No kernel memory limit support
shashankmjain commented 7 years ago

I get the same error. Since we don't pass CPU shares, the reserved CPU is shown as 0, and the host still has a lot of memory. Why does swarm return a no-resources-available error?

anandkumarpatel commented 7 years ago

@shashankmjain Was the docker image you are trying to start available on the machine that has lots of memory? Also, what version of swarm are you running?

shashankmjain commented 7 years ago

Hi, we use Swarm 1.2.5, and yes, the image is on the host, as other containers on the host use the same image. There is also a lot of free memory on the host.

dongluochen commented 7 years ago

@shashankmjain Can you show docker -H swarm_manager:swarm_port info, and the command that fails with no resources available to schedule container?

huahouye commented 7 years ago

Hi @dongluochen,

I also hit this issue after upgrading docker to 1.13.0, and building the latest swarm didn't help. The problem only appears when the --memory option is given a value other than 0.

    docker -H :3375 run --name busybox7 --memory="128m" busybox
    docker: Error response from daemon: no resources available to schedule container.
    See 'docker run --help'.

$ docker version
Client:
 Version:      1.13.0
 API version:  1.25
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.0
 API version:  1.25 (minimum version 1.12)
 Go version:   go1.7.3
 Git commit:   49bf474
 Built:        Wed Jan 18 16:20:26 2017
 OS/Arch:      linux/amd64
 Experimental: false

$ docker images | grep swarm

swarm4dk   v20170210built   4358a506ec08   20 minutes ago   319 MB

$ docker -H swarm_manager:swarm_port info
Containers: 44
 Running: 2
 Paused: 0
 Stopped: 42
Images: 2
Server Version: swarm/1.2.5
Role: primary
Strategy: spread
Filters: health, port, containerslots, dependency, affinity, constraint
Nodes: 1
 n2m-dev1-dk: 192.168.14.112:2375
  └ ID: 7GZU:RWX5:T4HR:D3XB:VVCS:7SSL:S5BX:5QF2:7UX7:UPZD:QAF3:PAWX
  └ Status: Healthy
  └ Containers: 44 (2 Running, 0 Paused, 42 Stopped)
  └ Reserved CPUs: 0 / 4
  └ Reserved Memory: 28.12 GiB / 8.186 GiB
  └ Labels: kernelversion=4.7.4-1.el7.elrepo.x86_64, label_host=192.168.14.112:2375, label_node=192.168.14.112, operatingsystem=CentOS Linux 7 (Core), storagedriver=overlay
  └ UpdatedAt: 2017-02-10T07:25:36Z
  └ ServerVersion: 1.13.0
Plugins:
 Volume:
 Network:
Swarm:
 NodeID:
 Is Manager: false
 Node Address:
Kernel Version: 3.10.0-327.36.3.el7.x86_64
Operating System: linux
Architecture: amd64
CPUs: 4
Total Memory: 8.186 GiB
Name: 59c2b3a49c34
Docker Root Dir:
Debug Mode (client): false
Debug Mode (server): false
WARNING: No kernel memory limit support
Experimental: false
Live Restore Enabled: false

anandkumarpatel commented 7 years ago

@huahouye

It looks like you ran out of RAM on your server:

    └ Reserved Memory: 28.12 GiB / 8.186 GiB

Keep in mind that containers started directly on the host (bypassing swarm) with memory constraints will also affect the reserved memory counts.

anandkumarpatel commented 7 years ago

Also, stopped containers with reserved memory still count toward memory used.
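That accounting can be sketched as follows (an illustration, not swarm's code): the reserved-memory figure is the sum of --memory limits across every container on the node, whatever its state, which is how 42 stopped containers can pin 28 GiB of reservations on an 8 GiB host.

```go
package main

import "fmt"

// container is a toy model: a state plus the memory limit passed via -m.
type container struct {
	state    string // "running", "exited", ...
	memLimit int64  // bytes reserved via --memory
}

// reservedMemory sums limits over ALL containers; stopped ones count too,
// so reservations are only released when containers are removed.
func reservedMemory(cs []container) int64 {
	var total int64
	for _, c := range cs {
		total += c.memLimit
	}
	return total
}

func main() {
	const gib = int64(1) << 30
	cs := []container{
		{"running", 1 * gib},
		{"exited", 2 * gib},
		{"exited", 2 * gib},
	}
	fmt.Printf("reserved: %d GiB\n", reservedMemory(cs)/gib) // reserved: 5 GiB
}
```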

dongluochen commented 7 years ago

Agree with @anandkumarpatel. If you delete the stopped containers you should see used memory decrease.

└ Containers: 44 (2 Running, 0 Paused, 42 Stopped)
shashankmjain commented 7 years ago

This is what swarm info gives. We have 2 nodes, and for some reason, despite a spread strategy, the load is being pushed to just one node, which is running out of memory.

Swarm info

{
  "ID": "",
  "Containers": 83,
  "ContainersRunning": 34,
  "ContainersPaused": 0,
  "ContainersStopped": 49,
  "Images": 14,
  "Driver": "",
  "DriverStatus": null,
  "SystemStatus": [
    ["Role", "primary"],
    ["Strategy", "spread"],
    ["Filters", "health, port, dependency, affinity, constraint"],
    ["Nodes", "2"],
    [" a887b8b4-b753-4332-bc75-b8b983dd4707", "10.11.241.1:4243"],
    [" └ ID", "2O3T:VTYD:6VDO:AU5O:EIAZ:2LPX:WBWR:AHXL:LRZA:T7DD:7CCY:MNU4"],
    [" └ Status", "Healthy"],
    [" └ Containers", "18 (18 Running, 0 Paused, 0 Stopped)"],
    [" └ Reserved CPUs", "0 / 3"],
    [" └ Reserved Memory", "2.793 GiB / 11.7 GiB"],
    [" └ Labels", "kernelversion=4.4.0-53-generic, operatingsystem=Ubuntu 14.04.5 LTS, storagedriver=aufs"],
    [" └ UpdatedAt", "2017-02-10T10:28:35Z"],
    [" └ ServerVersion", "1.12.3"],
    [" ab1285a7-6e2d-4784-b169-53043fb72748", "10.11.241.0:4243"],
    [" └ ID", "MRJB:UV2H:UIDG:SO26:HPAA:TKML:KQB2:UJM4:PP66:4GAW:ZFHT:ZKFO"],
    [" └ Status", "Healthy"],
    [" └ Containers", "65 (16 Running, 0 Paused, 49 Stopped)"],
    [" └ Reserved CPUs", "0 / 3"],
    [" └ Reserved Memory", "11.28 GiB / 11.7 GiB"],
    [" └ Labels", "kernelversion=4.4.0-53-generic, operatingsystem=Ubuntu 14.04.5 LTS, storagedriver=aufs"],
    [" └ UpdatedAt", "2017-02-10T10:28:32Z"],
    [" └ ServerVersion", "1.12.3"]
  ],
  "Plugins": { "Volume": null, "Network": null, "Authorization": null },
  "MemoryLimit": true,
  "SwapLimit": true,
  "KernelMemory": false,
  "CpuCfsPeriod": true,
  "CpuCfsQuota": true,
  "CPUShares": true,
  "CPUSet": true,
  "IPv4Forwarding": true,
  "BridgeNfIptables": true,
  "BridgeNfIp6tables": true,
  "Debug": false,
  "NFd": 0,
  "OomKillDisable": true,
  "NGoroutines": 0,
  "SystemTime": "2017-02-10T10:29:10.023354827Z",
  "ExecutionDriver": "",
  "LoggingDriver": "",
  "CgroupDriver": "",
  "NEventsListener": 0,
  "KernelVersion": "4.4.0-53-generic",
  "OperatingSystem": "linux",
  "OSType": "",
  "Architecture": "amd64",
  "IndexServerAddress": "",
  "RegistryConfig": null,
  "NCPU": 6,
  "MemTotal": 25121452032,
  "DockerRootDir": "",
  "HttpProxy": "",
  "HttpsProxy": "",
  "NoProxy": "",
  "Name": "b24b3b46-4ebc-4bec-87ef-57386e729fa4",
  "Labels": null,
  "ExperimentalBuild": false,
  "ServerVersion": "swarm/1.2.4",
  "ClusterStore": "",
  "ClusterAdvertise": "",
  "SecurityOptions": null
}


anandkumarpatel commented 7 years ago

@shashankmjain Did you ensure the image + tag you are trying to run is available on both machines? Even with the spread strategy, swarm will only schedule onto a machine that already has the image, even if that machine is almost out of RAM. Only when there are no free resources left on any machine with the image will it try to schedule onto a new machine and pull the image.

shashankmjain commented 7 years ago

Hi, I think image affinity only applies when it is part of the docker run; otherwise, to my knowledge, it doesn't get applied. In https://github.com/docker/swarm/blob/master/scheduler/filter/affinity.go the code checks whether an affinity is part of the ContainerConfig. In our case we don't specify any affinity for the image.
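For reference, the user-specified path works by pulling affinity expressions such as "affinity:image==busybox" out of the container config's Env; a simplified sketch of that extraction (the real filter in scheduler/filter/ also parses operators and soft matches, and this does not cover the constraint swarm injects internally):

```go
package main

import (
	"fmt"
	"strings"
)

// extractAffinities mimics how affinity expressions passed via
// `-e affinity:...` surface in a container's environment variables.
func extractAffinities(env []string) []string {
	var out []string
	for _, e := range env {
		if strings.HasPrefix(e, "affinity:") {
			out = append(out, strings.TrimPrefix(e, "affinity:"))
		}
	}
	return out
}

func main() {
	env := []string{"PATH=/usr/bin", "affinity:image==busybox"}
	fmt.Println(extractAffinities(env)) // [image==busybox]
}
```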


huahouye commented 7 years ago

@anandkumarpatel @dongluochen Thank you very much; I deleted the stopped containers and it works now.

anandkumarpatel commented 7 years ago

@shashankmjain Check out my comment here: https://github.com/docker/swarm/issues/2166#issuecomment-215198184. That's the issue I was hitting: swarm auto-adds an image constraint without you knowing it! It also has a link to the build I used to debug this, in case you are not hitting the image-not-found case.

shashankmjain commented 7 years ago

We checked, the images exist on the nodes.


anandkumarpatel commented 7 years ago

@shashankmjain Interesting, I'm out of ideas. The next step would be to start swarm in debug mode and look into the logs.