docker / for-linux

Docker Engine for Linux
https://docs.docker.com/engine/installation/

SLUB: Unable to allocate memory on node -1 #774

Open Vesyrak opened 5 years ago

Vesyrak commented 5 years ago

Expected behavior

K8s/Docker works without a hitch on Ubuntu 16.04.

Actual behavior

When Docker containers are running on the server, dmesg shows the following errors:

[319003.331580] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[319003.331587]   cache: mnt_cache(9946:ea4c61d01895b46bf04a9b8c54602a4a6fff12ca7341b3b21f879414c120da79), object size: 384, buffer size: 384, default order: 2, min order: 0
[319003.331591]   node 0: slabs: 20, objs: 776, free: 0
[319003.331594]   node 1: slabs: 14, objs: 556, free: 0
[319940.222707] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[319940.222714]   cache: blkdev_ioc(9946:ea4c61d01895b46bf04a9b8c54602a4a6fff12ca7341b3b21f879414c120da79), object size: 104, buffer size: 104, default order: 0, min order: 0
[319940.222718]   node 0: slabs: 2, objs: 78, free: 0
[319940.222721]   node 1: slabs: 4, objs: 156, free: 0
[320001.028578] SLUB: Unable to allocate memory on node -1 (gfp=0x2080020)
[320001.028582]   cache: kmalloc-128(9946:ea4c61d01895b46bf04a9b8c54602a4a6fff12ca7341b3b21f879414c120da79), object size: 128, buffer size: 128, default order: 1, min order: 0
[320001.028585]   node 0: slabs: 19, objs: 1216, free: 0
[320001.028587]   node 1: slabs: 18, objs: 1152, free: 0
[320004.629230] SLUB: Unable to allocate memory on node -1 (gfp=0x2088020)
[320004.629236]   cache: mnt_cache(9946:ea4c61d01895b46bf04a9b8c54602a4a6fff12ca7341b3b21f879414c120da79), object size: 384, buffer size: 384, default order: 2, min order: 0
[320004.629239]   node 0: slabs: 24, objs: 912, free: 0
[320004.629241]   node 1: slabs: 18, objs: 692, free: 0

Eventually the server crashes (about 3-4 days after Docker first starts), and the last entries visible in kern.log are the SLUB errors.
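To see which caches and cgroups the failures cluster on, the `cache:` lines in the dmesg output above can be tallied. The following is only a rough sketch. It assumes the parenthesised suffix has the form `<number>:<hex string>`, as in the log above, and that the hex string identifies the cgroup (it resembles a Docker container ID, but that is a guess). The function name `summarize_slub_failures` is made up for this illustration.

```python
import re
from collections import Counter

# Matches the "cache:" detail line that follows each SLUB error in the log.
# Assumption: the part in parentheses is "<numeric id>:<hex cgroup name>".
CACHE_LINE = re.compile(
    r"cache: (?P<cache>[\w-]+)\((?P<id>\d+):(?P<cgroup>[0-9a-f]+)\)"
)


def summarize_slub_failures(log_text):
    """Count SLUB allocation failures per (cache name, cgroup name) pair."""
    counts = Counter()
    for line in log_text.splitlines():
        match = CACHE_LINE.search(line)
        if match:
            counts[(match.group("cache"), match.group("cgroup"))] += 1
    return counts
```

Feeding it the contents of kern.log (or saved `dmesg` output) and printing `counts.most_common()` would show whether one container accounts for most of the failures, as the single repeated hex string in the log above suggests.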

Related problems I found:
https://pingcap.com/blog/try-to-fix-two-linux-kernel-bugs-while-testing-tidb-operator-in-k8s/
https://github.com/opencontainers/runc/issues/1725
https://github.com/kubernetes/kubernetes/issues/61937#issuecomment-417265738

However, those issues concern CentOS, not Ubuntu. They also report blocked tasks, which our dmesg output does not show.

Steps to reproduce the behavior

Deploy a Docker + K8s + Rancher setup

Output of docker version:

root@worker07:~# docker version
Client:
 Version:           18.09.8
 API version:       1.39
 Go version:        go1.10.8
 Git commit:        0dd43dd87f
 Built:             Wed Jul 17 17:41:19 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          18.09.8
  API version:      1.39 (minimum version 1.12)
  Go version:       go1.10.8
  Git commit:       0dd43dd
  Built:            Wed Jul 17 17:07:25 2019
  OS/Arch:          linux/amd64
  Experimental:     false

Output of docker info:

Containers: 177
 Running: 95
 Paused: 0
 Stopped: 82
Images: 61
Server Version: 18.09.8
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-159-generic
Operating System: Ubuntu 16.04.6 LTS
OSType: linux
Architecture: x86_64
CPUs: 48
Total Memory: 125.8GiB
Name: worker07
ID: MQAU:C6U4:ZKSC:EBB5:PVVE:B64W:BQGK:KSUR:6CYX:K6KV:STGJ:GCS5
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 REDACTED:30002
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine

WARNING: No swap limit support

The servers are Dell PowerEdge R430s. One is configured as a master and the other as a slave; both have the problem. They are new servers on which a clean 16.04 image was installed.

Any ideas about what could be causing this would be greatly appreciated.

thaJeztah commented 5 years ago

ping @kolyshkin PTAL - could this be a bug in the Ubuntu kernel as well?

joschi commented 4 years ago

We've encountered the same problem on Ubuntu 16.04.6 LTS:

$ uname -a
Linux my-hostname 4.4.0-1096-aws #107-Ubuntu SMP Thu Oct 3 01:51:58 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
$ docker version
Client: Docker Engine - Community
 Version:           19.03.4
 API version:       1.40
 Go version:        go1.12.10
 Git commit:        9013bf583a
 Built:             Fri Oct 18 15:53:51 2019
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.4
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.10
  Git commit:       9013bf583a
  Built:            Fri Oct 18 15:52:23 2019
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.10
  GitCommit:        b34a5c8af56e510852c35414db4c1f4fa6172339
 runc:
  Version:          1.0.0-rc8+dev
  GitCommit:        3e425f80a8c931f88e6d94a8c831b9d5aa481657
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683
$ docker info
Client:
 Debug Mode: false

Server:
 Containers: 2
  Running: 2
  Paused: 0
  Stopped: 0
 Images: 2
 Server Version: 19.03.4
 Storage Driver: aufs
  Root Dir: /var/lib/docker/aufs
  Backing Filesystem: extfs
  Dirs: 32
  Dirperm1 Supported: true
 Logging Driver: json-file
 Cgroup Driver: cgroupfs
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: b34a5c8af56e510852c35414db4c1f4fa6172339
 runc version: 3e425f80a8c931f88e6d94a8c831b9d5aa481657
 init version: fec3683
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 4.4.0-1096-aws
 Operating System: Ubuntu 16.04.6 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 4
 Total Memory: 7.453GiB
 Name: my-hostname
 ID: IE6A:7VYA:YNOZ:SFG2:5KSQ:AMVB:2OMV:FXII:XS2D:XR2J:63P5:TTZA
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
  provider=amazonec2
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

WARNING: No swap limit support
WARNING: the aufs storage-driver is deprecated, and will be removed in a future release.
$ dpkg -l docker-ce
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                                                           Version                              Architecture                         Description
+++-==============================================================-====================================-====================================-=================================================================================================================================
ii  docker-ce                                                      5:19.03.4~3-0~ubuntu-xenial          amd64                                Docker: the open-source application container engine
Oct 31 03:19:56 my-hostname dockerd[1221]: time="2019-10-31T03:19:56.095504904Z" level=warning msg="failed to retrieve runc version: exit status 2"
Oct 31 03:19:56 my-hostname kernel: [222542.473969] SLUB: Unable to allocate memory on node 0 (gfp=0x2088020)
Oct 31 03:19:56 my-hostname kernel: [222542.473973]   cache: blkdev_ioc(1639:9e1903e879a86489ae8491e5ed5dd7192461efc7e7fc3ea529d833230ca4d87f), object size: 104, buffer size: 104, default order: 0, min order: 0
Oct 31 03:19:56 my-hostname kernel: [222542.473975]   node 0: slabs: 4, objs: 156, free: 0
Oct 31 03:19:56 my-hostname kernel: [222542.484236] SLUB: Unable to allocate memory on node 0 (gfp=0x2088020)
Oct 31 03:19:56 my-hostname kernel: [222542.484239]   cache: blkdev_ioc(1639:9e1903e879a86489ae8491e5ed5dd7192461efc7e7fc3ea529d833230ca4d87f), object size: 104, buffer size: 104, default order: 0, min order: 0
Oct 31 03:19:56 my-hostname kernel: [222542.484241]   node 0: slabs: 4, objs: 156, free: 0

LinuxLover9 commented 4 years ago

Also having this issue with

Docker version 19.03.8, build afacb8b7f0
chris@study:~$ docker version
Client: Docker Engine - Community
 Version:           19.03.8
 API version:       1.40
 Go version:        go1.12.17
 Git commit:        afacb8b7f0
 Built:             Wed Mar 11 01:25:58 2020
 OS/Arch:           linux/amd64
 Experimental:      false

Server: Docker Engine - Community
 Engine:
  Version:          19.03.8
  API version:      1.40 (minimum version 1.12)
  Go version:       go1.12.17
  Git commit:       afacb8b7f0
  Built:            Wed Mar 11 01:24:30 2020
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.2.13
  GitCommit:        7ad184331fa3e55e52b890ea95e65ba581ae3429
 runc:
  Version:          1.0.0-rc10
  GitCommit:        dc9208a3303feef5b3839f4323d9beb36df0a9dd
 docker-init:
  Version:          0.18.0
  GitCommit:        fec3683

on Ubuntu 16.04.6 LTS, Linux version 4.4.0-176-generic.
I guess I need to try upgrading Ubuntu? As far as I can see, not much has happened on this issue...

sorenmat commented 4 years ago

This seems to be related to kernel version 4.4.x. From what I've been able to find by googling, upgrading the kernel might help.
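A quick way to check whether a host runs a kernel from the 4.4 series mentioned here is to parse its release string. This is a minimal sketch. The helper names `kernel_series` and `is_4_4_kernel` are invented for this illustration, and the thread only suggests, without confirming, that 4.4.x kernels are the affected ones.

```python
import platform
import re


def kernel_series(release):
    """Return (major, minor) parsed from a release such as '4.4.0-159-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def is_4_4_kernel(release=None):
    """True if the given (or the running) kernel release is in the 4.4 series."""
    if release is None:
        release = platform.release()  # same value as `uname -r`
    return kernel_series(release) == (4, 4)
```

Every kernel reported in this thread (4.4.0-159-generic, 4.4.0-1096-aws, 4.4.0-176-generic, 4.4.0-131-generic) would return `True` here.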

Wood-Xia commented 2 years ago

We have a similar issue.

"... the server crashes (after about ..."

Hi @Vesyrak, may I ask what kind of crash you are seeing?

We have a similar problem:

  1. Many SLUB: Unable to allocate memory on node -1 errors in dmesg, nearly 754 in 24 hours.
  2. Many oom-killer invocations against certain processes (which restart after being killed), nearly 70 in 24 hours.
  3. EXT4-fs error (device dm-0) in ext4_truncate:3932: Out of memory is logged, followed by Aborting journal on device dm-0-8.
  4. Finally, the file system goes read-only with the message EXT4-fs (dm-0): Remounting filesystem read-only.
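Counts like those in the list above can be reproduced from a saved log with a short script. The following is only a sketch. The substrings are taken from the messages quoted in the list, real log lines may be worded differently, and the name `tally_symptoms` and the dictionary keys are invented for this illustration.

```python
from collections import Counter

# Substrings copied from the symptoms listed above (an assumption: the actual
# kernel messages on a given system may not contain exactly these strings).
SYMPTOMS = {
    "slub_alloc_failure": "SLUB: Unable to allocate memory",
    "oom_killer": "oom-killer",
    "ext4_error": "EXT4-fs error",
    "remount_read_only": "Remounting filesystem read-only",
}


def tally_symptoms(log_lines):
    """Count how many log lines contain each symptom substring."""
    counts = Counter({name: 0 for name in SYMPTOMS})
    for line in log_lines:
        for name, needle in SYMPTOMS.items():
            if needle in line:
                counts[name] += 1
    return counts
```

Passing it the lines of a kern.log slice covering 24 hours would give per-symptom totals comparable to the figures quoted above.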

The above problem has been happening on the same server for the past 3 months; we have to reboot the server and then repair the file system to recover it.

The dmesg log is attached.

Below is the setup info:

Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:17:20 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:15:30 2018
  OS/Arch:      linux/amd64
  Experimental: false

docker info
Containers: 589
 Running: 531
 Paused: 0
 Stopped: 58
Images: 773
Server Version: 18.03.1-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.4.0-131-generic
Operating System: Ubuntu 16.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 80
Total Memory: 125.3GiB
Name: supOS
ID: Y6P6:4PJV:U32M:QXUE:75ZN:6A2K:XASZ:JOBJ:QXNN:TW5B:EVZZ:757V
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 registry.supos.ai
 registry:5000
 192.168.20.20:5000
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

uname -a
Linux supOS 4.4.0-131-generic #157-Ubuntu SMP Thu Jul 12 15:51:36 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

dmesg.zip