TritonDataCenter / sdc-docker

Docker Engine for Triton
Mozilla Public License 2.0

ubuntu:latest networking doesn't seem to work #136

Open Smithx10 opened 5 years ago

Smithx10 commented 5 years ago

While attempting to use the ubuntu:latest Docker image, the following doesn't work. I believe something in the networking is broken. 16.04 does work.

[Mon 19/01/07 17:37 EST][pts/5][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2745 % td ps | grep ubuntu
cf19202b113f        ubuntu:latest         "/bin/sh"                7 days ago          Up 7 days           0.0.0.0:8080->8080/tcp                                               fervent_lamarr
90080a69a2e1        ubuntu:latest         "/bin/sh"                8 days ago          Up 8 days           0.0.0.0:8080->8080/tcp                                               angry_joliot
[Mon 19/01/07 17:37 EST][pts/5][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2745 % td exec fervent_lamarr apt-get update
Err:1 http://archive.ubuntu.com/ubuntu bionic InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:2 http://security.ubuntu.com/ubuntu bionic-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu bionic-updates InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu bionic-backports InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Reading package lists...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/bionic-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/bionic-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Smithx10 commented 5 years ago

I just tested centos:latest and yum update -y worked fine. I believe this is just something to do with the new ubuntu images.

Smithx10 commented 5 years ago

Alpine seems to work also:

[Mon 19/01/07 17:48 EST][pts/5][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2765 [130] % td run -d -it --name=alpinelatest -p 8081:8081 -m 1gb alpine:latest /bin/sh
Unable to find image 'alpine:latest' locally
latest: Pulling from alpine (req 119d4462-30a0-4cfc-8d0e-836f49d9b5cd)
cd784148e348: Pull complete
Digest: sha256:3d2e482b82608d153a374df3357c0291589a61cc194ec4a9ca2381073a17f58e
Status: Downloaded newer image for alpine:latest
64592b20829ec8879047e06f72c6da6fb5e2b460810042359088eea51e4a8e19
[Mon 19/01/07 17:49 EST][pts/5][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2766 % td ps | grep alpine
64592b20829e        alpine:latest         "/bin/sh"                28 seconds ago      Up 19 seconds       0.0.0.0:8081->8081/tcp                                               alpinelatest
[Mon 19/01/07 17:49 EST][pts/5][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2767 % td exec alpinelatest apk update
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.8/community/x86_64/APKINDEX.tar.gz
v3.8.2-13-g106f36ecbb [http://dl-cdn.alpinelinux.org/alpine/v3.8/main]
v3.8.2-8-g684f341f68 [http://dl-cdn.alpinelinux.org/alpine/v3.8/community]
OK: 9545 distinct packages available
ad-m commented 5 years ago

Could you show "ip addr" output? Could you ping any host via IP address only (exclude DNS issue)?

Smithx10 commented 5 years ago

@ad-m :( Sadly there are no net tools in the image, or ping :( . I think I can move over a binary from a different instance running 16.04 and see....

Smithx10 commented 5 years ago
zsh 2706 % ls
[Mon 19/01/07 19:30 EST][pts/6][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2724 [130] % td exec fervent_lamarr which ping
/bin/ping
[Mon 19/01/07 19:30 EST][pts/6][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2725 % td exec fervent_lamarr ping 8.8.8.8
[Mon 19/01/07 19:30 EST][pts/6][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh/2 2726 [127] %

Looks like nothing :(

Smithx10 commented 5 years ago

@jasonbking from IRC stated the following:

<jbk> This space for rent could be sendmmsg
6:56 PM IIRC some newer glibcs are using it for dns
6:56 PM (and isn't supported w/ lx yet)
mgerdts commented 5 years ago

@Smithx10 you should be able to use the native networking tools in /native/*/bin.

A while back one of the cloud-init devs was working on some changes to cloud-init that were specific to lx. It could be something with that. I've not looked at how we normally plumb up networking for lx/docker, so it is quite possible that cloud-init is always out of the picture for lx networking.

Smithx10 commented 5 years ago

@mgerdts

Yes, looks like that is working. But the default behaviour of apt-get in the container isn't working.

I'll step through the newer versions from 16.04 and see when we hit the issue.

[Mon 19/01/07 21:14 EST][pts/0][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh 2710 [25] % td exec -it fervent_lamarr /native/usr/sbin/ping google.com
google.com is alive
Smithx10 commented 5 years ago

It seems like this behaviour arrived in the docker image ubuntu:17.10

All the versions up until this ran apt-get update just fine.

[Mon 19/01/07 21:23 EST][pts/0][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
<smith@arch-nix:~>
zsh 2718 % td exec -it ubuntu1710 apt-get update -y
Err:1 http://security.ubuntu.com/ubuntu artful-security InRelease
  Temporary failure resolving 'security.ubuntu.com'
Err:2 http://archive.ubuntu.com/ubuntu artful InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:3 http://archive.ubuntu.com/ubuntu artful-updates InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Err:4 http://archive.ubuntu.com/ubuntu artful-backports InRelease
  Temporary failure resolving 'archive.ubuntu.com'
Reading package lists... Done
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/artful/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/artful-updates/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/artful-backports/InRelease  Temporary failure resolving 'archive.ubuntu.com'
W: Failed to fetch http://security.ubuntu.com/ubuntu/dists/artful-security/InRelease  Temporary failure resolving 'security.ubuntu.com'
W: Some index files failed to download. They have been ignored, or old ones used instead.
Smithx10 commented 5 years ago

So looking at this https://wiki.ubuntu.com/ArtfulAardvark/ReleaseNotes , it looks like there are a few things that stand out.

Most likely being

Network configuration
ifupdown has been deprecated in favor of netplan and is no longer present on new installs. The installer will generate a configuration file for netplan in /etc/netplan, which will set up the system to configure the network via systemd-networkd or NetworkManager. Desktop users will see their system fully managed via NetworkManager as it has been the case in previous releases, but Server users now have their network devices managed via systemd-networkd on new installs. This only applies to new installations.

Given that ifupdown is no longer installed by default, its commands will not be present: ifup and ifdown are thus unavailable, replaced by ip link set $device up and ip link set $device down.

The networkctl command is also available for users to see a summary of the network devices. networkctl status will display the current global state of IP addresses on the system; and networkctl status $device can display the details specific to a network device.

For more information about netplan, please refer to the manual page using the man 5 netplan command.

@mgerdts I don't believe sdc-docker is using cloud-init... so the following probably doesn't apply... but I will note it here for easier reference.

cloud-init
The version was updated to 17.1. Notable new features include:

Python 3.6 support
Ec2 support for IPv6 instance configuration
Expedited boot time through cloud-id optimization
Support for netplan yaml in cloud-init
Add cloud-init subcommands collect-logs, analyze and schema for developers
Apport integration from cloud-init via ‘ubuntu-bug cloud-init’
Significant unittest and integration test coverage improvements
Smithx10 commented 5 years ago

While checking the docker-init process, the interface is definitely being plumbed correctly on the illumos side; I don't see any issues. The fact that the /native/ tools can route packets means this is most likely an Ubuntu userspace issue, probably DNS if I had to guess.
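
One way to confirm that split from inside the container is to compare a lookup through the Linux resolver with a check using the native tools. This is only a sketch: `diagnose` is a hypothetical helper, and the commented probes assume `getent` is present in the image (`getent ahosts` resolves through getaddrinfo()).

```shell
#!/bin/sh
# Hypothetical helper: turn two probe results into a rough diagnosis.
#   $1 = exit status of a name lookup through the Linux (glibc) resolver
#   $2 = exit status of a reachability check with the native illumos tools
diagnose() {
    if [ "$1" -eq 0 ]; then
        echo "linux resolver ok"
    elif [ "$2" -eq 0 ]; then
        echo "network ok, linux resolver failing"
    else
        echo "network or native DNS failing"
    fi
}

# Example probes, run inside the container (assumption: getent is installed):
#   getent ahosts archive.ubuntu.com >/dev/null 2>&1; linux_rc=$?
#   /native/usr/sbin/ping archive.ubuntu.com >/dev/null 2>&1; native_rc=$?
#   diagnose "$linux_rc" "$native_rc"
```

If the native ping succeeds while the `getent ahosts` lookup fails, that points at the Linux userspace resolver rather than at the plumbing done by dockerinit.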

Log from 17.10

[root@00-0c-29-1e-ac-7c (us-east-1) /zones/851d0dc2-7fdd-ed8a-d079-fc6f33b47b63/root/var/log]# cat sdc-dockerinit.log
2019-01-08T02:23:20.226Z MDATA sdc:brand=lx
2019-01-08T02:23:20.226Z MOUNT /dev/shm (shm)
2019-01-08T02:23:20.227Z REPLACE /etc/mtab
2019-01-08T02:23:20.227Z INFO setting up networking
2019-01-08T02:23:20.227Z INFO started ipmgmtd[75784]
2019-01-08T02:23:20.234Z INFO ipmgmtd[75784] exited: 0
2019-01-08T02:23:20.235Z PLUMB lo0
2019-01-08T02:23:20.235Z RAISE[lo0] addr=127.0.0.1, netmask=255.0.0.0
2019-01-08T02:23:20.239Z MDATA sdc:nics=[{"interface":"eth0","mac":"90:b8:d0:a9:9b:66","vlan_id":2,"nic_tag":"sdc_overlay/9501526","gateway":"192.168.128.1","gateways":["192.168.128.1"],"netmask":"255.255.252.0","ip":"192.168.128.143","ips":["192.168.128.143/22"],"network_uuid":"4b609af0-4310-4177-975e-e27f353992e2","mtu":8500},{"interface":"eth1","mac":"90:b8:d0:0f:58:89","vlan_id":10,"nic_tag":"external","gateway":"10.1.10.1","gateways":["10.1.10.1"],"netmask":"255.255.255.0","ip":"10.1.10.100","ips":["10.1.10.100/24"],"network_uuid":"50c48e19-a55b-4af8-9f06-c430f96c37ed","mtu":1500,"primary":true}]
2019-01-08T02:23:20.240Z PLUMB eth0
2019-01-08T02:23:20.242Z RAISE[eth0] addr=192.168.128.143, netmask=255.255.252.0
2019-01-08T02:23:20.751Z PLUMB eth1
2019-01-08T02:23:20.753Z RAISE[eth1] addr=10.1.10.100, netmask=255.255.255.0
2019-01-08T02:23:21.260Z ROUTE[eth1] gw=10.1.10.1, dst=0.0.0.0
2019-01-08T02:23:21.260Z MDATA sdc:routes=[]
2019-01-08T02:23:21.261Z MDATA docker:noipmgmtd=true
2019-01-08T02:23:21.261Z INFO ipmgmtd PID is 75786
2019-01-08T02:23:21.261Z KILLED ipmgmtd[75786]
2019-01-08T02:23:21.277Z INFO network setup complete
2019-01-08T02:23:21.278Z INFO no metadata for 'docker:nfsvolumes'
2019-01-08T02:23:21.278Z No docker:nfsvolumes, nothing to mount
2019-01-08T02:23:21.278Z MDATA sdc:hostname=851d0dc27fdd
2019-01-08T02:23:21.278Z INFO setting hostname = '851d0dc27fdd'
2019-01-08T02:23:21.278Z INFO no metadata for 'docker:user'
2019-01-08T02:23:21.279Z INFO passwd.pw_name: root
2019-01-08T02:23:21.279Z INFO passwd.pw_uid: 0
2019-01-08T02:23:21.279Z INFO passwd.pw_gid: 0
2019-01-08T02:23:21.279Z INFO passwd.pw_dir: /root
2019-01-08T02:23:21.279Z INFO group.gr_name: root
2019-01-08T02:23:21.279Z INFO group.gr_gid: 0
2019-01-08T02:23:21.279Z INFO no metadata for 'docker:workdir'
2019-01-08T02:23:21.279Z WORKDIR '/'
2019-01-08T02:23:21.279Z MDATA docker:linkEnv=[]
2019-01-08T02:23:21.280Z MDATA docker:env=["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"]
2019-01-08T02:23:21.280Z ENV[0] TERM=xterm
2019-01-08T02:23:21.280Z ENV[1] HOME=/root
2019-01-08T02:23:21.280Z ENV[2] PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
2019-01-08T02:23:21.280Z ENV[3] HOSTNAME=851d0dc27fdd
2019-01-08T02:23:21.280Z MDATA docker:entrypoint=[]
2019-01-08T02:23:21.281Z MDATA docker:cmd=["/bin/sh"]
2019-01-08T02:23:21.281Z ARGV[0]:CMD "/bin/sh"
2019-01-08T02:23:21.281Z MDATA docker:tty=true
2019-01-08T02:23:21.281Z INFO zfd_ready() took 0 loops
2019-01-08T02:23:21.281Z MDATA docker:open_stdin=true
2019-01-08T02:23:21.281Z SWITCHING TO /dev/zfd/*
2019-01-08T02:23:21.282Z INFO open(/dev/zfd/0) SUCCESS on attempt 0
2019-01-08T02:23:21.282Z INFO open(/dev/zfd/0) SUCCESS on attempt 0
2019-01-08T02:23:21.282Z INFO open(/dev/zfd/0) SUCCESS on attempt 0
2019-01-08T02:23:21.282Z MDATA docker:logdriver=json-file
2019-01-08T02:23:21.282Z INFO logdriver json-file
2019-01-08T02:23:21.282Z INFO no metadata for 'docker:wait_for_attach'
2019-01-08T02:23:21.282Z EXECNAME "/bin/sh"
2019-01-08T02:23:21.282Z DROP PRIVS
Smithx10 commented 5 years ago

Looks like this may affect more than just Triton...

https://github.com/docker/libnetwork/issues/2068

Smithx10 commented 5 years ago

For the record.... debian:latest is working just fine. So I believe this is only an ubuntu thing.

[Mon 19/01/07 22:04 EST][pts/0][x86_64/linux-gnu/4.19.2-arch1-1-ARCH][5.6.2]
smith@arch-nix:/git/scratch/illumos-joyent/usr/src/lib/brand/lx/zone
zsh 2735 (git)-[master]-% td exec -it deblatest apt-get update -y
Ign:1 http://cdn-fastly.deb.debian.org/debian stretch InRelease
Get:2 http://cdn-fastly.deb.debian.org/debian stretch-updates InRelease [91.0 kB]
Get:3 http://security-cdn.debian.org/debian-security stretch/updates InRelease [94.3 kB]
Get:4 http://cdn-fastly.deb.debian.org/debian stretch Release [118 kB]
Get:5 http://cdn-fastly.deb.debian.org/debian stretch Release.gpg [2434 B]
Get:6 http://cdn-fastly.deb.debian.org/debian stretch-updates/main amd64 Packages [5152 B]
Get:7 http://security-cdn.debian.org/debian-security stretch/updates/main amd64 Packages [464 kB]
Get:8 http://cdn-fastly.deb.debian.org/debian stretch/main amd64 Packages [7089 kB]
Fetched 7864 kB in 2s (3590 kB/s)

Smithx10 commented 5 years ago

Don't know if this is related or unrelated but andyf in irc mentioned the following issue in omnios.

https://github.com/omniosorg/illumos-omnios/issues/331

justindthomas commented 5 years ago

Just to state the obvious (for the search engine): this issue currently applies to 18.04 as well.

twhiteman commented 5 years ago

It seems that DNS is broken inside the zone for the ubuntu tools.

For a workaround, I hard coded the apt ip addresses:

echo 91.189.88.149 security.ubuntu.com archive.ubuntu.com >> /etc/hosts

which allowed apt update to work, but I was unable to perform apt upgrade:

# apt upgrade
...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Segmentation fault (core dumped)
Segmentation fault (core dumped)
dpkg: error processing package libc-bin (--configure):
 installed libc-bin package post-installation script subprocess returned error exit status 139
Errors were encountered while processing:
 libc-bin
E: Sub-process /usr/bin/dpkg returned an error code (1)

Native DNS is working correctly inside the zone:

# /native/usr/sbin/ping www.google.com
www.google.com is alive

Even if I run ldconfig, it crashes:

# ldconfig
Segmentation fault (core dumped)

So it seems some internal library (or libraries) like libc are not working correctly - I would hazard a guess that it's due to a difference in the LX implementation for certain system call(s).

Smithx10 commented 5 years ago

Tim Classic from IRC ran into this issue on his NixOS system and suggested trying the following.

@tim:stoo.org Smithx10: Does it work if you put `options single-request` in /etc/resolv.conf?

After doing this, DNS resolution worked. This should help pinpoint the issue.
timclassic commented 5 years ago

It's been a while since I figured this out, but IIRC the problem was IPv6-related, and I think options single-request is a workaround that papers over the underlying issue by way of causing two sequential requests instead of two in parallel.
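
For anyone wanting to apply that workaround by script, here is a minimal sketch assuming a POSIX shell in the container. The function name `add_single_request` is illustrative and not part of sdc-docker, and the change may not survive if resolv.conf gets regenerated.

```shell
#!/bin/sh
# Hypothetical helper: append "options single-request" to a resolv.conf-style
# file unless an options line already carries that option.
add_single_request() {
    conf="${1:-/etc/resolv.conf}"
    # Match "single-request" as a whole option, not "single-request-reopen".
    if grep -Eq '^[[:space:]]*options[[:space:]].*single-request([[:space:]]|$)' "$conf" 2>/dev/null; then
        return 0
    fi
    # Make sure the existing content ends with a newline before appending.
    if [ -s "$conf" ] && [ -n "$(tail -c 1 "$conf")" ]; then
        printf '\n' >> "$conf"
    fi
    printf 'options single-request\n' >> "$conf"
}
```

Calling `add_single_request /etc/resolv.conf` as root inside the container adds the line once; running it again leaves the file unchanged.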

twhiteman commented 5 years ago

These changes (committed to OmniOS) may well fix this issue: https://github.com/omniosorg/illumos-omnios/pull/443

timclassic commented 5 years ago

https://github.com/omniosorg/illumos-omnios/pull/443 looks promising!

Interestingly, I recently dug through my own deployment code and found that I left myself the following comment in a resolv.conf destined for a SmartOS Docker container:

# The single-request option works around the lack of sendmmsg() syscall
# support in SmartOS's lx-brand ABI emulation--otherwise, getaddrinfo()
# would try to use it.
options single-request timeout:2 attempts:2 ndots:2

My apologies for not finding this the last time I commented here.

mgerdts commented 5 years ago

I've opened OS-7754 to track this in Jira.

https://smartos.org/bugview/OS-7754

liv3010m commented 5 years ago
# apt upgrade
...
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Segmentation fault (core dumped)
Segmentation fault (core dumped)
dpkg: error processing package libc-bin (--configure):
 installed libc-bin package post-installation script subprocess returned error exit status 139
Errors were encountered while processing:
 libc-bin
E: Sub-process /usr/bin/dpkg returned an error code (1)

Hi guys,

Is there a fix to this problem?