TritonDataCenter / illumos-joyent

Community developed and maintained version of the OS/Net consolidation
http://www.illumos.org/projects/illumos-gate

Docker containers that use s6-overlay broke? #166

Open sjorge opened 6 years ago

sjorge commented 6 years ago

Now that docker v2 images are supported, I started playing around with them. But it looks like s6-overlay based images such as plexinc/pms-docker don't work.

Steps to reproduce:

imgadm import plexinc/pms-docker:plexpass
vmadm create -f plex_docker.json

{
  "alias": "artemis-docker",
  "hostname": "artemis-docker.example.org",
  "image_uuid": "3e63d007-621c-3313-10a2-5b7eeb208abe",
  "nics": [
    {
      "nic_tag": "trunk",
      "primary": true,
      "mtu": 1500,
      "vlan_id": 10,
      "ips": [ "10.xx.xx.98/24" ],
      "gateways": [ "10.xx.xx.1" ]
    }
  ],
  "brand": "lx",
  "docker": "true",
  "kernel_version": "3.13.0",
  "max_physical_memory": 2048,
  "maintain_resolvers": true,
  "resolvers": [
    "10.xx.xx.1"
  ],
  "quota": 15,
  "internal_metadata": {
    "docker:env": "[\"HOME=/config\", \"TZ=Europe/Brussels\"]",
    "docker:entrypoint": "[\"/init\"]"
  }
}
sjorge commented 6 years ago

docker.log

[s6-init] making user provided files available at /var/run/s6/etc...
exited 0.
[s6-init] ensuring user provided files have correct perms...
exited 0.
[fix-attrs.d] applying ownership & permissions fixes...
[fix-attrs.d] done.
[cont-init.d] executing container initialization scripts...
[cont-init.d] 40-plex-first-run: executing...
Creating pref shell
Attempting to obtain server token from claim token
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100     1  100     1    0     0      2      0 --:--:-- --:--:-- --:--:--     2
Plex Media Server first run setup complete
[cont-init.d] 40-plex-first-run: exited 0.
[cont-init.d] 50-plex-update: executing...
Attempting to upgrade to: 1.12.0.4829-6de959918
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100   179  100   179    0     0    489      0 --:--:-- --:--:-- --:--:--   490
100   179  100   179    0     0    489      0 --:--:-- --:--:-- --:--:--   489
 11  103M   11 12.4M    0     0  9323k      0  0:00:11  0:00:01  0:00:10 9323k
 33  103M   33 35.2M    0     0  14.9M      0  0:00:06  0:00:02  0:00:04 22.8M
 55  103M   55 57.7M    0     0  17.1M      0  0:00:06  0:00:03  0:00:03 22.6M
 75  103M   75 78.1M    0     0  17.9M      0  0:00:05  0:00:04  0:00:01 21.9M
 96  103M   96  100M    0     0  18.7M      0  0:00:05  0:00:05 --:--:-- 22.0M
100  103M  100  103M    0     0  18.8M      0  0:00:05  0:00:05 --:--:-- 22.1M
Selecting previously unselected package plexmediaserver.
(Reading database ... 7548 files and directories currently installed.)
Preparing to unpack /tmp/plexmediaserver.deb ...
Unpacking plexmediaserver (1.12.0.4829-6de959918) ...
Setting up plexmediaserver (1.12.0.4829-6de959918) ...
##################################################################
#  NOTE: Your system does not have udev installed. Without udev  #
#        you won't be able to use DVBLogic's TVButler for DVR    #
#        or for LiveTV                                           #
#                                                                #
#        Please install udev and reinstall Plex Media Server to  #
#        to enable TV Butler support in Plex Media Server.       #
#                                                                #
#        To install udev run: sudo apt-get install udev          #
#                                                                #
##################################################################
Processing triggers for systemd (229-4ubuntu21.1) ...
[cont-init.d] 50-plex-update: exited 0.
[cont-init.d] done.
[services.d] starting services

The services never get started. When poking around inside the zone I noticed this:

root@artemis-docker:~# ps -xfv
  PID TTY      STAT   TIME  MAJFL   TRS   DRS   RSS %MEM COMMAND
    1 ?        S      0:00      0     0  7200  2476  0.1 /bin/sh /init
83446 ?        S      0:00      0     0  2900  1344  0.0 s6-svscan -t0 /var/run/s6/services
83466 ?        S      0:00      0     0  2868  1300  0.0  \_ foreground  if   /etc/s6/init/init-stage2-redirfd   foreground    if     if      s6-echo      -n      --      [s6-init] making user provided files available at /var/ru
83471 ?        S      0:00      0     0  2868  1300  0.0  |   \_ if  /etc/s6/init/init-stage2-redirfd  foreground   if    if     s6-echo     -n     --     [s6-init] making user provided files available at /var/run/s6/etc...
83472 ?        S      0:00      0     0  2868  1300  0.0  |       \_ foreground  if   if    s6-echo    -n    --    [s6-init] making user provided files available at /var/run/s6/etc...      foreground    backtick    -n    S6_RUNT
83477 ?        S      0:00      0     0  2864  1296  0.0  |           \_ if  if  -t   s6-test   -d   /var/run/s6/etc/services.d    if   s6-echo   [services.d] starting services    if   pipeline    s6-ls    -0    --    /var/run/s
83714 ?        S      0:00      0     0  2864  1296  0.0  |               \_ if  pipeline   s6-ls   -0   --   /var/run/s6/etc/services.d    forstdin  -0  -p  --  i  importas  -u  i  i  if   s6-test   -d   /var/run/s6/etc/service
83718 ?        R      1:03      0     0  2876  1304  0.0  |                   \_ forstdin -0 -p -- i importas -u i i if  s6-test  -d  /var/run/s6/etc/services.d/${i}  s6-hiercopy /var/run/s6/etc/services.d/${i} /var/run/s6/servi
83720 ?        Z      0:00      0     0     0     0  0.0  |                       \_ [s6-hiercopy] <defunct>
83719 ?        Z      0:00      0     0     0     0  0.0  |                       \_ [s6-ls] <defunct>
83467 ?        S      0:00      0     0  2868  1300  0.0  \_ s6-supervise s6-fdholderd
84124 pts/5    Ss     0:00      0     0 70440  3548  0.1 /bin/login -h zone:global -f
84133 pts/5    S      0:00      0     0 20980  3768  0.1  \_ -bash
84147 pts/5    R      0:00      0     0 28300  3084  0.1      \_ ps -xfv

For some reason s6-ls and s6-hiercopy seem to fail at boot (both end up defunct), and no services get started.
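
To make the parent/zombie relationship in a listing like the one above easier to spot, here is a minimal sketch that groups defunct processes under the parent that has not reaped them. It assumes input in the shape of `ps -eo pid,ppid,stat,comm` (a different invocation from the `ps -xfv` used above), and the PPIDs in the sample are inferred from the tree indentation rather than taken from real output.

```python
from typing import Dict, List, NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    stat: str
    comm: str


def parse_ps(text: str) -> List[Proc]:
    """Parse `ps -eo pid,ppid,stat,comm` output, skipping the header line."""
    procs = []
    for line in text.strip().splitlines()[1:]:
        pid, ppid, stat, comm = line.split(None, 3)
        procs.append(Proc(int(pid), int(ppid), stat, comm))
    return procs


def zombies_by_parent(procs: List[Proc]) -> Dict[int, List[Proc]]:
    """Group processes in state Z (defunct) by the PID of their parent."""
    grouped: Dict[int, List[Proc]] = {}
    for p in procs:
        if p.stat.startswith("Z"):
            grouped.setdefault(p.ppid, []).append(p)
    return grouped


# Sample shaped after the listing above; PPIDs are inferred, not captured.
SAMPLE = """\
  PID  PPID STAT COMMAND
83714 83477 S    if
83718 83714 R    forstdin
83720 83718 Z    s6-hiercopy <defunct>
83719 83718 Z    s6-ls <defunct>
"""

if __name__ == "__main__":
    for parent, kids in zombies_by_parent(parse_ps(SAMPLE)).items():
        print(parent, [k.pid for k in kids])
```

On the sample, both defunct entries land under the forstdin PID, which is the process that should have reaped them.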

twhiteman commented 6 years ago

The image (imgadm import) seems to have downloaded correctly. At a guess, it could be the network settings, since I think the plex server downloads bits on startup.

Can you zlogin into the zone and try to curl something? E.g.

curl http://www.google.com
plitc commented 6 years ago

same issue

[root@assg15-labor /zones/template]# cat 10.ADMIN-lx-docker-plex.json
{
  "brand": "lx",
  "kernel_version": "3.16.0",
  "image_uuid": "4312dc68-0c0c-b559-702d-c13ace5171b4",
  "autoboot": true,
  "alias": "ADMIN-lx-docker-plex",
  "hostname": "ADMIN-lx-docker-plex",
  "delegate_dataset": true,
  "dns_domain": "test.local",
  "resolvers": [
    "8.8.8.8",
    "8.8.4.4"
  ],
  "max_physical_memory": 4096,
  "max_swap": 4096,
  "tmpfs": 4096,
  "quota": 25,
  "cpu_cap": 100,
  "cpu_shares": 100,
  "max_lwps": 2000,
  "nics": [
    {
      "nic_tag": "admin",
      "ip": "1xx.xxx..xxx.xxx",
      "netmask": "255.255.255.0",
      "gateway": "1xx.xxx.xxx.20",
      "primary": true
    }
  ],
  "docker": "true",
  "internal_metadata": {
    "docker:entrypoint": "[\"/init\"]",
    "docker:cmd": "[\"/healthcheck.sh || exit 1\"]",
    "docker:env": "[\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\", \"TERM=xterm\", \"LANG=en_US.UTF-8\", \"LC_ALL=C.UTF-8\", \"CHANGE_CONFIG_DIR_OWNERSHIP=true\",  \"HOME=/config\"]",
    "docker:workingdir": "/data",
    "docker:workdir": "/data",
    "docker:tty": true,
    "docker:attach_stdin": true,
    "docker:attach_stdout": true,
    "docker:attach_stderr": true,
    "docker:open_stdin": true
  }
}
[root@assg15-labor /zones/template]#

root@ADMIN-lx-docker-plex:/# ps -ax
  PID TTY      STAT   TIME COMMAND
12439 pts/16   R      0:00 ps -ax
12390 ?        S      0:00 if  pipeline   s6-ls   -0   --   /var/run/s6/etc/services.d    forstdin  -0  -p  --  i  importas  -u  i  i  if   s6-test   -d   /var/run/s6/etc/services.d/${i}    s6-hiercopy  /var/run/s6/etc/serv
12192 ?        Ssl    0:00 ipmgmtd
12411 pts/16   S      0:00 -bash
12394 ?        Z      0:00 [s6-ls] <defunct>
12402 pts/16   Ss     0:00 /bin/login -h zone:global -f
12395 ?        Z      0:00 [s6-hiercopy] <defunct>
12393 ?        R      2:04 forstdin -0 -p -- i importas -u i i if  s6-test  -d  /var/run/s6/etc/services.d/${i}  s6-hiercopy /var/run/s6/etc/services.d/${i} /var/run/s6/services/${i}
12237 ?        S      0:00 if  /etc/s6/init/init-stage2-redirfd  foreground   if    if     s6-echo     -n     --     [s6-init] making user provided files available at /var/run/s6/etc...        foreground     backtick     -n
12243 ?        S      0:00 if  if  -t   s6-test   -d   /var/run/s6/etc/services.d    if   s6-echo   [services.d] starting services    if   pipeline    s6-ls    -0    --    /var/run/s6/etc/services.d      forstdin   -0   -p
    1 ?        S      0:00 s6-svscan -t0 /var/run/s6/services
12238 ?        S      0:00 foreground  if   if    s6-echo    -n    --    [s6-init] making user provided files available at /var/run/s6/etc...      foreground    backtick    -n    S6_RUNTIME_PROFILE     printcontenv     S6_R
12233 ?        S      0:00 s6-supervise s6-fdholderd
12232 ?        S      0:00 foreground  if   /etc/s6/init/init-stage2-redirfd   foreground    if     if      s6-echo      -n      --      [s6-init] making user provided files available at /var/run/s6/etc...          foregrou
root@ADMIN-lx-docker-plex:/# ./healthcheck.sh
curl: (7) Couldn't connect to server
root@ADMIN-lx-docker-plex:/# cat /healthcheck.sh
#!/bin/sh -e

TARGET=localhost
CURL_OPTS="--connect-timeout 15 --silent --show-error --fail"

curl ${CURL_OPTS} "http://${TARGET}:32400/identity" >/dev/null

root@ADMIN-lx-docker-plex:/#
sjorge commented 6 years ago

The network works fine; the plex service can be started manually, with a lot of fiddling.

twhiteman commented 6 years ago

You're right, a recent change (i.e. a newer SmartOS platform) must have broken this.

It ran fine on the 201706 platform that I used for testing, but the latest (20180323T002504Z) doesn't work correctly and shows the same issue you reported.

sjorge commented 6 years ago

I poked at them with truss but did not get anywhere, and they don't drop cores as far as I can tell. So I wasn't able to gather much more info. Maybe some dtrace could help, but I'm not good with that.

sjorge commented 6 years ago

Ok so it's not limited to just the plex docker image, I found another one that uses s6 that also has the problem, emby/embyserver:latest.

zrhutto commented 6 years ago

If the additional evidence is useful here, diginc/pi-hole (running a combination of dnsmasq and a couple of other services, also using s6) has been broken as well after a platform upgrade sometime around the new year. I’ve been too busy to diagnose the issue further since starting the daemons manually has provided a work-around, although I’d be happy to pull logs if it’d be helpful.

twhiteman commented 6 years ago

@jjelinek I did some investigation (platform bisecting), and commit b036e0fd (https://smartos.org/bugview/OS-6467) seems to be the root cause of this issue.

I tested a platform build without that change:

sdcadm platform install -C experimental 0c35c502-3a3f-498a-b734-316a6af675bd

and all works correctly again.
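
For anyone repeating the platform bisecting, the search itself can be sketched as a plain binary search. This is only an illustration: the platform names and the `is_broken` check are hypothetical stand-ins for "boot that platform and see whether the services start", and it assumes the list is ordered oldest to newest, the first entry is known good, the last is known bad, and the breakage persists once introduced.

```python
from typing import Callable, Sequence


def first_bad(platforms: Sequence[str], is_broken: Callable[[str], bool]) -> str:
    """Return the earliest platform for which is_broken() is true."""
    lo, hi = 0, len(platforms) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_broken(platforms[mid]):
            hi = mid  # breakage is at mid or earlier
        else:
            lo = mid  # breakage is after mid
    return platforms[hi]


if __name__ == "__main__":
    # Hypothetical platform labels, not real build stamps.
    builds = ["platform-1", "platform-2", "platform-3", "platform-4", "platform-5"]
    broken = {"platform-4", "platform-5"}
    print(first_bad(builds, lambda b: b in broken))
```

Each round halves the remaining range, so even a long list of platform builds needs only a handful of test boots.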

I don't understand all of what's occurring in the LX vm (in the plex init setup), but it seems the issue is related to the fork/exec process cleanup, as processes are getting stuck in a "defunct" state and the parent process is stuck waiting on the child process(es) to finish, which never occurs.

Note that forstdin seems to be stuck in the sigsuspend call:

# ptree -z 87e1160c-7a74-e576-be3a-fb644b6bd57c
22523 zsched
  22599 s6-svscan -t0 /var/run/s6/services
    22677 foreground  if   /etc/s6/init/init-stage2-redirfd   foreground    if     if
      22682 if  /etc/s6/init/init-stage2-redirfd  foreground   if    if     s6-echo     -n
        22683 foreground  if   if    s6-echo    -n    --    [s6-init] making user provided fi
          22688 if  if  -t   s6-test   -d   /var/run/s6/etc/services.d    if   s6-echo   [servi
            22850 if  pipeline   s6-ls   -0   --   /var/run/s6/etc/services.d    forstdin  -0  -p
              22853 forstdin -0 -p -- i importas -u i i if  s6-test  -d  /var/run/s6/etc/services.d
                22854 <defunct>
                22855 <defunct>
    22678 s6-supervise s6-fdholderd

# pstack 22853
22853:  forstdin -0 -p -- i importas -u i i if  s6-test  -d  /var/run/s6/etc/s
 0000000000000000 sigsuspend (7fffef10ebc0)
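
As a rough illustration of the pattern the stack above hints at, here is a sketch of a parent that forks children, waits for SIGCHLD, and then reaps. It is not the execline forstdin source; that forstdin waits this way is an assumption based only on the sigsuspend frame. `sigtimedwait` stands in for sigsuspend so the sketch times out instead of hanging, and it needs a platform that provides it (Linux does).

```python
import os
import signal


def run_and_reap(exit_codes, timeout=5.0):
    """Fork one child per exit code, wait on SIGCHLD, and reap them all.

    Returns {pid: exit_code}. Raises TimeoutError if no SIGCHLD arrives while
    children are still outstanding, which resembles the hang in this issue.
    """
    # Block SIGCHLD so it stays pending until we explicitly wait for it.
    old_mask = signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    try:
        pending = set()
        for code in exit_codes:
            pid = os.fork()
            if pid == 0:
                os._exit(code)  # child: exit at once; a zombie until reaped
            pending.add(pid)

        statuses = {}
        while pending:
            # Stand-in for sigsuspend(): sleep until SIGCHLD or the timeout.
            if signal.sigtimedwait({signal.SIGCHLD}, timeout) is None:
                raise TimeoutError("no SIGCHLD for children %s" % sorted(pending))
            # One SIGCHLD may cover several exits, so poll every pending child.
            for pid in list(pending):
                done, status = os.waitpid(pid, os.WNOHANG)
                if done:
                    pending.discard(pid)
                    statuses[pid] = os.WEXITSTATUS(status)
        return statuses
    finally:
        signal.pthread_sigmask(signal.SIG_SETMASK, old_mask)


if __name__ == "__main__":
    print(sorted(run_and_reap([0, 3, 7]).values()))
```

If the notification never reaches the parent, the children stay defunct and the parent never gets past the wait, which matches the ptree and pstack output above.
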
sjorge commented 6 years ago

OS-6898 seemed related, but a PI with that commit present has the same symptom.

sjorge commented 6 years ago

Reminder for myself to look at this again on a release PI next time I am booted on one. This problem is gone for me running on a debug build I did myself that is completely up to date with the repos as of today.