TritonDataCenter / sdc-docker

Docker Engine for Triton
Mozilla Public License 2.0

docker service in Triton Headnode docker0 Zone fails to start #101

Closed jussisallinen closed 7 years ago

jussisallinen commented 7 years ago

Headnode platform: SmartOS (build: 20161123T125110Z). A core dump is available here: core.node.90867

I ran # sdcadm post-setup docker on a fresh Triton headnode.

The global zone dmesg shows the following when running # sdcadm post-setup docker:

2016-11-24T13:12:30+00:00 headnode svc.ipfd[2997]: [ID 139457 daemon.error] smf_get_state failed for svc:/TEMP/smartdc/dockerlogger:default: entity not found
2016-11-24T13:12:30+00:00 headnode svc.ipfd[2997]: [ID 162284 daemon.error] is_correct_event failed for svc:/TEMP/smartdc/dockerlogger:default.
2016-11-24T13:12:30+00:00 headnode svc.ipfd[2997]: [ID 662829 daemon.error] Service may have incorrect IPfilter configuration

When docker tries to start in the docker0 zone, the following gets logged in the zone's dmesg:

2016-11-24T13:12:46+00:00 8d3fb6f4-d359-45af-ba8f-90acab8cdc13 nscd[16077]: [ID 131150 user.error] nss_mdns: error checking svc:/network/dns/multicast:default service timestamp
2016-11-24T13:12:46+00:00 8d3fb6f4-d359-45af-ba8f-90acab8cdc13 nscd[16077]: [ID 131150 user.error] nss_mdns: error checking svc:/network/dns/multicast:default service timestamp

STATE          STIME    FMRI
maintenance    13:13:06 svc:/smartdc/application/docker:default

[ Nov 24 09:20:34 Executing start method ("/opt/smartdc/docker/smf/method/docker start"). ]

FROM
_toss (/opt/smartdc/docker/node_modules/assert-plus/assert.js:22:5)
Function.out.(anonymous function) [as string] (/opt/smartdc/docker/node_modules/assert-plus/assert.js:122:17)
parseConstructorArguments (/opt/smartdc/docker/node_modules/verror/lib/verror.js:76:18)
new VError (/opt/smartdc/docker/node_modules/verror/lib/verror.js:153:11)
CueBallDNSResolver.state_process (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/lib/resolver.js:687:13)
CueBallDNSResolver.FSM._gotoState (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mooremachine/lib/fsm.js:273:4)
CueBallDNSResolver.FSM._gotoState (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mooremachine/lib/fsm.js:300:8)
CueBallDNSResolver.FSM._gotoState (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mooremachine/lib/fsm.js:300:8)
CueBallDNSResolver.FSM._gotoState (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mooremachine/lib/fsm.js:300:8)
CueBallDNSResolver.FSM._gotoState (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mooremachine/lib/fsm.js:300:8)
FSMStateHandle.gotoState (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mooremachine/lib/fsm.js:52:23)
EventEmitter.<anonymous> (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/lib/resolver.js:393:5)
emitTwo (events.js:87:13)
EventEmitter.emit (events.js:172:7)
onLookup (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/lib/resolver.js:926:6)
/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mname-client/lib/client.js:130:5
DnsMessage.<anonymous> (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mname-client/lib/client.js:218:3)
DnsMessage.g (events.js:260:16)
emitTwo (events.js:87:13)
DnsMessage.emit (events.js:172:7)
Socket.<anonymous> (/opt/smartdc/docker/node_modules/moray/node_modules/cueball/node_modules/mname-client/lib/sockets.js:359:7)
emitTwo (events.js:87:13)
Socket.emit (events.js:172:7)
UDP.onMessage (dgram.js:480:8)
[ Nov 24 09:20:36 Stopping because all processes in service exited. ]
[ Nov 24 09:20:36 Executing stop method (:kill). ]
[ Nov 24 09:20:36 Restarting too quickly, changing state to maintenance. ]

Here's the metadata:

{
  "uuid": "566d6bc2-327f-49ce-a64f-387321672c54",
  "name": "docker",
  "application_uuid": "81f6a36c-2871-4340-9085-85ddae0e7a3b",
  "params": {
    "billing_id": "0ae33ebc-c216-11e2-9b84-6f7e2a82bc36",
    "image_uuid": "269cfa9a-b032-11e6-a6b7-cba1698b18f4",
    "archive_on_delete": true,
    "delegate_dataset": true,
    "maintain_resolvers": true,
    "networks": [
      { "name": "admin" },
      { "name": "external", "primary": true }
    ],
    "firewall_enabled": false,
    "tags": {
      "smartdc_role": "docker",
      "smartdc_type": "core"
    }
  },
  "metadata": {
    "SERVICE_NAME": "docker",
    "SERVICE_DOMAIN": "docker.fi-espoo-.....company.com",
    "USE_TLS": true,
    "user-script": "#!/usr/bin/bash\n#\n# This Source Code Form is subject to the terms of the Mozilla Public\n# License, v. 2.0. If a copy of the MPL was not distributed with this\n# file, You can obtain one at http://mozilla.org/MPL/2.0/.\n#\n\n#\n# Copyright (c) 2014, Joyent, Inc.\n#\n\nexport PS4='[\D{%FT%TZ}] ${BASH_SOURCE}:${LINENO}: ${FUNCNAME[0]:+${FUNCNAME[0]}(): }'\n\nset -o xtrace\nset -o errexit\nset -o pipefail\n\n#\n# The presence of the /var/svc/.ran-user-script file indicates that the\n# instance has already been setup (i.e. the instance has booted previously).\n#\n# Upon first boot, run the setup.sh script if present. On all boots including\n# the first one, run the configure.sh script if present.\n#\n\nSENTINEL=/var/svc/.ran-user-script\n\nDIR=/opt/smartdc/boot\n\nif [[ ! -e ${SENTINEL} ]]; then\n if [[ -f ${DIR}/setup.sh ]]; then\n ${DIR}/setup.sh 2>&1 | tee /var/svc/setup.log\n fi\n\n touch ${SENTINEL}\nfi\n\nif [[ ! -f ${DIR}/configure.sh ]]; then\n echo \"Missing ${DIR}/configure.sh cannot configure.\"\n exit 1\nfi\n\nexec ${DIR}/configure.sh\n",
    "sapi-url": "http://10.65.0.27",
    "ENABLED_LOG_DRIVERS": "json-file,syslog,none"
  },
  "type": "vm"
}

jussisallinen commented 7 years ago

The sdc-docker image version installed by sdcadm is master-20161121T212616Z-g2c4c35. A CoaL installation had a working sdc-docker with master-20161116T014617Z-gecd3409. The update channel was dev.

jclulow commented 7 years ago

Note that the ipfilter-related dmesg log messages you see in the global zone are an unrelated, but known, issue: OS-4332.

jussisallinen commented 7 years ago

@jclulow Thanks for the info!

arekinath commented 7 years ago

@jussisallinen So, there's a 3-part story of woe here.

Part 1: sdc-docker/node-moray are currently using cueball without a "binder bootstrap". This means that they are leaking all their DNS lookups out into public DNS, and accepting responses from public DNS.

Part 2: You seem to have a CNAME for *.fi-espoo-....company.com which points at 1.company.com. This is unfortunate, as it means that public DNS is going to provide a valid response for any lookups of internal SDC names (something we strongly recommend against).

Part 3: cueball has a bug (https://github.com/joyent/node-cueball/issues/53) which means it is not handling NODATA responses on CNAME'd names properly.

All 3 parts of this combine to create your crash -- the DNS SRV request for moray.fi-espoo-...company.com leaks into public DNS, then public DNS answers with a CNAME to a name that has no SRV records (so it gets a NODATA response), and then cueball fails to interpret this response correctly. This will happen inconsistently, though, because sometimes the local binder in the DC will respond to the query more quickly than the public DNS server you have set, and that response will be taken instead.
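To make the failure mode concrete, here is a minimal sketch of how a resolver might tell a real answer apart from a NODATA response that arrives behind a CNAME. It is an illustration only and not cueball's actual code: the `Record` type, the `classify` function, and the `dc.example.com` names are all hypothetical stand-ins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical, simplified view of one record in a DNS answer section."""
    name: str
    rtype: str   # e.g. "SRV", "CNAME", "A"
    target: str


def _norm(name):
    # DNS names are case-insensitive; ignore a trailing root dot.
    return name.lower().rstrip(".")


def classify(qname, qtype, rcode, answers):
    """Return "answer", "nodata", "nxdomain" or "error" for a response."""
    if rcode == "NXDOMAIN":
        return "nxdomain"
    if rcode != "NOERROR":
        return "error"

    # Follow any CNAME chain that starts at the queried name.
    name = _norm(qname)
    seen = set()
    while name not in seen:          # the set guards against CNAME loops
        seen.add(name)
        cname = next(
            (r for r in answers
             if r.rtype == "CNAME" and _norm(r.name) == name),
            None,
        )
        if cname is None:
            break
        name = _norm(cname.target)

    # NOERROR with no record of the requested type at the end of the chain
    # is NODATA, even though the answer section is not empty.
    wanted = [r for r in answers
              if r.rtype == qtype and _norm(r.name) == name]
    return "answer" if wanted else "nodata"


# Hypothetical replies to the same SRV query from two different servers.
QNAME = "_moray._tcp.dc.example.com"
public_reply = [Record(QNAME, "CNAME", "www.example.com")]
binder_reply = [Record(QNAME, "SRV", "moray0.dc.example.com")]

print(classify(QNAME, "SRV", "NOERROR", public_reply))  # nodata
print(classify(QNAME, "SRV", "NOERROR", binder_reply))  # answer
```

Whichever reply arrives first is the one the client acts on, which is why the crash shows up only some of the time.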

So, part 3 is a bug, and I'll get on a fix for that ASAP. Part 1 is unfortunate, and I'll see if I can work with the other developers here to get that sorted out. For part 2, I really do recommend that you remove that CNAME if you can. If you absolutely need to put names under that suffix in public DNS, please make sure they don't collide with the names of SDC internal services and CNAME them specifically (not a * record).
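One way to audit for the collision described in part 2 is to check which internal service names would fall under a public wildcard record. The sketch below is a simplified illustration: the function names and the `dc.example.com` names are hypothetical, and the matching ignores the rule that a wildcard does not apply where a more specific name already exists in the zone.

```python
def _norm(name):
    # DNS names are case-insensitive; ignore a trailing root dot.
    return name.lower().rstrip(".")


def covered_by_wildcard(name, wildcard):
    """True if `name` sits under a wildcard such as '*.dc.example.com'.

    Simplified: any name with at least one extra label below the wildcard's
    parent domain counts as covered. The parent domain itself does not.
    """
    if not wildcard.startswith("*."):
        raise ValueError("expected a wildcard of the form '*.suffix'")
    suffix = _norm(wildcard[2:])
    return _norm(name).endswith("." + suffix)


def find_collisions(internal_names, wildcard):
    """List the internal names that a public wildcard record would answer for."""
    return [n for n in internal_names if covered_by_wildcard(n, wildcard)]


# Hypothetical internal service names and a hypothetical public wildcard.
internal = [
    "moray.dc.example.com",
    "docker.dc.example.com",
    "binder.other.example.com",
]
print(find_collisions(internal, "*.dc.example.com"))
```

An empty result for every public wildcard is the goal; any name that does show up is one that public DNS can answer for in place of the local binder.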

jussisallinen commented 7 years ago

@arekinath Thanks for your analysis! I also suspected that the wildcard DNS entry might be causing the havoc here, at least in part.