balena-os / balena-supervisor

Balena Supervisor: balena's agent on devices.
https://balena.io

supervisor generates invalid DNS names that break reverse DNS lookups #2077

Open jellyfish-bot opened 1 year ago

jellyfish-bot commented 1 year ago

[ramirogm] Summary of the issue discussed on a forums thread:

A container name longer than 63 chars is causing the issue. From this docker guide: "Container name configured using --name is used to discover a container within a user-defined docker network. The embedded DNS server maintains the mapping between the container name and its IP address (on the network the container is connected to)."

A container was created with the name zookeeper_5826140_2403300_72b853b44d0246fd789c89269db742d04a2e5ef9, which is 66 characters long. When pinging that container, an error shows up in journalctl:

Dec 08 20:54:22 e874ef2 balenad[1236]: time="2022-12-08T20:54:22.376867182Z" level=error msg="[resolver] error writing resolver resp, dns: bad rdata"
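
For reference, RFC 1035 limits each DNS label to 63 octets, so a 66-character container name cannot be encoded as a single label in the resolver's reverse-lookup (PTR) answer. A minimal sketch of that check, with a hypothetical helper name (not Supervisor code):

// RFC 1035: each dot-separated DNS label may be at most 63 octets long.
const MAX_DNS_LABEL_LENGTH = 63;

// Hypothetical helper: true only if every label of the name fits the limit.
function fitsDnsLabelLimit(name: string): boolean {
  return name
    .split('.')
    .every((label) => label.length > 0 && label.length <= MAX_DNS_LABEL_LENGTH);
}

// The container name becomes a single label in the resolver's answer:
fitsDnsLabelLimit('zookeeper_5826140_2403300_72b853b44d0246fd789c89269db742d04a2e5ef9'); // false (66 chars)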
jellyfish-bot commented 1 year ago

[ramirogm] This has attached https://jel.ly.fish/f9dc24eb-389d-4e7a-905f-b33d6e4c2b5e

pipex commented 1 year ago

So if I understand correctly, ping zookeeper above fails because ping does a reverse DNS lookup, which breaks if the container name is too long, even though the container name was never really used. Is that correct @cywang117 @ramirogm?

Instead of shortening the commit in the container name, which we cannot do right now, maybe we can tell the engine not to use the container name for DNS resolution? Not sure if that's an option

cywang117 commented 1 year ago

I replied to the user in that ticket above, which corresponds to https://forums.balena.io/t/352814, but I wasn't able to reproduce the DNS resolution issue when switching the base image used by zookeeper from registry.access.redhat.com/ubi8/ubi-minimal to balenalib/intel-nuc-ubuntu:bionic. Perhaps this is related to the base image itself?

pipex commented 1 year ago

It's possible that this is a behavior of a specific build or version of ping.

cywang117 commented 1 year ago

ubi-minimal doesn't even ship with ping, so I had to install it; I ended up switching to the regular ubi image and installing it with yum install iputils. The ubi image's ping version is ping utility, iputils-s20180629, and the other test services' ping versions are ping utility, iputils-s20161105.

cywang117 commented 1 year ago

Clarification -- I switched the image from ubi-minimal to ubi and installed ping, and was able to ping back and forth between all services. So it may just be a feature of ubi-minimal.

cywang117 commented 1 year ago

Closing as the issue isn't caused by the Supervisor, but by a specific base image.

cywang117 commented 1 year ago

Reopening as I can't determine quite yet whether the issue is due to the base image.

ramirogm commented 1 year ago

Hi @cywang117 @pipex

I'll try to explain in more detail what we found while working on the support thread:

The issue is related to the container name being longer than 63 chars, combined with the fact that the docker embedded DNS resolver is used. The embedded DNS resolver logs an error for invalid DNS names, which can be seen in the journalctl output.

I can reproduce the issue using balenalib/intel-nuc-ubuntu:bionic images. I've enabled support access on https://dashboard.balena-cloud.com/devices/29eba2b51334368a8be289dacc9c1b6b, which is running:

version: "2.1"

services:
  test1:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  test2:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  zookeeperXXXX64:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  zookeeperXXX63:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  zookeeperXX62:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]

Once instantiated, I get the following - note the container names:

root@29eba2b:~# balena ps
CONTAINER ID   IMAGE                                                            COMMAND                  CREATED          STATUS                    PORTS     NAMES
585d8a54ef84   dc8eeceb4a7c                                                     "tail -f /dev/null"      19 minutes ago   Up 19 minutes                       zookeeperXXX63_5840953_2408105_d8854ebebe2abb92c44a3670a8289067
69193ff9c140   dc8eeceb4a7c                                                     "tail -f /dev/null"      19 minutes ago   Up 19 minutes                       test1_5840950_2408105_d8854ebebe2abb92c44a3670a8289067
ba613f04fd15   dc8eeceb4a7c                                                     "tail -f /dev/null"      19 minutes ago   Up 19 minutes                       zookeeperXX62_5840954_2408105_d8854ebebe2abb92c44a3670a8289067
204f3c7f83d7   dc8eeceb4a7c                                                     "tail -f /dev/null"      19 minutes ago   Up 19 minutes                       test2_5840951_2408105_d8854ebebe2abb92c44a3670a8289067
5bf3e162b263   dc8eeceb4a7c                                                     "tail -f /dev/null"      19 minutes ago   Up 19 minutes                       zookeeperXXXX64_5840952_2408105_d8854ebebe2abb92c44a3670a8289067
a094162a5ac5   registry2.balena-cloud.com/v2/13623e9b65aa57a645205212c74f85ff   "/usr/src/app/entry.…"   37 minutes ago   Up 37 minutes (healthy)             balena_supervisor

where:

zookeeperXX62_5840954_2408105_d8854ebebe2abb92c44a3670a8289067      62 chars
zookeeperXXX63_5840953_2408105_d8854ebebe2abb92c44a3670a8289067     63 chars
zookeeperXXXX64_5840952_2408105_d8854ebebe2abb92c44a3670a8289067    64 chars

If I ssh into the test container and ping the three zookeeper containers, the ones whose container names are 62 or 63 characters long resolve with no error, but for the one whose container name is 64 characters long the reverse lookup fails:

root@69193ff9c140:/# ping zookeeperXX62
PING zookeeperXX62 (172.17.0.6) 56(84) bytes of data.
64 bytes from zookeeperXX62_5840954_2408105_d8854ebebe2abb92c44a3670a8289067.ae8c6ddc272547a49531149bd2dd187f_default (172.17.0.6): icmp_seq=1 ttl=64 time=0.175 ms
64 bytes from zookeeperXX62_5840954_2408105_d8854ebebe2abb92c44a3670a8289067.ae8c6ddc272547a49531149bd2dd187f_default (172.17.0.6): icmp_seq=2 ttl=64 time=0.117 ms
64 bytes from zookeeperXX62_5840954_2408105_d8854ebebe2abb92c44a3670a8289067.ae8c6ddc272547a49531149bd2dd187f_default (172.17.0.6): icmp_seq=3 ttl=64 time=0.115 ms
^C
--- zookeeperXX62 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2080ms
rtt min/avg/max/mdev = 0.115/0.135/0.175/0.030 ms

root@69193ff9c140:/# ping zookeeperXXX63
PING zookeeperXXX63 (172.17.0.4) 56(84) bytes of data.
64 bytes from zookeeperXXX63_5840953_2408105_d8854ebebe2abb92c44a3670a8289067.ae8c6ddc272547a49531149bd2dd187f_default (172.17.0.4): icmp_seq=1 ttl=64 time=0.167 ms
64 bytes from zookeeperXXX63_5840953_2408105_d8854ebebe2abb92c44a3670a8289067.ae8c6ddc272547a49531149bd2dd187f_default (172.17.0.4): icmp_seq=2 ttl=64 time=0.137 ms
^C
--- zookeeperXXX63 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1023ms
rtt min/avg/max/mdev = 0.137/0.152/0.167/0.015 ms

root@69193ff9c140:/# ping zookeeperXXXX64
PING zookeeperXXXX64 (172.17.0.3) 56(84) bytes of data.
^C64 bytes from 172.17.0.3: icmp_seq=1 ttl=64 time=0.138 ms

--- zookeeperXXXX64 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.138/0.138/0.138/0.000 ms

and if you check the last entries in journalctl, or tail it while running ping zookeeperXXXX64, you see:

Dec 12 22:41:13 29eba2b balenad[1343]: time="2022-12-12T22:41:13.242400385Z" level=error msg="[resolver] error writing resolver resp, dns: bad rdata"

A reference to this error was found by balena user hesch on https://forums.docker.com/t/reverse-dns-fails-dns-bad-rdata-if-container-name-is-larger-than-62-characters/107330

Based on that, I found the following in this docker guide: "Container name configured using --name is used to discover a container within a user-defined docker network. The embedded DNS server maintains the mapping between the container name and its IP address (on the network the container is connected to)."

From there:

1) I was able to work around the issue by renaming the container to a shorter name. See https://jel.ly.fish/f9dc24eb-389d-4e7a-905f-b33d6e4c2b5e?event=1ac9f3c9-5e17-460a-a001-3042e25df423
2) The user found that "I checked why your commit UUID is shorter than mine and discovered that commit UUIDs from git push are longer than from balena push." That explained why the issue was hard to reproduce locally: when I executed balena push, the container name was shorter than 64 characters:

4164398d9f1f425c4a96546f2ba89e66 #(balena push)
72b853b44d0246fd789c89269db742d04a2e5ef9 #(git push)

The difference is 8 chars (32 vs 40 characters).
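
As a quick illustration of why only the git push commit tips the name over the limit (this just recomputes the lengths of the names above; serviceName_imageId_releaseId_commit is the format visible in the balena ps output):

// Container names above follow serviceName_imageId_releaseId_commit.
const containerName = (service: string, imageId: string, releaseId: string, commit: string) =>
  `${service}_${imageId}_${releaseId}_${commit}`;

// balena push commit (32 chars): total 58, within the 63-octet DNS label limit
containerName('zookeeper', '5826140', '2403300', '4164398d9f1f425c4a96546f2ba89e66').length; // 58

// git push commit (40 chars): total 66, over the limit
containerName('zookeeper', '5826140', '2403300', '72b853b44d0246fd789c89269db742d04a2e5ef9').length; // 66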

@cywang117 In your docker-compose, I think that what's happening is that if you balena push it, the container name will be shorter than 64 characters, same as what happened on my setup. Can you do a balena ps on your test device? If you make the service name longer, say from 'zookeeper' to 'zookeeper123456789', what happens?

hs-neax commented 1 year ago

Hi, I'm hesch from the support thread in the Balena forums. I completely support what @ramirogm wrote: it's not related to the ubi8 image. That was just a coincidence that led me to the wrong assumption at the beginning (the ubi8 containers happened to have longer container names).

@pipex I understand that limiting the container name to 63 chars would randomly cut off parts that you might need in other places in the code. From a short glance at the code, I assume you use the commit id to distinguish different container commit versions. For that, I'd suppose that a shortened commit ID should still provide safe collision avoidance. Something like this: ${serviceNamePart}_${imageIdPart}_${releaseIdPart}_${commitPrefix.substring(0, 8)} That does not prevent the issue in all cases, but it makes it much less likely to appear. Just an idea that came to my mind without in-depth knowledge of the supervisor code. Changing what the Docker/Balena DNS resolves so that it doesn't use the container name but only the alias sounds good to me, but I don't know about the ease of implementation and side effects.
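
A runnable version of that template, as a sketch (the parameter types are guesses and the function name is made up; this is not the Supervisor's actual naming code):

// Sketch of the proposed scheme: keep service name, imageId and releaseId,
// but truncate the commit to its first 8 characters.
function shortContainerName(
  serviceName: string,
  imageId: number,
  releaseId: number,
  commit: string,
): string {
  return `${serviceName}_${imageId}_${releaseId}_${commit.substring(0, 8)}`;
}

shortContainerName('zookeeper', 5826140, 2403300, '72b853b44d0246fd789c89269db742d04a2e5ef9');
// -> 'zookeeper_5826140_2403300_72b853b4' (34 chars, well under the 63-octet limit)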

pipex commented 1 year ago

For that I'd suppose that also a shortened commit ID should provide safe collision avoidance. Something like that:

@hs-neax unfortunately it is not just collision avoidance in the supervisor code. The release uuid / commit is an identifier used when reporting information to the API, which means that we would also need to modify the API to accept the shortened commit. Removing the imageId and releaseId is on our roadmap, which would also solve the issue, and although that also requires API-side changes, it has a bunch of other benefits that make the change more interesting.

Changing what the Docker/Balena DNS resolves so that it doesn't use the container name but only the alias sounds good to me, but I don't know about the ease of implementation and side effects.

Yeah, I took a quick glance and I haven't found a way to do it via docker options, so that might be a dead end

cywang117 commented 1 year ago

@ramirogm From that guide, in the table under the --name row, there's the following for aliases:

In addition to --name as described above, a container is discovered by one or more of its configured --network-alias (or --alias in docker network connect command) within the user-defined network. The embedded DNS server maintains the mapping between all of the container aliases and its IP address on a specific user-defined network. A container can have different aliases in different networks by using the --alias option in docker network connect command.

As services are created with aliases too, it sounds like this would work as well.

I modified the docker-compose above to run with the Supervisor in local mode, where container names are suffixed with _1_1_localrelease or similar:

version: "2.1"

services:
  test1:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  test2:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  zookeeperXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX64:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  zookeeperXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX63:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  zookeeperXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX62:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]

When pinging the longest name from test2, the balena journal logs do indeed show error writing resolver resp, dns: bad rdata, but after I waited a bit longer, pings started going through. It looks like it takes some time:

root@nuccc:~# balena exec test2_2_1_localrelease ping zookeeperXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX64
PING zookeeperXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX64 (172.22.0.4) 56(84) bytes of data.
64 bytes from 172.22.0.4: icmp_seq=1 ttl=64 time=0.043 ms
64 bytes from 172.22.0.4: icmp_seq=2 ttl=64 time=0.136 ms
64 bytes from 172.22.0.4: icmp_seq=3 ttl=64 time=0.081 ms

Try waiting for a bit after pinging the service with the long name, and see if the behavior is the same for you.

I closed this issue before because I didn't think it was related to container names, and I still think it's not, because according to the docker guide aliases are also known to the DNS resolver. Perhaps it's the order of resolution that causes the error logs in the journal -- when pinging by service name, the resolver will always first resolve by full container name, and that's where the error appears. Thoughts?

ramirogm commented 1 year ago

Hi @cywang117

From what we found, what fails is the reverse name lookup. In the following example, I ran ping, nslookup and dig.

All three fail to perform the reverse DNS lookup, with the error showing up in the journalctl output. But they fail in different ways: ping gives up and prints the IP address, just as if you had run ping -n. dig and nslookup time out and fail, since they can't really provide an answer.

You can compare a successful ping, where the response prints the hostname, with a failed one that just prints the IP address.

root@e404e104187f:/# nslookup -x 172.17.0.3
;; communications error to 127.0.0.11#53: end of file
;; communications error to 127.0.0.11#53: end of file

root@e404e104187f:/# ping zookeeperXXXX64
PING zookeeperXXXX64 (172.17.0.3) 56(84) bytes of data.
64 bytes from 172.17.0.3: icmp_seq=1 ttl=64 time=0.097 ms
[....]
64 bytes from 172.17.0.3: icmp_seq=15 ttl=64 time=0.127 ms
^C
--- zookeeperXXXX64 ping statistics ---
15 packets transmitted, 15 received, 0% packet loss, time 43330ms
rtt min/avg/max/mdev = 0.063/0.123/0.158/0.021 ms
root@e404e104187f:/# 

root@e404e104187f:/# dig -x 172.17.0.3

; <<>> DiG 9.11.3-1ubuntu1.18-Ubuntu <<>> -x 172.17.0.3
;; global options: +cmd
;; connection timed out; no servers could be reached

Dec 13 17:54:22 29eba2b balenad[1343]: time="2022-12-13T17:54:22.739330603Z" level=error msg="[resolver] error writing resolver resp, dns: bad rdata"
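
The same failure can also be reproduced programmatically from inside a container on the app network; here is a minimal sketch using Node's dns module (assuming Node.js is available in the service image, with the IPs taken from the ping output above):

import { promises as dns } from 'dns';

// Inside the container, queries go to Docker's embedded resolver at 127.0.0.11.
async function reverseLookup(ip: string): Promise<void> {
  try {
    const names = await dns.reverse(ip); // PTR query
    console.log(ip, '->', names);
  } catch (err) {
    // For the 64-char container name the resolver logs "bad rdata" and never
    // returns a usable answer, so this rejects (typically with a timeout).
    console.error(ip, 'reverse lookup failed:', err);
  }
}

reverseLookup('172.17.0.6'); // zookeeperXX62...: 62-char container name, resolves
reverseLookup('172.17.0.3'); // zookeeperXXXX64...: 64-char container name, fails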
cywang117 commented 1 year ago

I see, thanks for clarifying @ramirogm. @pipex @kb2ma, thinking of solutions here: could we perhaps name a service's container by its docker-compose service name, and alias it with the full serviceName_imageId_releaseId_commit, so that the short service name gets used when resolving? This also has the benefit of making it easier to balena exec when debugging Supervisor-managed containers.
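
A rough sketch of what that could look like at container-creation time, using dockerode and the Docker Engine API's NetworkingConfig fields (names taken from the repro above; this is an illustration, not the Supervisor's actual code path):

import Docker from 'dockerode';

const docker = new Docker();

// Create the container under a short, DNS-friendly name and attach the long
// serviceName_imageId_releaseId_commit string only as an alias on the
// user-defined app network (aliases are not supported on the default bridge).
async function createZookeeperContainer() {
  return docker.createContainer({
    name: 'zookeeperXXXX64', // short name, used by the embedded DNS for reverse lookups
    Image: 'balenalib/intel-nuc-ubuntu:bionic',
    Entrypoint: ['tail', '-f', '/dev/null'],
    NetworkingConfig: {
      EndpointsConfig: {
        // app network name as seen in the ping output above
        'ae8c6ddc272547a49531149bd2dd187f_default': {
          Aliases: ['zookeeperXXXX64_5840952_2408105_d8854ebebe2abb92c44a3670a8289067'],
        },
      },
    },
  });
}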

pipex commented 1 year ago

We use the container name to figure out the current state of the app install; getting that info from the aliases might not be as easy. Also, the full container name avoids collisions when moving between releases, where you can have an old and a new copy of the service. For multi-app, we will definitely want to avoid collisions between similarly named services in different apps.

I took a quick look at removing imageId and serviceId from the container name. Removing the imageId could be done without server-side changes, as it is really not used by the backend; the main limitation is that we return the imageId in some Supervisor API methods. We might be able to circumvent that if we get the imageId for a given service from the target state of the app in the database, but if the database gets corrupted/deleted we won't have that information to return. If the target state is a new release we won't have any information on the imageId either.

The serviceId, however, cannot be removed without server-side changes; we would need to add a serviceName column in redis and modify the UI to use the new serviceName field.

These are not small changes, but it would be great to get rid of these database-dependent ids, and it would help with this issue here.

pipex commented 1 year ago

That being said, I also like the idea of compressing release uuids. We would need to modify the target state patch to work with the shortened UUIDs, but it would save a few bytes of bandwidth per PATCH as well, which is nice

pipex commented 1 year ago

One more thing. For any changes to the container name, we also need to modify the way services are accessed via the web terminal

hs-neax commented 1 year ago

@pipex Thank you very much for the detailed feedback. Since there's apparently no quick fix available, I tested some workarounds. Our software stack uses Kafka, which heavily depends on network communication between containers, so this bug actually bricks our stack completely. The two production-grade workarounds I came up with are:

Use balena push: Potentially the easiest option. I'm not sure if mixing git push and balena push releases in one fleet is advisable, but it seems to work. The downside for us is losing the enforced connection to our git repo (with git push you can easily look up the git commit for an old release).

Use short container names and network aliases: We can get around the issue by using short container names, obviously. To avoid changing the names everywhere in the code (they should be an env var, but who knows), I tested network aliases. Here comes the confusing part for me. The documentation states:

networks | Only support specifying network names

However, in this thread a merged PR is mentioned that implements network aliases. I tested it, and it seems like the following docker-compose.yml does work:

version: "2.4"

services:
  test1:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  test2:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
  test3:
    image: balenalib/intel-nuc-ubuntu:bionic
    entrypoint: ["tail", "-f", "/dev/null"]
    networks:
      default:
        aliases:
          - test3_long_name

Seems like the documentation needs to be updated here. Can someone confirm that?