coreos / fedora-coreos-tracker

Issue tracker for Fedora CoreOS
https://fedoraproject.org/coreos/

Docker Swarm Overlay Networking Issue on vSphere (New Swarms since 36.20221030.3.0) #1372

Open fifofonix opened 1 year ago

fifofonix commented 1 year ago

Describe the bug

Containers running as part of a service cannot communicate with each other across nodes on an overlay network on newly commissioned docker swarms on vSphere using an FCOS 37+ image.

This issue does not affect nodes provisioned on AWS.

This issue does not affect newly created FCOS 36 vSphere-based nodes, nor FCOS 36 nodes that have auto-upgraded to FCOS 37. This remains true even if the swarm is entirely destroyed and re-created.

Reproduction steps

  1. Create two new FCOS 37 nodes on vSphere with minimal Ignition (in our environment this does mean corporate proxies defined for zincati/docker/rpm-ostree).
  2. Create a two-node docker swarm (node 1: docker swarm init; node 2: run the docker swarm join command printed by docker swarm init).
  3. From node 1, create an overlay network: docker network create -d overlay test_network
  4. Create an nginx service on this network: docker service create --network test_network --replicas 4 nginx
  5. This should create two running nginx containers on each of the two nodes.
  6. From one container on node 1, confirm you can curl the other container on node 1 and see the nginx welcome text: docker exec -it <node1-container-1-id> curl <node1-container-2-id>. Succeeds.
  7. From one container on node 1, attempt to curl a container on node 2 to see the nginx welcome text: docker exec -it <node1-container-1-id> curl <node2-container-1-id>. Hangs and then times out. Attempting from node 2 to node 1 also fails. (A condensed shell sketch of these steps follows below.)
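For convenience, here is a condensed (untested) shell sketch of the steps above; the worker token, manager IP, and container IDs are placeholders:

# On node 1: initialise the swarm and create the overlay network
docker swarm init
docker network create -d overlay test_network

# On node 2: join using the command printed by 'docker swarm init'
docker swarm join --token <worker-token> <node1-ip>:2377

# On node 1: create the nginx service (4 replicas, so 2 containers per node)
docker service create --network test_network --replicas 4 nginx

# Same-node curl succeeds; cross-node curl hangs on the affected images
docker exec -it <node1-container-1-id> curl <node1-container-2-id>
docker exec -it <node1-container-1-id> curl <node2-container-1-id>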

Expected behavior

In step 7 we expect the same result as that from step 6.

We are deploying via terraform. If we change one thing and specify an F36 OVA, e.g. 36.20220906, then all the steps above succeed.

Furthermore, if the nodes are allowed to upgrade to the latest F37, then all tests continue to succeed. Deleting the swarm and re-creating it also yields successful steps 1-7.

Actual behavior

Step 7 hangs, eventually timing out.

Note that DNS resolution is fine: the container ID is resolved to the correct IP address on the other node. Installing traceroute in the containers shows that it is not possible to find a route.

System details

Ignition config

This ignition has been manually edited with some values expunged.

{
    "ignition": {
        "config": {
            "replace": {
                "verification": {}
            }
        },
        "proxy": {},
        "security": {
            "tls": {}
        },
        "timeouts": {},
        "version": "3.3.0"
    },
    "kernelArguments": {},
    "passwd": {
        "users": [
            {
                "groups": [
                    "sudo",
                    "docker"
                ],
                "name": "core"
            }
        ]
    },
    "storage": {
        "files": [
            {
                "group": {},
                "overwrite": true,
                "path": "/etc/hostname",
                "user": {},
                "contents": {
                    "source": "data:,node-name",
                    "verification": {}
                },
                "mode": 272
            },
            {
                "group": {},
                "overwrite": true,
                "path": "/etc/ssh/sshd_config.d/05-acme.conf",
                "user": {},
                "contents": {
                    "source": "data:,TrustedUserCAKeys%20%2Fetc%2Fssh%2Ftrusted-user-ca-keys.pem%0AHostCertificate%20%2Fetc%2Fssh%2Fssh_host_ecdsa_key-cert.pub%0A",
                    "verification": {}
                },
                "mode": 420
            },
            {
                "group": {},
                "path": "/etc/acme-proxy.env",
                "user": {},
                "contents": {
                    "compression": "gzip",
                    "source": "data:;base64,***=",
                    "verification": {}
                },
                "mode": 420
            },
            {
                "group": {},
                "overwrite": true,
                "path": "/etc/sysconfig/docker",
                "user": {},
                "contents": {
                    "source": "data:,OPTIONS%3D%22--selinux-enabled%0A--default-ulimit%20nofile%3D64000%3A64000%0A--init-path%20%2Fusr%2Flibexec%2Fdocker%2Fdocker-init%0A--userland-proxy-path%20%2Fusr%2Flibexec%2Fdocker%2Fdocker-proxy%0A%22%0A",
                    "verification": {}
                },
                "mode": 420
            },
            {
                "group": {},
                "overwrite": true,
                "path": "/etc/ssh/trusted-user-ca-keys.pem",
                "user": {},
                "contents": {
                    "source": "data:,XXXX",
                    "verification": {}
                },
                "mode": 416
            }
        ]
    },
    "systemd": {
        "units": [
            {
                "dropins": [
                    {
                        "contents": "[Service]\nEnvironmentFile=/etc/acme-proxy.env\n",
                        "name": "99-proxy.conf"
                    }
                ],
                "name": "rpm-ostreed.service"
            },
            {
                "dropins": [
                    {
                        "contents": "[Service]\n# Next line can be used to increase verbosity of logging -vv or -vvv for TRACE logging...\nEnvironment=ZINCATI_VERBOSITY=\"-v\"\n# Next line redundant if we also use /etc/systemd/system.conf\nEnvironmentFile=/etc/acme-proxy.env\nExecStartPre=echo Proxies used are... http_proxy=$http_proxy, https_proxy=$https_proxy\n",
                        "name": "99-http-proxy.conf"
                    }
                ],
                "enabled": false,
                "name": "zincati.service"
            },
            {
                "dropins": [
                    {
                        "contents": "[Service]\nEnvironmentFile=/etc/acme-proxy.env\n",
                        "name": "99-http-proxy.conf"
                    }
                ],
                "enabled": true,
                "name": "docker.service"
            }
        ]
    }
}

Additional information

No SELinux denials. No journal logs at all when the failing curl is made. There are docker unit errors (VXLAN-related) on overlay network creation, but these are the same for F36 and F37 and so do not seem to be relevant. Nothing obvious (to me) in the journals.

travier commented 1 year ago

If you have a working deployment and a broken one, you can compare the sha256sums of the files listed by ostree admin config-diff. Something like (untested):

$ sudo -i
$ cd /etc
$ for f in $(ostree admin config-diff); do sha256sum $f ; done

That should narrow things down.

fifofonix commented 1 year ago

I will capture what you're asking for, but to reiterate: the issue I'm describing occurs purely by changing the starting FCOS version. From the tests performed to date I know that:

Servers provisioned with the following have no issues: 36.20220906.1.0.

That version would naturally update via zincati to the latest F37 version (37.20221111.1.0 at the time of testing). If I provision a new system directly with this F37 version, the issue manifests. However, as described in the bug, the issue does not manifest if I let the F36 system upgrade to this point.

I think what you would be most interested in is the SHA diff of the non-working freshly provisioned F37, vs the working F37 (that has upgraded from F36)?

dustymabe commented 1 year ago

I think what you would be most interested in is the SHA diff of the non-working freshly provisioned F37, vs the working F37 (that has upgraded from F36)?

correct. I think that's what he is asking for.

Another thing that would be interesting to me is if you iterated over the history to find the last working starting point. i.e. is 36.20220906.1.0 the newest starting point that works or is there another newer starting point that works? Having this info would then let us pinpoint the first starting point that doesn't work and we could analyze the diffs (not just package diffs but also fedora-coreos-config and coreos-assembler diffs) between those two versions.

fifofonix commented 1 year ago

Unfortunately, this kind of testing is time consuming (loading the OVF into vSphere apparently). To me it seems to be the F36 to F37 transition that is the issue. How would I know what the last F36 is and the first 'good' F37 is? (Not being an expert in cincinnati trees etc)

travier commented 1 year ago

The download page (https://getfedora.org/en/coreos?stream=stable) has the latest F36 & first F37 listed.

travier commented 1 year ago

I think what you would be most interested in is the SHA diff of the non-working freshly provisioned F37, vs the working F37 (that has upgraded from F36)?

correct. I think that's what he is asking for.

Yes, that's the idea.

fifofonix commented 1 year ago

This is the SHA diff between a non-working newly provisioned 37.20221111.1.0 node and a working node that was newly provisioned as 36.20220906.1.0 and then upgraded to 37.20221111.1.0.

For the record, in this test run I delayed creating the docker swarm on the F36-upgraded-to-F37 nodes until they had finished upgrading to F37. This yielded the same result, i.e. the swarm functioned fine.

1d0
< 405d7fa9638bd5f90d53855ac5a60fe0967c969e68485c0c8c2d5b268d3342d2  shadow
3d1
< 8ace78bb009f6ca5818dacd0e17887e9110972a14e3bdd02aea3f3109515b1e0  gshadow
5d2
< bbcb28b8d9aacfdf4662412b1149b9b31007142a93bd84f5aa2138b26e7de672  selinux/targeted/active/commit_num
7a5,7
> bbcb28b8d9aacfdf4662412b1149b9b31007142a93bd84f5aa2138b26e7de672  selinux/targeted/active/commit_num
> 4ef830580eaa4aa0f00f1b4880ab19ca033abd28404498ab62426b9bb4946c8b  gshadow
> 166077a16589114aa2a538903ef198c70d0faba488d9733588da9d1f0dc7865a  shadow
9,16c9,12
< 12b868b92fb12584298f7585e4316ba5093cd224d9e541ae12c004b2219571cf  ssh/sshd_config.d/05-columbia.conf
< 725c32558f455715e0ae15a3a03e167bbd26b0a2201b97cd8a2735ed00d22f4a  ssh/trusted-user-ca-keys.pem
< 0b447524b7f929fc56a1fd35e1feffecdab42357081b00e95a51b4785c6e979a  ssh/ssh_host_ed25519_key
< dbc416cf283e41d79a8a80e4e0a0022d587326874ff1f8b73ec1b983620182cd  ssh/ssh_host_ed25519_key.pub
< 62f2482376c5afe25695964d03be6b7fb5ab3e1a6d8b448e6edab38e3f439fe3  ssh/ssh_host_ecdsa_key
< 59d379e39d447b368c0610b6bd2ebc964b214efb2eaf0140f182c980462ec073  ssh/ssh_host_ecdsa_key.pub
< a4a0bef8447fff00f524b3209354e300b7513c56f026ab0ef5c3f71f61579797  ssh/ssh_host_rsa_key
< e0bc289d4b882b45748dc73b8017d2875a142ddb013cd37cf76bf479e65aeb50  ssh/ssh_host_rsa_key.pub
---
> 18409e29ec0eb01f14d501459bfa08ec77f8380e404d81c4fd227b766c8fd126  issue.d/21_clhm_ssh_host_keys.issue
> e845f05b4d4857f30a2cb991aaee338c2ab77d770dfa8588f1969e813fbf9975  issue.d/30_coreos_ignition_provisioning.issue
> aa7b458c6bcbf9b3a278751ba25ad2790ef8ada364216e167327fbd6854ee8a4  issue.d/22_clhm_ens192.issue
> 500252d588d2f0388fb479e5d5fd7366ec1bd9c27c1a41d161cf0ddeb64f046d  issue.d/30_ssh_authorized_keys.issue
19,22c15,22
< 500252d588d2f0388fb479e5d5fd7366ec1bd9c27c1a41d161cf0ddeb64f046d  issue.d/30_ssh_authorized_keys.issue
< a33c9266a28ebb9df5286e03d59e627c00b8730a20f57d8069b758b5724d3545  issue.d/30_coreos_ignition_provisioning.issue
< aa7b458c6bcbf9b3a278751ba25ad2790ef8ada364216e167327fbd6854ee8a4  issue.d/22_clhm_ens192.issue
< 24a2d09033776f29230dfb1d8db1dc9ed5373fa0d889a15d426f619519a471d0  issue.d/21_clhm_ssh_host_keys.issue
---
> 12b868b92fb12584298f7585e4316ba5093cd224d9e541ae12c004b2219571cf  ssh/sshd_config.d/05-columbia.conf
> 725c32558f455715e0ae15a3a03e167bbd26b0a2201b97cd8a2735ed00d22f4a  ssh/trusted-user-ca-keys.pem
> b0e3c78ec8530cb50fba2275415ca64b2fd395e672a7687da04bdcd5186b7501  ssh/ssh_host_ecdsa_key
> 2aeefb79ec4dadaec3b995ec794c85e65cafdd775ce5041ba8c729dea5f76c9d  ssh/ssh_host_ecdsa_key.pub
> 16bd482ea818130d4ef5ffc687bea1c295d377af4112888173d3a47ae6d4e14d  ssh/ssh_host_ed25519_key
> 9c6989c7d9c298945f4aebb1de7bbde1644ad0f257c0c85372956addef9b8ecf  ssh/ssh_host_ed25519_key.pub
> 61f16585a53f3de6ad6c5bf35daf2b4be2d39f12adae2008f1fc8fafe1d6ed04  ssh/ssh_host_rsa_key
> 00342ec96e8795ff2d5610ab5ec6cd717d0061b3a7fbb9374553300cd84afd3e  ssh/ssh_host_rsa_key.pub
28a29
> 6623a0b9a50f5d2667493c607f90dfc35c6724336edeac025b78c2272aa5275a  docker/key.json
31,33c32,34
< 92a0cf6eb61d190ea096489fd09c0563668ebbb5b6ca914e42ef84632abadde3  locale.conf
< 2c24f906bd707982c8b4641356b7ecdbad91416bcc4be4efecc0650648d4c64a  .ignition-result.json
< f5a1ab9f2e2037ef09207155dee9a3a66478017f1e51de28df0eeda3b127c5f4  machine-id
---
> f37bdaf5772c0dfa5f5a6b5a09db5f4d883865e75a7488a67fd43d8ab57b793d  .updated
> 6477b5f8323e29513bc22507fe87aefa39c5cb1a06da57e58300b1fc3f4a9325  .ignition-result.json
> 99e096e993f971f74c6c4afdcb916f4c19d9a8e55e7fd7a7a92ec6f301ee7658  machine-id
36c37,38
< 508268ffb93137cdb4495bfce92f7326e9a9449f8938ae747e3631001da1f060  shadow-
---
> 92a0cf6eb61d190ea096489fd09c0563668ebbb5b6ca914e42ef84632abadde3  locale.conf
> 8b0639e910dce777a0701c0485aaaca12d79ae126dc9e51964d144b51e7cdaa1  shadow-
38c40
< b4f674e3a5c71cb704819fe620140053f47eebed14c34a30f1308a71c5da9992  gshadow-
---
> 50c4cf5a5a5eec488d489c5ced894ec36e1ca7ede930f39c83bd96e33ea0772a  gshadow-
41,42d42
< f37bdaf5772c0dfa5f5a6b5a09db5f4d883865e75a7488a67fd43d8ab57b793d  .updated
< e7cc4cb080e1a655361436f7b102664e71674860c4210ba45c3a85e5d572cacd  docker/key.json
dustymabe commented 1 year ago

Unfortunately, this kind of testing is time consuming (loading the OVF into vSphere apparently). To me it seems to be the F36 to F37 transition that is the issue. How would I know what the last F36 is and the first 'good' F37 is? (Not being an expert in cincinnati trees etc)

If I were doing this I'd probably just look at the history in the unofficial builds browser and bisect between 36.20220906.1.0 and latest. It's definitely time consuming (unfortunately), but would help the investigation along. I'm definitely not an expert on swarm or container networking so bisecting and looking at diffs is the best I've got to offer.

fifofonix commented 1 year ago

Before I start down that avenue, any reaction to the diffs above? The good news is it is fairly short, and I think you'd expect the differences for many of the files listed? Anything that you want more info on?

dustymabe commented 1 year ago

I'm having trouble parsing the output in https://github.com/coreos/fedora-coreos-tracker/issues/1372#issuecomment-1375885311. Here's a new command to run:

sudo -i
cd /etc
for f in $(ostree admin config-diff | cut -d " " -f 5 | sort); do sha256sum $f ; done

Run this separately on the good node and on the bad node and post each of those outputs (you can add a .txt attachment here).

fifofonix commented 1 year ago

I basically ran that before. Here are the two files it produced: upgraded.f37.sha.diff.txt new.f37.sha.diff.txt

What I posted previously is the diff of these two files, to eliminate common files with identical SHAs. To do that properly, though, I should probably have re-sorted by field two (the filename) before diffing.
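A quick sketch of that redo, using the attached file names (untested):

sort -k2 upgraded.f37.sha.diff.txt > upgraded.sorted.txt
sort -k2 new.f37.sha.diff.txt > new.sorted.txt
diff upgraded.sorted.txt new.sorted.txt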

fifofonix commented 1 year ago

Following your suggestions I tried to narrow down to the problematic release on the stable stream, and to my surprise it is pre-F37:

  • Working: 36.20221014.3.1
  • Non-working: 36.20221030.3.0

I'm going to gather the files you suggested above for these versions tomorrow.

dustymabe commented 1 year ago

Following your suggestions I tried to narrow down to the problematic release on the stable stream and to my surprise it is pre-f37:

  • Working: 36.20221014.3.1
  • Non-working: 36.20221030.3.0

And to be clear.. both of these work when initially provisioned, but after fully updating (all the way to F37) the one that started at 36.20221030.3.0 no longer works?

The f-c-c diff between those two versions is: https://github.com/coreos/fedora-coreos-config/compare/340bc23af03163d8569fc5cee9667f051c9e0025...59530d10327c0dc975857d120af7d72e30b22626

The COSA diff between those two versions is: https://github.com/coreos/coreos-assembler/compare/89f06f542dc1e9cdeae0d8dfb1a8b46e7da4adba...e8676668f7c1718e982a2081f3ac4b8d15590834

The package diff between those two versions is:

$ rpm-ostree --repo=./ db diff e75cd529cfc15329d9b1cb80b1fc83f8af3a70029b015da2b8a8d7c17bac9b3c eab21e5b533407b67b1751ba64d83c809d076edffa1ff002334603bf13655a14
ostree diff commit from: e75cd529cfc15329d9b1cb80b1fc83f8af3a70029b015da2b8a8d7c17bac9b3c
ostree diff commit to:   eab21e5b533407b67b1751ba64d83c809d076edffa1ff002334603bf13655a14
Upgraded:
  NetworkManager 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
  NetworkManager-cloud-setup 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
  NetworkManager-libnm 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
  NetworkManager-team 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
  NetworkManager-tui 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
  aardvark-dns 1.1.0-1.fc36 -> 1.2.0-6.fc36
  amd-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36
  bash 5.1.16-3.fc36 -> 5.2.2-2.fc36
  btrfs-progs 5.18-1.fc36 -> 6.0-1.fc36
  chrony 4.2-5.fc36 -> 4.3-1.fc36
  containers-common 4:1-59.fc36 -> 4:1-62.fc36
  coreos-installer 0.16.0-1.fc36 -> 0.16.1-2.fc36
  coreos-installer-bootinfra 0.16.0-1.fc36 -> 0.16.1-2.fc36
  ethtool 2:5.19-1.fc36 -> 2:6.0-1.fc36
  fedora-release-common 36-18 -> 36-20
  fedora-release-coreos 36-18 -> 36-20
  fedora-release-identity-coreos 36-18 -> 36-20
  git-core 2.37.3-1.fc36 -> 2.38.1-1.fc36
  glibc 2.35-17.fc36 -> 2.35-20.fc36
  glibc-common 2.35-17.fc36 -> 2.35-20.fc36
  glibc-minimal-langpack 2.35-17.fc36 -> 2.35-20.fc36
  gnutls 3.7.7-1.fc36 -> 3.7.8-2.fc36
  grub2-common 1:2.06-53.fc36 -> 1:2.06-54.fc36
  grub2-efi-x64 1:2.06-53.fc36 -> 1:2.06-54.fc36
  grub2-pc 1:2.06-53.fc36 -> 1:2.06-54.fc36
  grub2-pc-modules 1:2.06-53.fc36 -> 1:2.06-54.fc36
  grub2-tools 1:2.06-53.fc36 -> 1:2.06-54.fc36
  grub2-tools-minimal 1:2.06-53.fc36 -> 1:2.06-54.fc36
  intel-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36
  kernel 5.19.15-201.fc36 -> 6.0.5-200.fc36
  kernel-core 5.19.15-201.fc36 -> 6.0.5-200.fc36
  kernel-modules 5.19.15-201.fc36 -> 6.0.5-200.fc36
  libidn2 2.3.3-1.fc36 -> 2.3.4-1.fc36
  libksba 1.6.0-3.fc36 -> 1.6.2-1.fc36
  libmaxminddb 1.6.0-2.fc36 -> 1.7.1-1.fc36
  libsmbclient 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
  libwbclient 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
  libxml2 2.9.14-1.fc36 -> 2.10.3-2.fc36
  linux-firmware 20220913-140.fc36 -> 20221012-141.fc36
  linux-firmware-whence 20220913-140.fc36 -> 20221012-141.fc36
  moby-engine 20.10.18-1.fc36 -> 20.10.20-1.fc36
  netavark 1.1.0-1.fc36 -> 1.2.0-5.fc36
  nvidia-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36
  podman 4:4.2.1-2.fc36 -> 4:4.3.0-2.fc36
  podman-plugins 4:4.2.1-2.fc36 -> 4:4.3.0-2.fc36
  rpm-ostree 2022.13-1.fc36 -> 2022.14-1.fc36
  rpm-ostree-libs 2022.13-1.fc36 -> 2022.14-1.fc36
  rsync 3.2.6-1.fc36 -> 3.2.7-1.fc36
  runc 2:1.1.3-1.fc36 -> 2:1.1.4-1.fc36
  samba-client-libs 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
  samba-common 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
  samba-common-libs 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
  ssh-key-dir 0.1.3-2.fc36 -> 0.1.4-1.fc36
  tzdata 2022d-1.fc36 -> 2022e-1.fc36
  vim-data 2:9.0.720-1.fc36 -> 2:9.0.803-1.fc36
  vim-minimal 2:9.0.720-1.fc36 -> 2:9.0.803-1.fc36
Added:
  containers-common-extra-4:1-62.fc36.noarch
dustymabe commented 1 year ago

I basically ran that before. Here are the two files it produced: upgraded.f37.sha.diff.txt new.f37.sha.diff.txt

What I posted previously is the diff of these two files to eliminate common files with identical SHAs. To do that properly though I should have probably re-sorted by field two before diffing however.

Looking at those two files it appears these are the files that are different (different SHA-256) between the two:

shadow                                       
gshadow                                      
ssh/ssh_host_ed25519_key                     
ssh/ssh_host_ed25519_key.pub                 
ssh/ssh_host_ecdsa_key                       
ssh/ssh_host_ecdsa_key.pub                   
ssh/ssh_host_rsa_key                         
ssh/ssh_host_rsa_key.pub                     
issue.d/30_coreos_ignition_provisioning.issue
issue.d/21_clhm_ssh_host_keys.issue          
.ignition-result.json                        
machine-id                                   
shadow-                                      
gshadow-                                     
docker/key.json                              

I don't think there is anything surprising in there and likely nothing that would explain the described difference in behavior.

travier commented 1 year ago

I'm out of ideas and would recommend asking in the Moby/Docker issue tracker for help.

fifofonix commented 1 year ago

Following your suggestions I tried to narrow down to the problematic release on the stable stream and to my surprise it is pre-f37:

  • Working: 36.20221014.3.1
  • Non-working: 36.20221030.3.0

I'm going to gather the files you suggested above for these versions tomorrow.

Sorry if I wasn't clear earlier @dustymabe, but I'm now saying that this issue manifests prior to F37, so that is new information and probably warrants changing the issue title.

It's as simple as this: you can create a swarm with an overlay network and cross-node communicating containers on the first FCOS version listed, but you can't on the second or any later version (up to and including today's current stable).

There is the slight twist that you can have a working setup if you start from the first listed FCOS version and upgrade to F37, but in terms of focus I'd say the interesting question is why a new swarm works on the former but not on the latter.

travier commented 1 year ago

Where is the Swarm configuration stored? How does Swarm actually work?

fifofonix commented 1 year ago

Note that the first non-working FCOS version is the point at which the upgrade to docker 20.10.20 occurs. So your suggestion, @travier, to follow up there may be a good idea. I'm going to change the issue description to pinpoint 20.10.20.

fifofonix commented 1 year ago

This thread reports similar behaviour: https://github.com/moby/moby/issues/41775

For the record the vSphere environment this issue is manifesting in is: 7.0.3.01100

And the VMware virtual hardware version embedded in the OVF for the non-working 36.20221030.3.0 is 17.

fifofonix commented 1 year ago

Note: this issue does not manifest using the OVAs on VMware Fusion 13.0.0, further pointing to a vSphere-specific networking incompatibility.

dustymabe commented 1 year ago

According to the release notes:

[Screenshot of the Fedora CoreOS release notes on getfedora.org, dated 2023-01-10]

So this is likely fallout from:

dustymabe commented 1 year ago

You can verify that by trying out the downgrade instructions.

fifofonix commented 1 year ago

Downgrading fedora-coreos-37.20230110.1.0-vmware.x86_64.ova to vmx machine 13 fixes the problem. Wow! Sorry it has taken so long to confirm this.
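For reference, a rough and untested sketch of what such a downgrade can look like; the inner .ovf/.vmdk file names are assumptions, and the linked downgrade instructions remain the authoritative procedure:

# An OVA is a plain tar archive; unpack it
tar -xvf fedora-coreos-37.20230110.1.0-vmware.x86_64.ova
# Rewrite the virtual hardware version in the OVF descriptor from 17 to 13
sed -i 's/vmx-17/vmx-13/' fedora-coreos-37.20230110.1.0-vmware.x86_64.ovf
# Repack with the descriptor first; if a .mf manifest is present its checksums must be regenerated
tar -cvf fedora-coreos-37.20230110.1.0-vmware-vmx13.x86_64.ova \
    fedora-coreos-37.20230110.1.0-vmware.x86_64.ovf \
    fedora-coreos-37.20230110.1.0-vmware.x86_64.vmdk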

dustymabe commented 1 year ago

Unfortunately I don't have much insight into why that change would have caused this problem. Does anyone with VMWare expertise know?

fifofonix commented 1 year ago

I experimentally upgraded to vmx machine 19 hoping that there might have been a bug fix that addressed this. Nope.

bgilbert commented 1 year ago

Nothing in the hardware feature matrix seems immediately relevant, but I presume there are other hardware changes not listed.

fifofonix commented 1 year ago

Per: https://stackoverflow.com/questions/66251422/docker-swarm-overlay-network-icmp-works-but-not-anything-else

Running sudo ethtool -K <ens192 or whatever the outbound interface is> tx off on the nodes concerned fixes the issue.

This switch controls whether TCP checksum offload is performed by the interface driver. Switching it off means the checksums are computed on the host CPU, with a consequent performance penalty but no loss of functionality.
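A small sketch of checking and applying this at runtime (the interface name is a placeholder, and an ethtool -K change does not persist across reboots):

# Show the current checksum offload state for the uplink interface
sudo ethtool -k ens192 | grep checksum
# Disable TX checksum offload for this boot only
sudo ethtool -K ens192 tx off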

An alternate solution mentioned (as yet untested) indicates that switching from the vmxnet3 driver to a simple E1000E emulated card fixes the issue, implying this is an issue with the vmxnet3 interface driver.

Other potential driver options are listed here, but more importantly this page seems to confirm that the driver version is a function of the VMware machine ID and the guest OS. This would explain why everything might work fine for VMware machine ID 13 but not machine ID 17 (or 19).

fifofonix commented 1 year ago

Confirmed that replacing the network interfaces with E1000-based cards via the vSphere UI fixes the problem, so this indeed seems to be specifically vmxnet3 driver related.

dustymabe commented 1 year ago

@fifofonix it would be good to get some of this information into the BZ you opened as well.

At this point I think we are in one of two cases:

  1. The interface driver check-summing in the kernel has never worked with hardware version 17 (unlikely).
  2. The interface driver check-summing in the kernel worked at one point but a regression was introduced.

If we could pinpoint which version of the kernel the regression was introduced in, then we could provide more information to the upstream kernel maintainers. You can try out older kernels by running rpm-ostree override replace http://path/to/kojiid, pulling them from the koji builds.
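A minimal sketch of such an override, assuming the usual kojipkgs URL layout and the 5.19.15-201.fc36 kernel from the last working starting point (verify the exact URLs against koji before using them):

# Roll the kernel back to a candidate known-good version, then reboot into it
sudo rpm-ostree override replace \
    https://kojipkgs.fedoraproject.org/packages/kernel/5.19.15/201.fc36/x86_64/kernel-5.19.15-201.fc36.x86_64.rpm \
    https://kojipkgs.fedoraproject.org/packages/kernel/5.19.15/201.fc36/x86_64/kernel-core-5.19.15-201.fc36.x86_64.rpm \
    https://kojipkgs.fedoraproject.org/packages/kernel/5.19.15/201.fc36/x86_64/kernel-modules-5.19.15-201.fc36.x86_64.rpm
sudo systemctl reboot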

Either way we're going to have to find the proper people or list to send this information to. Maybe @jmflinuxtx can point us in the right direction.

fifofonix commented 1 year ago

The other issue in the BZ relates to already-provisioned 'old' VMware machine ID 13 machines upgrading to the latest OS versions. In this instance it is newly provisioned 'new' VMware machine ID 17 machines experiencing issues. I guess it is possible there is some VMware-level networking issue common to both, but I haven't seen the connection yet to say they are related.

As for this issue, I was thinking this is not an FCOS issue at all but an issue with a VMware network card emulator (or whatever the right name is for that type of software). In my mental model, by making the tx off switch I was preventing the offload to a buggy vmxnet3 and having the guest OS do the checksumming instead of VMware. That would make more sense to me given that we only see this occur on VMware and with vmxnet3, now that we have shown that editing the machine to switch from vmxnet3 to E1000E fixes things.

As you say some experts on this would be great.

fifofonix commented 1 year ago

This remains an issue with 37.20230205.1.0 on vSphere 7.0 3j (i.e. build 20990077).

Note that to toggle this switch via a NetworkManager connection file in /etc/NetworkManager/system-connections the syntax is:

...
[ethtool]
feature-tx-checksum-ip-generic=false
...
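If you prefer not to edit the keyfile by hand, something like the following nmcli invocation should achieve the same result (a sketch, assuming the connection is named ens192 and that your NetworkManager version exposes the ethtool feature properties):

sudo nmcli connection modify ens192 ethtool.feature-tx-checksum-ip-generic off
sudo nmcli connection up ens192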
grantcurell commented 1 year ago

Per: https://stackoverflow.com/questions/66251422/docker-swarm-overlay-network-icmp-works-but-not-anything-else

sudo ethtool -K <ens192 or whatever outbound interface> tx off executed on the nodes concerned fixes the issue.

This switch controls where tcp checksums are performed on the interface driver or not. Switching it off means the checks are performed on the host with a consequent cpu performance penalty but no loss of functionality.

An alternate solution mentioned (as yet untested) seems to indicate that switching from a vmxnet3 driver to a simple E1000E card emulator fixes the issue implying this is an issue with the VmxNet3 interface driver.

Other potential driver options are listed here but more importantly this page seems to confirm that the driver version is a function of the vmware machine ID and the guest OS. This would explain why all might work fine for vmware machine ID 13 but not vmware machine ID 17 (or 19).

Confirmed on my setup that swapping to an E1000 interface addresses the problem.

fifofonix commented 1 year ago

Confirmed that this issue still applies at 38.20230322.1.0 with vSphere 7.0.3.01200

fifofonix commented 1 year ago

Confirmed that this issue still applies at 38.20230414.1.0 (6.2.9-300.fc38.x86_64) with vSphere 7.0.3.01200

dustymabe commented 1 year ago

I think our status is still at https://github.com/coreos/fedora-coreos-tracker/issues/1372#issuecomment-1382433252

Basically we need to find relevant upstream people who can fix this in the driver(s) itself, right?

fifofonix commented 1 year ago

Agreed. I'm just periodically testing to see whether it gets fixed magically.

Nowheresly commented 1 year ago

Hi, we were facing the same issue and have found a workaround. When creating the swarm, you must specify a data-path-port different from the default value of 4789 (see docs).

So it means if we create the swarm using this command:

docker swarm init --data-path-port=38888

(38888 is just an example, you can set any value as long as it is not 4789)

then the invalid-checksum problem disappears, and it is no longer necessary to disable checksum offload in the network driver.

I have no idea why we have such different behavior with port 4789...

fifofonix commented 1 year ago

Interesting. I haven't tested your solution, but I found this when googling it, and it seems the root of the problem is a conflict with VMware NSX's communication port for VXLAN. This is good to know. Thanks!
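Not something tested in this thread, but a quick way to check whether VXLAN traffic on the default port is actually leaving the node while a failing cross-node curl runs (interface name is a placeholder):

sudo tcpdump -ni ens192 udp port 4789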

gmchenyong commented 12 months ago

On my ESXi 7.0.2, installing CentOS 8, CentOS 7, or Debian 12 shows the same problem: docker swarm cannot reach ports across physical hosts over the overlay network. Exactly as you describe, docker exec -it curl ... fails. Thank you for this issue and the answers below; they showed me the solution.