fifofonix opened this issue 1 year ago
If you have a working deployment and a broken one, you could compare the sha256sums of the files listed by ostree admin config-diff. Something like (untested):
$ sudo -i
$ cd /etc
$ for f in $(ostree admin config-diff); do sha256sum $f ; done
That should narrow things down.
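For instance, a rough (equally untested) way to capture that on each node and then compare; the output paths and node names are just placeholders:
# On each of the good node and the bad node, record the checksums of the files
# that differ from the defaults (awk grabs the filename column of the config-diff output)
sudo -i
cd /etc
for f in $(ostree admin config-diff | awk '{print $2}' | sort); do sha256sum "$f"; done > /tmp/etc-sums-$(hostname).txt
# Copy both files to one machine, then:
diff /tmp/etc-sums-good-node.txt /tmp/etc-sums-bad-node.txt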
I will capture what you're asking for, but to reiterate: the issue I'm describing occurs purely by changing the starting FCOS version. From my tests performed to date I know that:
Servers provisioned with the following have no issues: 36.20220906.1.0 (F36).
That version would naturally update via zincati to the following F37 version (37.20221111.1.0). If I provision a new system with this F37 version the issue manifests. However, as described in the bug, the issue does not manifest if I let the F36 system upgrade to this point.
I think what you would be most interested in is the SHA diff of the non-working freshly provisioned F37, vs the working F37 (that has upgraded from F36)?
I think what you would be most interested in is the SHA diff of the non-working freshly provisioned F37, vs the working F37 (that has upgraded from F36)?
correct. I think that's what he is asking for.
Another thing that would be interesting to me is if you iterated over the history to find the last working starting point, i.e. is 36.20220906.1.0
the newest starting point that works, or is there a newer starting point that works? Having this info would let us pinpoint the first starting point that doesn't work, and we could analyze the diffs (not just package diffs but also fedora-coreos-config and coreos-assembler diffs) between those two versions.
Unfortunately, this kind of testing is time consuming (loading the OVF into vSphere apparently). To me it seems to be the F36 to F37 transition that is the issue. How would I know what the last F36 is and the first 'good' F37 is? (Not being an expert in cincinnati trees etc)
The download page (https://getfedora.org/en/coreos?stream=stable ) has the latest F36 & first F37 listed.
I think what you would be most interested in is the SHA diff of the non-working freshly provisioned F37, vs the working F37 (that has upgraded from F36)?
correct. I think that's what he is asking for.
Yes, that's the idea.
This is the SHA diff between the non-working newly provisioned 37.20221111.1.0 and a working newly provisioned 36.20220906.1.0 after it has upgraded to 37.20221111.1.0.
For the record, in this test run I delayed creating the docker swarm on the F36 nodes until they had upgraded to F37. This yielded the same result, i.e. the swarm functioned fine.
1d0
< 405d7fa9638bd5f90d53855ac5a60fe0967c969e68485c0c8c2d5b268d3342d2 shadow
3d1
< 8ace78bb009f6ca5818dacd0e17887e9110972a14e3bdd02aea3f3109515b1e0 gshadow
5d2
< bbcb28b8d9aacfdf4662412b1149b9b31007142a93bd84f5aa2138b26e7de672 selinux/targeted/active/commit_num
7a5,7
> bbcb28b8d9aacfdf4662412b1149b9b31007142a93bd84f5aa2138b26e7de672 selinux/targeted/active/commit_num
> 4ef830580eaa4aa0f00f1b4880ab19ca033abd28404498ab62426b9bb4946c8b gshadow
> 166077a16589114aa2a538903ef198c70d0faba488d9733588da9d1f0dc7865a shadow
9,16c9,12
< 12b868b92fb12584298f7585e4316ba5093cd224d9e541ae12c004b2219571cf ssh/sshd_config.d/05-columbia.conf
< 725c32558f455715e0ae15a3a03e167bbd26b0a2201b97cd8a2735ed00d22f4a ssh/trusted-user-ca-keys.pem
< 0b447524b7f929fc56a1fd35e1feffecdab42357081b00e95a51b4785c6e979a ssh/ssh_host_ed25519_key
< dbc416cf283e41d79a8a80e4e0a0022d587326874ff1f8b73ec1b983620182cd ssh/ssh_host_ed25519_key.pub
< 62f2482376c5afe25695964d03be6b7fb5ab3e1a6d8b448e6edab38e3f439fe3 ssh/ssh_host_ecdsa_key
< 59d379e39d447b368c0610b6bd2ebc964b214efb2eaf0140f182c980462ec073 ssh/ssh_host_ecdsa_key.pub
< a4a0bef8447fff00f524b3209354e300b7513c56f026ab0ef5c3f71f61579797 ssh/ssh_host_rsa_key
< e0bc289d4b882b45748dc73b8017d2875a142ddb013cd37cf76bf479e65aeb50 ssh/ssh_host_rsa_key.pub
---
> 18409e29ec0eb01f14d501459bfa08ec77f8380e404d81c4fd227b766c8fd126 issue.d/21_clhm_ssh_host_keys.issue
> e845f05b4d4857f30a2cb991aaee338c2ab77d770dfa8588f1969e813fbf9975 issue.d/30_coreos_ignition_provisioning.issue
> aa7b458c6bcbf9b3a278751ba25ad2790ef8ada364216e167327fbd6854ee8a4 issue.d/22_clhm_ens192.issue
> 500252d588d2f0388fb479e5d5fd7366ec1bd9c27c1a41d161cf0ddeb64f046d issue.d/30_ssh_authorized_keys.issue
19,22c15,22
< 500252d588d2f0388fb479e5d5fd7366ec1bd9c27c1a41d161cf0ddeb64f046d issue.d/30_ssh_authorized_keys.issue
< a33c9266a28ebb9df5286e03d59e627c00b8730a20f57d8069b758b5724d3545 issue.d/30_coreos_ignition_provisioning.issue
< aa7b458c6bcbf9b3a278751ba25ad2790ef8ada364216e167327fbd6854ee8a4 issue.d/22_clhm_ens192.issue
< 24a2d09033776f29230dfb1d8db1dc9ed5373fa0d889a15d426f619519a471d0 issue.d/21_clhm_ssh_host_keys.issue
---
> 12b868b92fb12584298f7585e4316ba5093cd224d9e541ae12c004b2219571cf ssh/sshd_config.d/05-columbia.conf
> 725c32558f455715e0ae15a3a03e167bbd26b0a2201b97cd8a2735ed00d22f4a ssh/trusted-user-ca-keys.pem
> b0e3c78ec8530cb50fba2275415ca64b2fd395e672a7687da04bdcd5186b7501 ssh/ssh_host_ecdsa_key
> 2aeefb79ec4dadaec3b995ec794c85e65cafdd775ce5041ba8c729dea5f76c9d ssh/ssh_host_ecdsa_key.pub
> 16bd482ea818130d4ef5ffc687bea1c295d377af4112888173d3a47ae6d4e14d ssh/ssh_host_ed25519_key
> 9c6989c7d9c298945f4aebb1de7bbde1644ad0f257c0c85372956addef9b8ecf ssh/ssh_host_ed25519_key.pub
> 61f16585a53f3de6ad6c5bf35daf2b4be2d39f12adae2008f1fc8fafe1d6ed04 ssh/ssh_host_rsa_key
> 00342ec96e8795ff2d5610ab5ec6cd717d0061b3a7fbb9374553300cd84afd3e ssh/ssh_host_rsa_key.pub
28a29
> 6623a0b9a50f5d2667493c607f90dfc35c6724336edeac025b78c2272aa5275a docker/key.json
31,33c32,34
< 92a0cf6eb61d190ea096489fd09c0563668ebbb5b6ca914e42ef84632abadde3 locale.conf
< 2c24f906bd707982c8b4641356b7ecdbad91416bcc4be4efecc0650648d4c64a .ignition-result.json
< f5a1ab9f2e2037ef09207155dee9a3a66478017f1e51de28df0eeda3b127c5f4 machine-id
---
> f37bdaf5772c0dfa5f5a6b5a09db5f4d883865e75a7488a67fd43d8ab57b793d .updated
> 6477b5f8323e29513bc22507fe87aefa39c5cb1a06da57e58300b1fc3f4a9325 .ignition-result.json
> 99e096e993f971f74c6c4afdcb916f4c19d9a8e55e7fd7a7a92ec6f301ee7658 machine-id
36c37,38
< 508268ffb93137cdb4495bfce92f7326e9a9449f8938ae747e3631001da1f060 shadow-
---
> 92a0cf6eb61d190ea096489fd09c0563668ebbb5b6ca914e42ef84632abadde3 locale.conf
> 8b0639e910dce777a0701c0485aaaca12d79ae126dc9e51964d144b51e7cdaa1 shadow-
38c40
< b4f674e3a5c71cb704819fe620140053f47eebed14c34a30f1308a71c5da9992 gshadow-
---
> 50c4cf5a5a5eec488d489c5ced894ec36e1ca7ede930f39c83bd96e33ea0772a gshadow-
41,42d42
< f37bdaf5772c0dfa5f5a6b5a09db5f4d883865e75a7488a67fd43d8ab57b793d .updated
< e7cc4cb080e1a655361436f7b102664e71674860c4210ba45c3a85e5d572cacd docker/key.json
Unfortunately, this kind of testing is time consuming (loading the OVF into vSphere apparently). To me it seems to be the F36 to F37 transition that is the issue. How would I know what the last F36 is and the first 'good' F37 is? (Not being an expert in cincinnati trees etc)
If I were doing this I'd probably just look at the history in the unofficial builds browser and bisect between 36.20220906.1.0
and latest. It's definitely time consuming (unfortunately), but would help the investigation along. I'm definitely not an expert on swarm or container networking so bisecting and looking at diffs is the best I've got to offer.
Before I start down that avenue, any reaction to the diffs above? The good news is that the list is fairly short, and I think you'd expect differences for many of the files listed. Anything that you want more info on?
I'm having trouble parsing the output in https://github.com/coreos/fedora-coreos-tracker/issues/1372#issuecomment-1375885311. Here's a new command to run:
sudo -i
cd /etc
for f in $(ostree admin config-diff | cut -d " " -f 5 | sort); do sha256sum $f ; done
Run this separately on the good node and on the bad node and post each of those outputs (you can add a .txt
attachment here).
I basically ran that before. Here are the two files it produced: upgraded.f37.sha.diff.txt new.f37.sha.diff.txt
What I posted previously is the diff of these two files, to eliminate common files with identical SHAs. To do that properly, though, I should probably have re-sorted by field two (the filename) before diffing.
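For completeness, a quick way to do that re-sort (just a sketch; the filenames are the attachments above, and joining on the path column is my own choice of comparison):
# sha256sum prints "<hash>  <path>", so sort both lists by the path column
sort -k2 upgraded.f37.sha.diff.txt > upgraded.sorted
sort -k2 new.f37.sha.diff.txt > new.sorted
# Straight diff of the sorted lists...
diff upgraded.sorted new.sorted
# ...or join on the path and print only the paths whose hashes differ
join -j2 upgraded.sorted new.sorted | awk '$2 != $3 {print $1}'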
Following your suggestions I tried to narrow down to the problematic release on the stable stream and to my surprise it is pre-f37:
I'm going to gather the files you suggested above for these versions tomorrow.
Following your suggestions I tried to narrow down to the problematic release on the stable stream and to my surprise it is pre-f37:
- Working: 36.20221014.3.1
- Non-working: 36.20221030.3.0
And to be clear... both of these work when initially provisioned, but after fully updating (all the way to F37) the one that started at 36.20221030.3.0 no longer works?
The f-c-c diff between those two versions is: https://github.com/coreos/fedora-coreos-config/compare/340bc23af03163d8569fc5cee9667f051c9e0025...59530d10327c0dc975857d120af7d72e30b22626
The COSA diff between those two versions is: https://github.com/coreos/coreos-assembler/compare/89f06f542dc1e9cdeae0d8dfb1a8b46e7da4adba...e8676668f7c1718e982a2081f3ac4b8d15590834
The package diff between those two versions is:
$ rpm-ostree --repo=./ db diff e75cd529cfc15329d9b1cb80b1fc83f8af3a70029b015da2b8a8d7c17bac9b3c eab21e5b533407b67b1751ba64d83c809d076edffa1ff002334603bf13655a14
ostree diff commit from: e75cd529cfc15329d9b1cb80b1fc83f8af3a70029b015da2b8a8d7c17bac9b3c
ostree diff commit to: eab21e5b533407b67b1751ba64d83c809d076edffa1ff002334603bf13655a14
Upgraded:
NetworkManager 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
NetworkManager-cloud-setup 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
NetworkManager-libnm 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
NetworkManager-team 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
NetworkManager-tui 1:1.38.4-1.fc36 -> 1:1.38.6-1.fc36
aardvark-dns 1.1.0-1.fc36 -> 1.2.0-6.fc36
amd-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36
bash 5.1.16-3.fc36 -> 5.2.2-2.fc36
btrfs-progs 5.18-1.fc36 -> 6.0-1.fc36
chrony 4.2-5.fc36 -> 4.3-1.fc36
containers-common 4:1-59.fc36 -> 4:1-62.fc36
coreos-installer 0.16.0-1.fc36 -> 0.16.1-2.fc36
coreos-installer-bootinfra 0.16.0-1.fc36 -> 0.16.1-2.fc36
ethtool 2:5.19-1.fc36 -> 2:6.0-1.fc36
fedora-release-common 36-18 -> 36-20
fedora-release-coreos 36-18 -> 36-20
fedora-release-identity-coreos 36-18 -> 36-20
git-core 2.37.3-1.fc36 -> 2.38.1-1.fc36
glibc 2.35-17.fc36 -> 2.35-20.fc36
glibc-common 2.35-17.fc36 -> 2.35-20.fc36
glibc-minimal-langpack 2.35-17.fc36 -> 2.35-20.fc36
gnutls 3.7.7-1.fc36 -> 3.7.8-2.fc36
grub2-common 1:2.06-53.fc36 -> 1:2.06-54.fc36
grub2-efi-x64 1:2.06-53.fc36 -> 1:2.06-54.fc36
grub2-pc 1:2.06-53.fc36 -> 1:2.06-54.fc36
grub2-pc-modules 1:2.06-53.fc36 -> 1:2.06-54.fc36
grub2-tools 1:2.06-53.fc36 -> 1:2.06-54.fc36
grub2-tools-minimal 1:2.06-53.fc36 -> 1:2.06-54.fc36
intel-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36
kernel 5.19.15-201.fc36 -> 6.0.5-200.fc36
kernel-core 5.19.15-201.fc36 -> 6.0.5-200.fc36
kernel-modules 5.19.15-201.fc36 -> 6.0.5-200.fc36
libidn2 2.3.3-1.fc36 -> 2.3.4-1.fc36
libksba 1.6.0-3.fc36 -> 1.6.2-1.fc36
libmaxminddb 1.6.0-2.fc36 -> 1.7.1-1.fc36
libsmbclient 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
libwbclient 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
libxml2 2.9.14-1.fc36 -> 2.10.3-2.fc36
linux-firmware 20220913-140.fc36 -> 20221012-141.fc36
linux-firmware-whence 20220913-140.fc36 -> 20221012-141.fc36
moby-engine 20.10.18-1.fc36 -> 20.10.20-1.fc36
netavark 1.1.0-1.fc36 -> 1.2.0-5.fc36
nvidia-gpu-firmware 20220913-140.fc36 -> 20221012-141.fc36
podman 4:4.2.1-2.fc36 -> 4:4.3.0-2.fc36
podman-plugins 4:4.2.1-2.fc36 -> 4:4.3.0-2.fc36
rpm-ostree 2022.13-1.fc36 -> 2022.14-1.fc36
rpm-ostree-libs 2022.13-1.fc36 -> 2022.14-1.fc36
rsync 3.2.6-1.fc36 -> 3.2.7-1.fc36
runc 2:1.1.3-1.fc36 -> 2:1.1.4-1.fc36
samba-client-libs 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
samba-common 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
samba-common-libs 2:4.16.5-0.fc36 -> 2:4.16.6-0.fc36
ssh-key-dir 0.1.3-2.fc36 -> 0.1.4-1.fc36
tzdata 2022d-1.fc36 -> 2022e-1.fc36
vim-data 2:9.0.720-1.fc36 -> 2:9.0.803-1.fc36
vim-minimal 2:9.0.720-1.fc36 -> 2:9.0.803-1.fc36
Added:
containers-common-extra-4:1-62.fc36.noarch
I basically ran that before. Here are the two files it produced: upgraded.f37.sha.diff.txt new.f37.sha.diff.txt
What I posted previously is the diff of these two files, to eliminate common files with identical SHAs. To do that properly, though, I should probably have re-sorted by field two (the filename) before diffing.
Looking at those two files it appears these are the files that are different (different SHA-256) between the two:
shadow
gshadow
ssh/ssh_host_ed25519_key
ssh/ssh_host_ed25519_key.pub
ssh/ssh_host_ecdsa_key
ssh/ssh_host_ecdsa_key.pub
ssh/ssh_host_rsa_key
ssh/ssh_host_rsa_key.pub
issue.d/30_coreos_ignition_provisioning.issue
issue.d/21_clhm_ssh_host_keys.issue
.ignition-result.json
machine-id
shadow-
gshadow-
docker/key.json
I don't think there is anything surprising in there and likely nothing that would explain the described difference in behavior.
I'm out of ideas and would recommend asking in the Moby/Docker issue tracker for help.
Following your suggestions I tried to narrow down to the problematic release on the stable stream and to my surprise it is pre-f37:
- Working: 36.20221014.3.1
- Non-working: 36.20221030.3.0
I'm going to gather the files you suggested above for these versions tomorrow.
Sorry if I wasn't clear earlier @dustymabe, but I'm now saying that this issue manifests prior to F37, so that is new information and probably warrants changing the issue title.
It's as simple as: you can create a swarm with an overlay network and cross-node communicating containers on the first FCOS version listed, but you can't on the second and all subsequent versions (up to today's current stable).
There is the slight twist that you can have a working setup if you start from the first listed FCOS version and upgrade to F37, but in terms of focus I'd say the interesting question is why a new swarm works on the former but not on the latter.
Where is the Swarm configuration stored? How does Swarm actually work?
Note that the first non-working FCOS version is the point at which the upgrade to docker 20.10.20 occurs. So your suggestion, @travier, to follow up there may be a good idea. I'm going to change the issue description to pinpoint 20.10.20.
This thread reports similar behaviour: https://github.com/moby/moby/issues/41775
For the record, the vSphere environment this issue is manifesting in is 7.0.3.01100, and the VMware VM hardware version embedded in the non-working 36.20221030.3.0 OVF is 17.
Note, this issue does not manifest using the OVAs on VMware Fusion 13.0.0, further pointing to a vSphere-specific networking incompatibility.
According to the release notes:
So this is likely fallout from:
You can verify that by trying out the downgrade instructions.
Downgrading fedora-coreos-37.20230110.1.0-vmware.x86_64.ova to vmx machine 13 fixes the problem. Wow! Sorry it has taken so long to confirm this.
Unfortunately I don't have much insight into why that change would have caused this problem. Does anyone with VMWare expertise know?
I experimentally upgraded to vmx machine 19 hoping that there might have been a bug fix that addressed this. Nope.
Nothing in the hardware feature matrix seems immediately relevant, but I presume there are other hardware changes not listed.
sudo ethtool -K <ens192 or whatever outbound interface> tx off
executed on the nodes concerned fixes the issue.
This switch controls whether TCP checksums are computed by the interface driver (offloaded) or not; switching it off means the checksums are computed on the host, with a consequent CPU performance penalty but no loss of functionality.
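For reference, you can inspect the relevant settings before and after; this is just standard ethtool diagnostics, ens192 is an example interface name, and I haven't verified whether disabling only the single generic feature is sufficient versus all of tx:
# Inspect current offload settings on the uplink interface
ethtool -k ens192 | grep -i checksum
# The broad workaround used above...
sudo ethtool -K ens192 tx off
# ...or, possibly, just the generic TX checksum feature
sudo ethtool -K ens192 tx-checksum-ip-generic off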
An alternate solution mentioned (as yet untested) seems to indicate that switching from the vmxnet3 driver to simple E1000E card emulation fixes the issue, implying this is an issue with the vmxnet3 interface driver.
Other potential driver options are listed here, but more importantly this page seems to confirm that the driver version is a function of the VMware machine ID and the guest OS. This would explain why everything might work fine with VMware machine ID 13 but not with machine ID 17 (or 19).
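To check which driver/module the guest actually bound to the interface (standard diagnostics; the interface name is an example):
# Driver name and version in use for the interface
ethtool -i ens192
# Details of the in-kernel vmxnet3 module
modinfo vmxnet3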
Confirmed that replacing the network interfaces with E1000-based cards via the vSphere UI fixes the problem, so this does indeed seem to be specifically vmxnet3 driver related.
@fifofonix it would be good to get some of this information into the BZ you opened as well.
At this point I think we are in one of two cases here.
If we could pinpoint which version of the kernel the regression was introduced in then we could provide more information to the upstream kernel maintainers. You can try out older kernels by rpm-ostree override replace http://path/to/kojiid'ing them from the koji builds.
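A rough sketch of what a single bisect step could look like (untested as written; the kernel NVR is just an example taken from the package diff above, and I'm assuming the koji CLI is available somewhere, e.g. in a toolbox container):
# Download a specific kernel build from Koji
koji download-build --arch=x86_64 kernel-5.19.15-201.fc36
# Replace the kernel packages in the current deployment and reboot to test
sudo rpm-ostree override replace kernel-5.19.15-201.fc36.x86_64.rpm \
    kernel-core-5.19.15-201.fc36.x86_64.rpm \
    kernel-modules-5.19.15-201.fc36.x86_64.rpm
sudo systemctl reboot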
Either way we're going to have to find the proper people or list to send this information to. Maybe @jmflinuxtx can point us in the right direction.
The other issue in the BZ relates to already-provisioned 'old' VMware machine ID 13 machines upgrading to the latest OS versions; in this instance it is newly provisioned 'new' machine ID 17 machines that are experiencing issues. I guess it is possible that there is some VMware-level networking issue common to both, but I haven't seen a connection yet that would say they are related.
As for this issue, I was thinking this is not an FCOS issue at all but an issue with a VMware network card emulator (or whatever the right name is for this type of software). In my mental model, I thought that by making the tx off switch I was preventing the offload to a buggy vmxnet3 device and having the guest OS do the checksumming instead of VMware. That would make more sense to me, given that we're only seeing this occur on VMware and with vmxnet3 (now that we have shown that editing the machine to switch from vmxnet3 to E1000E fixes things).
As you say, some experts on this would be great.
This remains an issue with 37.20230205.1.0 on vSphere 7.0 3j (i.e. build 20990077).
Note that to toggle this switch via a NetworkManager connection file in /etc/NetworkManager/system-connections
the syntax is:
...
[ethtool]
feature-tx-checksum-ip-generic=false
...
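For example, a complete drop-in could look like the following; the connection id, interface name, and DHCP assumption are all illustrative, and I believe nmcli can set the same property directly:
# Write a keyfile that keeps DHCP but disables generic TX checksum offload
sudo tee /etc/NetworkManager/system-connections/ens192.nmconnection >/dev/null <<'EOF'
[connection]
id=ens192
type=ethernet
interface-name=ens192

[ipv4]
method=auto

[ethtool]
feature-tx-checksum-ip-generic=false
EOF
sudo chmod 600 /etc/NetworkManager/system-connections/ens192.nmconnection
sudo nmcli connection reload
# Alternatively (untested): nmcli connection modify ens192 ethtool.feature-tx-checksum-ip-generic off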
sudo ethtool -K <ens192 or whatever outbound interface> tx off
executed on the nodes concerned fixes the issue. This switch controls whether TCP checksums are computed by the interface driver (offloaded) or not; switching it off means the checksums are computed on the host, with a consequent CPU performance penalty but no loss of functionality.
An alternate solution mentioned (as yet untested) seems to indicate that switching from the vmxnet3 driver to simple E1000E card emulation fixes the issue, implying this is an issue with the vmxnet3 interface driver.
Other potential driver options are listed here, but more importantly this page seems to confirm that the driver version is a function of the VMware machine ID and the guest OS. This would explain why everything might work fine with VMware machine ID 13 but not with machine ID 17 (or 19).
Confirmed on my setup that swapping to E1000 interface addresses the problem.
Confirmed that this issue still applies at 38.20230322.1.0 with vSphere 7.0.3.01200
Confirmed that this issue still applies at 38.20230414.1.0 (6.2.9-300.fc38.x86_64) with vSphere 7.0.3.01200
I think our status is still at https://github.com/coreos/fedora-coreos-tracker/issues/1372#issuecomment-1382433252
Basically we need to find relevant upstream people who can fix this in the driver(s) itself, right?
Agreed. I'm just periodically testing to see whether it gets fixed magically.
Hi, we were facing the same issue and we have found a workaround. When creating the swarm, we must specify a data-path-port different from the default value of 4789 (see the docs).
So it means if we create the swarm using this command:
docker swarm init --data-path-port=38888
(38888 is just an example, you can set any value as long as it is not 4789)
then the invalid checksum problem disappears, and it's no longer necessary to disable checksum offload in the network driver.
I have no idea why we have such different behavior with port 4789...
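To be clear, the data path port can only be chosen at swarm creation time, so on an existing cluster you have to tear the swarm down and re-init it. Roughly (node roles and join tokens omitted):
# On every node, leave the existing swarm
docker swarm leave --force
# On the first manager, re-create the swarm on a non-default VXLAN port
docker swarm init --data-path-port=38888
# Re-join the other nodes with the join token printed above, then confirm;
# docker info should report the chosen port in its Swarm section
docker info | grep -i 'data path port'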
Interesting. I haven't tested your solution but I found this when googling it and it seems the root of the problem is a conflict with VMware NSX's communication port for VXLAN. This is good to know. Thanks!
On my ESXi 7.0.2, installing CentOS 8, CentOS 7, or Debian 12 hits the same problem: docker swarm cannot reach ports across physical machines over the overlay network. That is, as you said, docker exec -it
Describe the bug
Containers running as part of a service cannot communicate with each other across nodes on an overlay network on newly commissioned docker swarms on vSphere using a FCOS 37+ image.
This issue does not affect nodes provisioned on AWS.
This issue does not affect newly created FCOS 36 vSphere-based nodes, nor FCOS36 nodes that have auto-upgraded to FCOS 37. This remains true even if the swarm is entirely destroyed and re-created.
Reproduction steps
- Create the swarm (node1: docker swarm init, node2: <output from docker swarm init command>)
- docker network create -d overlay test_network
- docker service create --network test_network --replicas 4 nginx
- docker exec -it <node1-container-1-id> curl <node1-container-2-id>. Succeeds. (step 6; see the sketch after this list for locating the container IDs)
- docker exec -it <node1-container-1-id> curl <node2-container-1-id>. Hangs and then times out. Attempting from node 2 to node 1 fails also. (step 7)
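To locate the container IDs referenced in the last two steps (the service name is whatever docker service create generated, so adjust accordingly):
# See where the replicas landed
docker service ls
docker service ps <service-name>
# On each node, list the local replica containers for the curl test
docker ps --filter name=<service-name> --format '{{.ID}} {{.Names}}'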
Expected behavior
In step 7 we expect the same result as that from step 6.
We are deploying via Terraform. If we change one thing and specify an F36 OVA, e.g. 36.20220906, then all steps above will succeed.
Furthermore, if the nodes are allowed to upgrade to the latest F37 then all tests continue to succeed. Deleting the swarm, and repeating re-creation also yields successful steps 1-7.
Actual behavior
Step 7 hangs, eventually timing out.
Note that DNS resolution is fine. The container ID is resolved to the correct IP address on the other node. Installing traceroute in the containers shows that it is not possible to find a route.
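One way to see the symptom on the wire (an untested suggestion; ens192 is an example interface) is to watch the swarm VXLAN traffic on the default data path port and check the checksums in the verbose output:
# Capture swarm VXLAN traffic on the uplink and print checksum validation;
# run this on the receiving node and look for "bad udp cksum" entries
sudo tcpdump -i ens192 -vv udp port 4789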
System details
Ignition config
This ignition has been manually edited with some values expunged.
Additional information
No SELinux denials. No journal logs at all when the failing curl is made. There are docker unit VXLAN errors on overlay network creation, but these are the same for F36 and F37 and so do not seem relevant. Nothing obvious (to me) in the journals.