coreos / tectonic-installer

Install a Kubernetes cluster the CoreOS Tectonic Way: HA, self-hosted, RBAC, etcd Operator, and more
Apache License 2.0

Azure: [Building cluster with | updating to] CL 1409.2.0 yields broken Tectonic Console #1171

Closed justaugustus closed 6 years ago

justaugustus commented 7 years ago

BUG REPORT

Versions

What happened?

While building Tectonic at an Azure client, @metral and I noticed that access to the Tectonic Console was in an inconsistent state. Originally, we operated on the assumption that something was amiss with IAM policies and network security groups (NSGs), but upon further investigation on several subnets, we confirmed the issue was specific to Tectonic.

API access / kubectl commands work as expected. Tectonic Console access is flaky / nonexistent, usually resulting in an ERR_EMPTY_RESPONSE error.

What you expected to happen?

User should be able to access the Tectonic Console and subsequently log in.

How to reproduce it (as minimally and precisely as possible)?

Building any cluster on Azure using current master (https://github.com/coreos/tectonic-installer/commit/bcc6ca488c0da8afd711c050e30b684088d56299)

In addition, you'll need to add the following to the bootkube module in platforms/azure/tectonic.tf to test:

etcd_cert_dns_names = [
  "${var.tectonic_cluster_name}-etcd-0.${var.tectonic_base_domain}",
  "${var.tectonic_cluster_name}-etcd-1.${var.tectonic_base_domain}",
  "${var.tectonic_cluster_name}-etcd-2.${var.tectonic_base_domain}",
  "${var.tectonic_cluster_name}-etcd-3.${var.tectonic_base_domain}",
  "${var.tectonic_cluster_name}-etcd-4.${var.tectonic_base_domain}",
  "${var.tectonic_cluster_name}-etcd-5.${var.tectonic_base_domain}",
  "${var.tectonic_cluster_name}-etcd-6.${var.tectonic_base_domain}",
]

Anything else we need to know?

This issue is specific to clusters running CL 1409.2.0. It initially presented when a dev cluster's CLUO began updating nodes from 1353.8.0 to 1409.2.0. One of the masters successfully updated and another was stuck in a pending reboot state. Once this occurred, Tectonic Console access became inconsistent, while k8s API access remained intact.

Subsequent cluster builds present with the ERR EMPTY RESPONSE error on Tectonic Console.

The behavior from a Chrome Incognito window is as follows:

I can provide the curl commands from Developer Console under separate cover.

We've tried the following scenarios:

All of these confirmed that this occurs upstream and on our fork.

From there, we tested hardcoding the CL version to 1353.8.0, commenting out CLUO, and ensuring locksmithd is masked:

Again, we've tried the following scenarios:

This was successful. I'm not sure what components within CL would be causing this, but we have isolated the behavior to 1409.2.0.

In addition, someone will need to test other platforms to confirm whether this bug is specific to Azure or not.

cc: @metral, @alexsomesan, @s-urbaniak, @Quentin-M

alekssaul commented 7 years ago

Hi @justaugustus ,

I've tested 1409.2.0 earlier on VMware; however, my testing consisted of bringing up the cluster and logging in to the console, without keeping the cluster around for an extended period of time.

Would you mind providing additional details around the symptoms? It sounds like the actual frontend components are running as expected; however, tectonic-console throws ERR_EMPTY_RESPONSE. My assumption is that this error is thrown when the user attempts to log in to the console. I am curious:

justaugustus commented 7 years ago

@alekssaul Thanks for the VMware feedback! To answer your questions:

  1. Clusters w/ CLUO active on 1353.8.0 will have Tectonic Console in a degraded state. I believe some of the nodes upgrade to 1409.2.0 and some are held back. When you hit a 1353.8.0 node, login is successful.
  2. I know @metral was tracking some of the kubectl logs; he can chime in with some specifics.
  3. Not outside of what I mentioned above:

cehoffman commented 7 years ago

I've experienced this too, with the same steps as outlined in 3 above. However, I think it may be a problem in the routing to the cluster too. If I set up an SSH tunnel to another geographic location and access the console, everything works fine. From inside our corporate network, though, we get this problem. The cluster is in East US 2 and access from Alabama is fine; when the connection comes from Missouri, it fails.

Inspecting the requests, I've noticed the failed ones have larger/more headers than the successful ones. The request to identity/callback? for an external OIDC provider includes a large Referer header in our case; removing that header and retrying the request works successfully. However, the console is still inaccessible because the next request now contains a large Cookie header due to the auth token.

At the least, the request header size plays a role in the rate of success for clients that have failures.

This all became evident after the upgrade to 1409.2.0, but I have not tried an OS rollback yet since only some geographic areas have the problem for us.

Quentin-M commented 7 years ago

This has not been reproduced on AWS or VMware so far. Given @cehoffman's answer and what has been discussed offline (frontend.js taking 15-20s to load), it seems that this might be related to Azure's network or Azure's load balancers rather than Container Linux, Dex, or Kubernetes itself.

Note that this does not exclude the possibility that the kube-proxy's configuration or the ingress controller's configuration might play a role here (e.g. time-outs).

Quentin-M commented 7 years ago

Could one reproduce with a simple nginx-based application passing around big requests / headers?
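
For anyone who wants to try exactly that, here is a rough sketch of such a probe (the URL and header sizes are placeholders, not taken from an affected cluster): point it at the console FQDN, or at any plain nginx service exposed through the same Azure LB, and grow the request headers until responses start failing.

# Probe an endpoint behind the Azure LB with increasingly large request
# headers to find the size at which responses start failing
# (ERR_EMPTY_RESPONSE / connection reset). The URL is a placeholder.
for n in 256 512 1024 2048 4096 8192; do
  pad=$(head -c "$n" /dev/zero | tr '\0' 'a')
  printf '%6d bytes of header padding: ' "$n"
  curl -sk -o /dev/null -w '%{http_code}\n' \
    -H "X-Padding: $pad" \
    "https://console.example.com/" || echo failed
done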

cehoffman commented 7 years ago

In a support request with Microsoft, I captured the network traffic from a failing client and each of the master nodes in the backend pool for the load balancer. They analyzed this and gave the suggestion below. I have not applied this change though, since this is an in-use cluster.

After looking through the logs I was able to confirm that the “server hello” packets are not being fragmented properly on their return route.

We have a possible solution but there is no guarantee as it pertains to changing settings on the coreOS VMs that we are not subject experts with. The process would be to change the MTU settings on each backend VM to 1350. This can be done by changing the settings found in “/usr/lib/systemd/network/99-default.link”

I don't feel this is the right solution, because I expect it will effectively leave the master unable to interoperate with the rest of the cluster. I hope this suggestion and information might help the CoreOS team narrow down possible differences between the 1353.8.0 and 1409.2.0 releases which would contribute to this problem.
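
For anyone who wants to experiment with that suggestion anyway: /usr on Container Linux is read-only, so rather than editing /usr/lib/systemd/network/99-default.link in place, the usual approach is to add your own link unit under /etc/systemd/network that matches the interface and sorts ahead of 99-default.link. A rough, untested sketch of the 1350 MTU suggestion (keeping the stock NamePolicy/MACAddressPolicy settings; the file name is mine):

# /etc/systemd/network/00-eth0.link (hypothetical name)
# Pins the NIC MTU to the value suggested by Microsoft support.
[Match]
OriginalName=eth0

[Link]
NamePolicy=kernel database onboard slot path
MACAddressPolicy=persistent
MTUBytes=1350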

metral commented 7 years ago

I just tried 1409.5.0, as it just got added to the list of VM images on Azure, and because it was thought there was some tie to the bug called out in https://github.com/coreos/bugs/issues/2016 with 1409.2.0 and /etc/resolv.conf.

However, 1409.5.0 did not work, and produces the same ERR_EMPTY_RESPONSE issues on the console in the browser as 1409.2.0.

cehoffman commented 7 years ago

This isn't just limited to the console and identity applications. GitLab behind a separate Azure LB also exhibits this problem with a handful of resources. Notably, accessing pipeline runner configuration times out when accessed from the networks that are bad for console and identity. Clients that work for console and identity also work for GitLab.

I've informed the MS rep on the support ticket that access from within the VNet does not display this problem, so they may look into problems/causes due to the LB.

metral commented 7 years ago

@cehoffman Thank you for the continued feedback and updates. Please update this issue accordingly with any other information you get back from MSFT. Could you elaborate on the GitLab setup you have and how that relates to Tectonic?

Based on more tests I've conducted recently, I've even tried removing the LBs in place for console/identity and using an nginx proxy instead. I set the nginx proxy to be the FQDN for the console/cluster, so it is the only LB in the setup, and running on 1409.5.0 I still hit the ERR_EMPTY_RESPONSE. Pinning this branch to 1353.8.0 is the only solution that worked, so something tells me it may not necessarily be the ILB. Random network latencies and timeouts have been noticed in various forms; e.g. the clusters stood up in the OP are seeing latencies of ~17 seconds to pull down frontend.js on Console, and it's only 1.5MB. I'm wondering if vnets are having weird experiences lately?

cehoffman commented 7 years ago

@metral

GitLab is running within the same cluster using the official gitlab helm chart and is fronted by a separate nginx ingress controller (beta.8) and a separate Azure LB. This provides a data point that the problem may not be in the console or identity applications. This configuration too was working without issue on 1353.8.0, but developed problems with 1409.2.0 and 1409.5.0.

cehoffman commented 7 years ago

Something that is different between Azure and the other providers is the WALinuxAgent loaded onto the OS. Not all machines in the cluster appear to have the same version. The majority have 2.2.13 as the current_version, but one master has 2.2.12 and one worker has 2.2.10. Both stayed below 2.2.13 through a reboot. I don't know anything about the purpose behind this tool, but from the logs it appears to auto-upgrade itself, and it seems odd that some would be on older versions.

sym3tri commented 7 years ago

@metral have you considered that the CLUO update or CL version may be a red herring? What if you manually boot to a known working fixed version and manually cycle the nodes one by one? Does everything work in that case?

metral commented 7 years ago

@sym3tri Potentially, but we've had a working cluster on 1353.8.0 that slowly became unusable as 1409.2.0 rolled out on June 20th. One node even got stuck pending reboot, and it was the only one left on 1353.8.0, while the rest of the nodes were updated to 1409.2.0. This cluster was pretty much unusable unless the request ended up being picked up by the 1353.8.0 machine.

@justaugustus any missing details ^?

cehoffman commented 7 years ago

I can confirm that taking a non-working set of worker nodes that had upgraded to 1409.2.0, provisioning new workers into the cluster pinned to 1353.8.0, and then deprovisioning the 1409.2.0 workers resolved the network issues and empty responses for me. I even confirmed that having the LB direct a request to a 1409.2.0 node, which then routed it to a 1353.8.0 node for actual service, would cause the issue to manifest. It is something to do with the network stack on 1409.2.0 instances.

metral commented 7 years ago

I just created a public cluster off master (a07e2e5fa66a7aaf4aa9fded3a80b367c1696325) on Azure, and the following data points are interesting as far as accessing the Tectonic Console from different machines + browsers goes:

crawford commented 7 years ago

@metral the Windows Azure Linux Agent is a Python tool shipped within CL that is needed to talk to the Azure fabric. It's shipped in the OEM partition and the on-disk copy is never updated (which is why those version numbers aren't changing); it re-execs itself after self-updating, so it should always run the latest. It is mainly responsible for setting up the resource disk, creating users, and setting passwords and SSH keys, and it shouldn't be messing with the network or anything else on the system.

cehoffman commented 7 years ago

Overall, the latency on 1409.2.0 and later is considerably worse. Looking at an external service providing latency reporting on a Quay Enterprise registry I had running in the cluster, you can see a drastic drop in latency from 1.2s to ~600ms when I replaced the 1409.5.0 workers with 1353.8.0 images.

[screenshot, 2017-06-28: latency graph showing the drop from ~1.2s to ~600ms]

~Can anyone point me to a repository or anything that tracks the differences between CoreOS versions in a more granular way than what is available at https://coreos.com/releases/~ I found the coreos/manifests repo and the tags for the versions. I've started looking at which included tools changed and seeing if anything pops out as a likely candidate.

metral commented 7 years ago

@cehoffman Thank you for the continued updates and data, it is greatly appreciated!

I'm running network captures on port 32000 of the masters and am seeing lots of retransmissions of TCP segments, ACKs, etc. after signing in and just sitting on the main page. This could be a sign of network congestion in 1) CL 1409.x.0, 2) Azure, or both, but further discovery is required to be certain.

Please let us know if you discover any other interesting data points.

metral commented 7 years ago

After further investigation, it appears that VXLAN changes in the recent 4.11.x kernel included in CL 1409.x.0 are not playing well with Azure's network driver and have introduced a regression. MSFT/Azure is aware of the issue.

As a temporary workaround, configuring Flannel to use the UDP backend instead of VXLAN, and opening port 8285 in the master and worker NSGs of the vnet module, does away with the issue until a better solution is found.
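
For anyone applying that workaround in the meantime: flannel's backend is chosen in its network config (the net-conf.json key of the kube-flannel ConfigMap, or the /coreos.com/network/config etcd key, depending on how flannel is deployed). Roughly, with the pod CIDR below being a placeholder:

{
  "Network": "10.2.0.0/16",
  "Backend": {
    "Type": "udp",
    "Port": 8285
  }
}

Port 8285/udp then needs to be allowed between nodes in the master and worker NSGs, as noted above.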

cehoffman commented 7 years ago

Thanks @metral. Is there any kind of issue number at Azure/MSFT to track this? I have a contact that can keep me updated on resolution, but he needs some way of finding the issue.

squat commented 7 years ago

I wrote up a doc describing the minimal setup for reproducing the VXLAN issue on Azure. It should be useful for testing different kernels to see whether they exhibit the issue: https://gist.github.com/squat/1c2799c3565c383fe4b1499c101bfc49.

cc @crawford @sym3tri @metral @justaugustus
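
(The gist is the authoritative set of steps. Very roughly, the idea is a hand-built VXLAN tunnel between two Azure VMs, with no flannel or Kubernetes involved, over which packets above the problematic size are pushed; the addresses, VNI, and sizes below are illustrative only.)

# On VM A (private IP 10.0.0.4); mirror on VM B (10.0.0.5) with addresses swapped.
# UDP 4789 must be allowed between the VMs.
sudo ip link add vxlan0 type vxlan id 100 dev eth0 dstport 4789 remote 10.0.0.5
sudo ip addr add 192.168.100.1/24 dev vxlan0   # use 192.168.100.2/24 on VM B
sudo ip link set vxlan0 up

# Small packets make it through; encapsulated packets past ~1370 bytes on the
# outer interface blackhole on affected kernel/host combinations.
ping -c 5 -M do -s 1400 192.168.100.2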

metral commented 7 years ago

Thank you @squat for the very thorough issue and instructions, it's greatly appreciated.

metral commented 7 years ago

A patch for the VXLAN issue in the Hyper-V support has been released by MSFT: https://www.spinics.net/lists/stable/msg178150.html.

This patch was applied, tested, and confirmed as working on kernel v4.11, the version introduced in CL 1409.x.0.

A patch release of CL is expected next week.

robszumski commented 6 years ago

This rolled out to the stable channel yesterday. Closing, but reopen if needed.

cehoffman commented 6 years ago

1409.7.0 did not result in a fix for me. I have not had a chance to create a new cluster from scratch, but upgrading an existing one from 1353.8.0 to 1409.7.0 on all nodes still results in ERR_EMPTY_RESPONSE.

crawford commented 6 years ago

I've been attempting to narrow down what's happening and I've found something interesting. Container Linux 1353.8.0, which multiple parties have verified as working, is no longer working. This leads me to believe that something in the Azure infrastructure has changed.

cehoffman commented 6 years ago

So, I've been feeling something is off with 1353.8 over the last week or two now. I don't have the empty response body with it, though.

cehoffman commented 6 years ago

I can confirm this morning, once I got into work (a location where this ERR_EMPTY_RESPONSE manifests), that I am now seeing the same problem on a cluster that has not changed since 5pm Central yesterday, when it was last checked and working from this location. The cluster is using only 1353.8.0 nodes.

crawford commented 6 years ago

@cehoffman Which region and VM sizes are you using?

cehoffman commented 6 years ago

@crawford I am in East US 2 with master nodes using Standard_DS3_v2_Promo and the workers using Standard_DS12_v2_Promo

cehoffman commented 6 years ago

I don't know if this is useful at all, but when I isolate the ingress controller and console to a single instance each, on the same machine, there is no problem accessing the console. This tells me that the problem occurs purely when the traffic has to leave the VM and traverse the Azure VNet. Just more confirmation that it is a problem in either how flannel, kernel, or Azure handles the container traffic.

crawford commented 6 years ago

@cehoffman Do you have boot logs from when this did work? I'm specifically interested in the line that starts with Hyper-V Host.

crawford commented 6 years ago

Just more confirmation that it is a problem in either how flannel, kernel, or Azure handles the container traffic.

Our repro case doesn't actually use flannel, so that eliminates one more potential cause. I agree that it is some interaction between the kernel and Azure. I don't think it is purely the kernel (because this used to work) and I don't think it's purely Azure (otherwise a lot more people would be affected).

squat commented 6 years ago

@cehoffman our testing and repro case tell us it's an issue with the VXLAN interaction between Azure and the kernel.

cehoffman commented 6 years ago

Unfortunately I don't have any boot logs.

crawford commented 6 years ago

It occurred to me that a change in DHCP could have affected this. If Azure used to respond with the MTU option (which I don't see anymore), that could have increased it enough to hide the issue in most cases.

cehoffman commented 6 years ago

Interesting thought. Would it be possible to set the MTU of the default NIC through Ignition to get around this when provisioning new nodes?

crawford commented 6 years ago

@cehoffman I believe so. We did some basic testing after bumping the MTU to 1600 and it looks okay. We are waiting to hear back from Microsoft about a safe default. In the meantime, here is the config I put together:

{
        "ignition": { "version": "2.0.0" },
        "networkd": {
                "units": [{
                        "name": "00-eth0.link",
                        "contents": "[Match]\nOriginalName=eth0\n\n[Link]\nNamePolicy=kernel database onboard slot path\n\nMACAddressPolicy=persistent\n\nMTUBytes=1600"
                }]
        }
}
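
(Decoding the escaped contents above, the link unit that Ignition writes out, roughly at /etc/systemd/network/00-eth0.link, is the following; the blank lines inside [Link] are harmless, since a section runs until the next section header.)

[Match]
OriginalName=eth0

[Link]
NamePolicy=kernel database onboard slot path

MACAddressPolicy=persistent

MTUBytes=1600
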
squat commented 6 years ago

I just tested a cluster with the necessary networkd link unit in place. With the unit, requests over vxlan work; without the unit, requests on the same cluster blackhole. Looks good right now.

cehoffman commented 6 years ago

Using 1600 was better but not perfect. I found that the backup sidecar for etcd-operator was unable to do backups with this setting. I then tried the MTU suggested by Microsoft support that I posted earlier (1350). This resulted in the backup sidecar working and one of the other symptoms getting much better.

The symptom I'm referencing is gaps in the prometheus monitoring that gets reported for pod and node stats in the tectonic console UI. This is a screenshot of the symptom:

[screenshot, 2017-08-01: gaps in the Prometheus pod/node graphs in the Tectonic Console]

I believe this is from the general network instability resulting in blackholed requests or duplicate ACKs. The fact that it gets better with these changes suggests it is related, and before this problem surfaced the graphs had no gaps, to my recollection. This cluster has 130 pods and 10 nodes, in case this is expected behavior.

crawford commented 6 years ago

@cehoffman That's interesting. Were you able to get to the console when you set the MTU on all of the nodes to 1350? When you tried 1600, was that on all nodes?

cehoffman commented 6 years ago

Yep to both. Full cluster cycle so all nodes had the MTU setting. I also used 1409.7.0 on all the nodes for both MTU tests.

cehoffman commented 6 years ago

I believe this to be related.

I've been demoing Portworx integration with Kubernetes. Nightly, the volume for a postgresql container would become unusable due to I/O errors. You'd have to stop the container, mount the volume in the Portworx container, and run an fs repair. This test isn't an abnormal use case for Portworx, so I believe the problem stemmed from this networking issue rather than from Portworx's normal operation.

I had tried the 1600 MTU overnight the night before last, and this problem occurred yet again. Last night I had the cluster using the 1350 MTU, and the problem did not manifest. I will need to let it keep going, but I also have a more strenuous test that resulted in problems nearly on demand (you have to wait about 1.5 hours after starting). When I get more time, I plan to launch the strenuous test again.

crawford commented 6 years ago

@cehoffman I wanted to give you an update on what we've discovered. We provisioned a bunch of vanilla clusters (no changes to the MTU; using the latest Container Linux) in westcentralus and found that some of them succeeded and some of them failed. On the clusters that failed, if we disabled TX checksum offloading (sudo ethtool -K eth0 tx off), they began to work properly! Since this only happens some of the time, I'm led to believe that there is some sort of hardware/hypervisor defect.

In our initial investigation, we found that packets greater than 1370 bytes in length were being dropped, but anything shorter always succeeded. We validated that setting the MTU to 1350 also causes clusters to work properly. This makes it sound like some hardware/hypervisor is having trouble performing checksums on packets greater than 1370 bytes in length.

Turning off TX checksum offloading and/or lowering the MTU to 1350 appear to mitigate the issue. Pretty crazy.
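
If anyone wants to carry the ethtool mitigation across reboots while waiting on a proper fix, one option (untested here; the unit name is mine, and the ethtool path may differ per image) is a small oneshot service, which could also be shipped via Ignition like the MTU unit above:

# /etc/systemd/system/disable-tx-csum.service (hypothetical name)
[Unit]
Description=Disable TX checksum offloading on eth0 (Azure VXLAN workaround)
After=network.target

[Service]
Type=oneshot
ExecStart=/usr/sbin/ethtool -K eth0 tx off
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target

Enable it with systemctl enable --now disable-tx-csum.service.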

cehoffman commented 6 years ago

@crawford This is wild. It makes more sense as a workaround than the MTU changes, though.