Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

AKS in VNET behind company HTTP proxy #205

Closed: markwaldkat closed this issue 2 years ago

markwaldkat commented 6 years ago

I need to deploy AKS into a custom VNET, that is behind a company HTTP proxy to access the public internet.

With ACS or acs-engine I couldn't get this working out-of-the-box as the cloud-init scripts need internet access before I'm able to set the http_proxy on all nodes.

Is this possible with AKS once #27 is supported?

boubou191911 commented 6 years ago

+1. I have a similar setup where the AKS subnet has no direct access to Internet except via a forward proxy.

eosfor commented 6 years ago

+1

haroldwongms commented 5 years ago

Checking to see if there is any update on this issue. I'm being asked by my customers for the ability to include HTTP Proxy settings during deployment of the cluster since they require all Internet traffic to go through a proxy.

rajshakerp commented 5 years ago

I see no good way to restrict inbound/outbound traffic; I'm looking for a similar solution.

xorima commented 5 years ago

👍

palma21 commented 5 years ago

Today you can restrict outbound connectivity as per https://aka.ms/aks/egress. Inbound connectivity is not required.

For this you can use NSGs or FWs: eg. https://aka.ms/aks/secure

We will evaluate providing support to define proxy settings at create time and fully disconnected create options.

tommyJimmy87 commented 4 years ago

Hi, is this issue still in progress?

I have the same scenario: basically a corporate VNET where I'm going to create AKS. Unfortunately, I have to set up the proxy to access the internet. Is that currently possible?

zomarg commented 4 years ago

For better control over egress (and for cases where we have to use a forward proxy because of security requirements), you can use a service mesh like Istio.

Using external HTTPS Proxy with Istio: https://istio.io/docs/tasks/traffic-management/egress/http-proxy/#configure-traffic-to-external-https-proxy

Hope you find it helpful.

tommyJimmy87 commented 4 years ago

Thanks @GramozKrasniqi for your answer, but I hope there are other ways than setting up Istio, which I don't really need in this case.

Also, I have the same problem pulling the Docker images, and Istio won't help with that.

heoelri commented 4 years ago

@palma21 any news on this? I have another customer with a similar request. They want to route all egress traffic through their corporate proxy server infrastructure. Is/will this be a supported scenario?

The option I was thinking about is the regular egress filtering via Azure Firewall plus a DaemonSet that configures a few things (apt) to use a proxy server (but I'm not sure that's a good idea, as the traffic then has to leave the Azure datacenter even if the repo is close to it).
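A minimal sketch of what such a DaemonSet could write onto each node, assuming Ubuntu nodes and a hostPath mount of the host filesystem. It renders apt's Acquire::http::Proxy / Acquire::https::Proxy directives and KEY=value lines for /etc/environment. The helper names and the proxy URL are hypothetical, and none of this runs until the node has already bootstrapped.

```python
from typing import List


def render_apt_proxy(http_proxy: str, https_proxy: str) -> str:
    """Contents for an apt.conf.d fragment (hypothetical file, e.g. 95proxy)."""
    # apt reads Acquire::<scheme>::Proxy directives from /etc/apt/apt.conf.d/.
    return (
        f'Acquire::http::Proxy "{http_proxy}";\n'
        f'Acquire::https::Proxy "{https_proxy}";\n'
    )


def render_environment(http_proxy: str, https_proxy: str, no_proxy: List[str]) -> str:
    """KEY=value lines for /etc/environment (read at login by pam_env)."""
    values = {
        "HTTP_PROXY": http_proxy,
        "HTTPS_PROXY": https_proxy,
        "NO_PROXY": ",".join(no_proxy),
    }
    lines = []
    for key, value in values.items():
        # Write both spellings: some tools only read the lowercase variable.
        lines.append(f"{key}={value}")
        lines.append(f"{key.lower()}={value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    proxy = "http://proxy.example.com:8080/"  # placeholder proxy URL
    print(render_apt_proxy(proxy, proxy), end="")
    print(render_environment(proxy, proxy, ["localhost", "127.0.0.1"]), end="")
```

Note that containerd and the kubelet run as systemd services, which do not read /etc/environment, so image pulls would not pick these values up without separate systemd drop-ins.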

davemedvitz commented 4 years ago

Working with an Azure Gov customer who is required to go through a proxy for their AKS deployment. They've waited months for AKS, only to not be able to use it.

eosfor commented 4 years ago

@davemedvitz , I assume this might be an answer

https://docs.microsoft.com/en-us/azure/aks/private-clusters

davemedvitz commented 4 years ago

That might be, when private endpoints are in Gov...

But even then, there doesn't appear to be a mechanism to add proxy config to the nodes so that they can build properly.

crgarcia12 commented 4 years ago

+1 - I need to be able to provide root CA certificates before the cluster makes any call to the internet, due to SSL Proxy inspectors in the way.

heoelri commented 4 years ago

We've started to put together some material on using AKS Engine-based K8s clusters behind proxy servers that might be applicable to AKS as well. https://github.com/Azure/aks-engine/blob/master/docs/topics/proxy-servers.md

davemedvitz commented 4 years ago

The manual piece of that document represents most of what I do to get my AKS nodes to finally run (then I have to convince cloud-init to run again, and then the Custom Script extension). I'd have to look again, but I don't believe the DaemonSet piece of the solution will work. The reason is that (unless it changed when I wasn't looking) the scripts used to set up the node first use netcat (nc) to test direct connectivity to one of the container registries. When this fails, the node configuration doesn't run, so the node doesn't get built and never registers with the control plane.
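The bootstrap check described here is a direct TCP connect, so it cannot succeed when outbound traffic is only allowed via the proxy. A rough Python equivalent of nc -z, for illustration only (it is not the AKS provisioning script, and direct_tcp_reachable is a hypothetical name):

```python
import socket


def direct_tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Rough equivalent of `nc -z host port`: open a TCP connection directly.

    No proxy is involved, so on a network where only the proxy may reach the
    internet this returns False even though proxied HTTPS would work.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, DNS failure or timeout
        return False
```

On a node that can only egress through the proxy, calling this against a registry host on port 443 would return False, even though an HTTPS request sent through the proxy could succeed. That is why a later DaemonSet cannot help: the check fails before anything it configures exists.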

I've not tried with Windows nodes at all; maybe I'll put that on my list of things to do.

Not sure if the AKS Engine does something similar.

-Dave

ghost commented 4 years ago

This issue has been automatically marked as stale because it has not had any activity for 60 days. It will be closed if no further activity occurs within 15 days of this comment.

Tbohunek commented 4 years ago

Yes, we still want it! :)

ponson-thankavel commented 4 years ago

I am looking for the same feature... OpenShift supports it, why not AKS?

Tbohunek commented 4 years ago

Sadly, OpenShift doesn't support proxy either.

ponson-thankavel commented 4 years ago

Are you sure? See this documentation from Openshift - https://docs.openshift.com/aro/4/networking/enable-cluster-wide-proxy.html

Tbohunek commented 4 years ago

Thanks for the documentation link. I have checked with our team and no, OpenShift still doesn't fully support a proxy: some components, such as calls to the Azure Management API, don't support it. We have to deploy a transparent proxy. 😢

markmassad commented 3 years ago

Can we get an update on this? It seems from @miwithro that a related feature (i.e. CustomNodeConfiguration) was closed back in January. We have a few (large) tenants who cannot even touch AKS without forward-proxy support, especially in Gov.

evergata commented 2 years ago

Hello, how long will the public preview last before going GA?

miwithro commented 2 years ago

Just moved this to Public Preview looking to move this to GA by end of the calendar year.

mihohoi0322 commented 2 years ago

Thanks for the GA! I tried the proxy settings, but I got an error message:

(HTTPProxyConfigInputError) HTTPProxy and HTTPSProxy both cannot be empty.

The config file is below:

{
    "httpProxyConfig": {
        "httpProxy": "http://xxx.xxx.xxx.xxx:80/",
        "httpsProxy": "http://xxx.xxx.xxx.xxx:80/",
        "noProxy": [
            "localhost",
            "127.0.0.1"
        ]
    }
}

I have checked the documentation and created a json file, is there a setting I am missing? If you need to create a new issue, please let me know.

miwithro commented 2 years ago

@mihohoi0322 did you add in "trustedCa"?

mihohoi0322 commented 2 years ago

@miwithro Thank you for your comment. I did not add 'trustedCa' this time because it was AKS for internal verification. Is this required? I added a temporary string and tried again, but the same error was displayed.

miwithro commented 2 years ago

Adding in @alexeldei to look into this.

JulioFor commented 2 years ago

@mihohoi0322 I don't know yet whether it fully works, because I have to ask for the URLs to be opened in the proxy. At first I got the same error using the az CLI (az aks create): (HTTPProxyConfigInputError) HTTPProxy and HTTPSProxy both cannot be empty. But if you use ARM it seems to work; at least it lets you create the AKS cluster: az deployment group create --name ExampleDeployment --resource-group myrg --template-file aks-proxy-config-arm.json

JulioFor commented 2 years ago

@miwithro @mihohoi0322 Finally! I made it work with the az CLI. What was the problem? The JSON format: it has to go without the httpProxyConfig wrapper. That makes sense, since the error says it can't find the parameters. So the correct way to write the JSON is simply:

{
    "httpProxy": "http://xxxx:8080/",
    "httpsProxy": "https://xxxx:8080/",
    "noProxy": [
        "localhost",
        "127.0.0.1"
    ]
}

Confirmed that it works
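A small sketch of generating this file, assuming Python is available where the az CLI runs. It writes the keys at the top level (no httpProxyConfig wrapper) as strict JSON. The both-empty check only mirrors the error message quoted above; it is an assumption, not the service's actual validation logic, and the function names are hypothetical.

```python
import json
from pathlib import Path
from typing import List, Optional


def build_http_proxy_config(http_proxy: str, https_proxy: str, no_proxy: List[str],
                            trusted_ca_b64: Optional[str] = None) -> dict:
    # Client-side mirror of "HTTPProxy and HTTPSProxy both cannot be empty".
    if not http_proxy and not https_proxy:
        raise ValueError("httpProxy and httpsProxy cannot both be empty")
    # Keys sit at the top level: no "httpProxyConfig" wrapper object.
    config = {"httpProxy": http_proxy, "httpsProxy": https_proxy,
              "noProxy": list(no_proxy)}
    if trusted_ca_b64:
        config["trustedCa"] = trusted_ca_b64  # optional base64-encoded CA
    return config


def write_http_proxy_config(path: str, config: dict) -> None:
    # Strict JSON: double quotes, no trailing commas.
    Path(path).write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")


if __name__ == "__main__":
    cfg = build_http_proxy_config("http://xxxx:8080/", "https://xxxx:8080/",
                                  ["localhost", "127.0.0.1"])
    print(json.dumps(cfg, indent=4))
```

The resulting file (for example aks-proxy-config.json) is what gets passed to az aks create via --http-proxy-config, together with the usual cluster arguments.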

JulioFor commented 2 years ago

Omg, it couldn't be that easy. When we use our Windows Nano images it gives us this: Failed to save Kubernetes service 'xxxx'. Error: HTTPProxyConfig is not supported for OS type: . Am I right to assume that Windows is not supported with this feature?

miwithro commented 2 years ago

Windows is not supported yet for this feature. @justindavies to track on the Windows side.

JulioFor commented 2 years ago

@miwithro I am still testing this feature and have a new issue that I think will be resolved in the future. I can't pull images.

Failed to pull image "alpine:3.13": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/library/alpine:3.13": failed to resolve reference "docker.io/library/alpine:3.13": failed to do request: Head "https://registry-1.docker.io/v2/library/alpine/manifests/3.13": proxyconnect tcp: net/http: TLS handshake timeout

I am using Azure DNS and an unrestricted proxy, and it is a problem related to the proxy feature: if you add a Windows node, which does not use the proxy feature, pulling Windows images works fine. I've even tried other proxies in case the proxy was the problem, but no luck.

As a workaround I have created a private link to my container registry and added its URL to "noProxy". This way the traffic never leaves Azure. If pulling images through proxies is not going to work, I think that should at least be documented. For me it is OK, since it adds another security layer (all images have to be in your private registry), but I don't know if it is made on purpose.

AndreasMWalter commented 2 years ago

@justindavies @miwithro #2259 I am currently working for a customer that uses a transparent proxy with TLS Inspection. Is it possible that you would allow configuring the trustedCa independently from httpProxy, httpsProxy and no_proxy setting?

Also from the Documentation it appears that no_proxy cannot be updated after the cluster is deployed. Making this configurable after the fact would greatly increase the usability of the feature.

{
    "trustedCa": "XYZBASE64CA"
}
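The trustedCa value is the base64 encoding of the CA certificate's PEM text. A minimal sketch of producing it (function names are hypothetical; whether several certificates can be bundled into one value is not covered here):

```python
import base64
from pathlib import Path


def trusted_ca_value(pem_text: str) -> str:
    """Base64-encode PEM certificate text for use as trustedCa."""
    return base64.b64encode(pem_text.encode("ascii")).decode("ascii")


def trusted_ca_from_file(path: str) -> str:
    # Read the PEM file as raw bytes so line endings are preserved exactly.
    return base64.b64encode(Path(path).read_bytes()).decode("ascii")
```

The output is a single line without wrapping, the same as base64 -w 0 on a PEM file with GNU coreutils.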

And something that would also help is the ability to inject a list of trusted PKI certificates as an independent feature.

CC @jabbera

phgogo commented 2 years ago

I totally support this! The Proxy settings are currently the only workaround to trust a 3rd Party CA on cluster Nodes (without manually injecting them via SSH), even if you don't use a proxy.

Being able to set HTTP_PROXY="" while also importing the CA would help us a lot. Currently our only workaround is to list all needed TLDs in NO_PROXY, like NO_PROXY=.com,.net,.io...

jabbera commented 2 years ago

A couple of points. We use multiple issuers for different edges, so being able to specify more than one cert would be helpful.

Second, you can do this today with a DaemonSet, but it takes FOREVER for the node to start up properly.

http://hypernephelist.com/2021/03/23/kubernetes-containerd-certificate.html

miwithro commented 2 years ago

We have another feature in our backlog to allow customers to configure their CA as a separate node configuration. We hope to look at this in early Q1 2022.

https://github.com/Azure/AKS/issues/2259

phgogo commented 2 years ago

Maybe it's an option for you to work together with the people working on this proxy feature?

Since they already have a working way of importing CAs during an AKS deployment (which is also already well integrated in API-consuming tools like Terraform), it seems easier to me to add the possibility to deploy a CA without a proxy config than to work on an individual custom CA integration feature.

miwithro commented 2 years ago

@phgogo we are one and the same :)

hakabo commented 2 years ago

When the HTTP proxy feature is enabled, we're finding that in-cluster communication no longer works (i.e. it's forced to the proxy) when referencing a service by anything other than its FQDN.

We have added svc.cluster.local to the no_proxy parameter, but that, as expected, does not help bypass the proxy when trying to reach any of the following:

http://svc-name/
http://svc-name.namespace/
http://svc-name.namespace.svc/

Only http://svc-name.namespace.svc.cluster.local/ bypasses the proxy with our svc.cluster.local addition to no_proxy.

Is this expected behavior? Is there a solution to this?
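One way to see why only the FQDN bypasses the proxy: the HTTP client makes the proxy decision on the hostname exactly as written in the URL, before any DNS search-domain expansion, and when it does use the proxy it never resolves the name itself. A simplified model of suffix matching (not the exact algorithm of any particular client; curl, Go and others differ in details such as leading dots and ports):

```python
from typing import Iterable


def bypasses_proxy(host: str, no_proxy: Iterable[str]) -> bool:
    """Simplified hostname check against no_proxy entries.

    Each entry is treated as a domain suffix (leading dot ignored); '*'
    bypasses everything. This is a model, not a specific client's code.
    """
    host = host.lower().rstrip(".")
    for entry in no_proxy:
        entry = entry.strip().lower().lstrip(".")
        if not entry:
            continue
        if entry == "*":
            return True
        if host == entry or host.endswith("." + entry):
            return True
    return False


NO_PROXY = ["localhost", "127.0.0.1", "svc.cluster.local"]

if __name__ == "__main__":
    for name in ["svc-name", "svc-name.namespace", "svc-name.namespace.svc",
                 "svc-name.namespace.svc.cluster.local"]:
        print(f"{name:40} bypass={bypasses_proxy(name, NO_PROXY)}")
```

With the no_proxy list above, only the last name bypasses; the three shorter forms never end in svc.cluster.local as typed, so they are sent to the proxy.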
Appreciate your help,

hakabo commented 2 years ago

When configuring the http_proxy feature directly with our proxy address, the feature appears to be working well. The cluster builds, system pods deploy.. fantastic.

However, we have two proxies, and we more commonly use a load-balanced address or a DNS alias for the proxy address. When either of these is used, the cluster build times out and fails.

We're using Terraform and the failure message is as follows

Code="ControlPlaneAddOnsNotReady" Message="Pods not in Running status: metrics-server-6576d9ccf8-dsdr8,tunnelfront-64f444f956-rmjzb,tunnelfront-65c6b5944b-fhwfh,coredns-845757d86-gfxrp,coredns-autoscaler-7d56cd888-rh9hz

If we switch back to using the address of the proxy directly (rather than an alias for the proxy for example) the cluster builds without issue. Our networking team report they can not see any traffic hitting the proxy when using the alias (which is used in many other instances with success).

My question is, how can I troubleshoot this? Where would be the appropriate logs for this sort of failure? Has anyone else reported any issues when using an alias? My suspicion is that a certificate check is failing as instead of proxy.domain.com we're trying to use an alias, proxyAlias.domain.com - but we've not experienced an issue with this method in any other instance.

Any help in troubleshooting this would be appreciated.

(Using 1.22.6)

alexeldeib commented 2 years ago

how can I troubleshoot this? Where would be the appropriate logs for this sort of failure?

are you able to get pod logs or kubectl describe pod for the pods listed as not Running? does the cert have SANs properly configured for all the relevant domains?

we're finding that inter cluster communication no longer works Is this expected behavior? Is there a solution to this?

See also https://github.com/Azure/AKS/issues/2674

you can override the system-injected no_proxy pod env vars; we will respect the values if you set them yourself. would that solve your issue?

alexeldeib commented 2 years ago

Our networking team report they can not see any traffic hitting the proxy when using the alias

well, that at least rules out TLS issues I would think. I'd try pod logs and/or a tcpdump from the pod/node failing to reach through the proxy if the pod logs aren't obvious.

unfortunately proxy configuration itself is somewhat out of scope for AKS but happy to do what I can to make sure our config should work for your scenario -- nothing you mentioned jumps out to me immediately as an issue.

hakabo commented 2 years ago

thanks for getting back to me @alexeldeib

are you able to get pod logs or kubectl describe pod for the pods listed as not Running? does the cert have SANs properly configured for all the relevant domains?

The pods are Pending due to "no nodes available to schedule pods". Looking at Node Pools > Nodes via the portal, there are none listed. Looking at the scale set, it's reporting a failure due to:

VM has reported a failure when processing extension 'vmssCSE'. Error message: "Enable failed: failed to execute command: command terminated with exit status=50 [stdout] [stderr] % Total % Received % Xferd Average Speed Time Time Time Current Dload Upload Total Spent Left Speed 0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0* Could not resolve proxy: proxyalias.domain.com * Closing connection 0 curl: (5) Could not resolve proxy: proxyalias.domain.com Command exited with non-zero status 5 0.00user 0.00system 0:00.00elapsed 66%CPU (0avgtext+0avgdata 11024maxresident)k 0inputs+8outputs (0major+603minor)pagefaults 0swaps " More information on troubleshooting is available at https://aka.ms/VMExtensionCSELinuxTroubleshoot

Not being able to resolve proxyalias.domain.com (the real fqdn has been replaced for this post) is unexpected as the DNS servers that are configured on the VNET (which I expect to be inherited into the nodes) are able to resolve the proxyalias fqdn. We know from other systems/services that proxyalias works. We also know that if proxyalias was changed to proxy.domain.local the cluster deploys without issue.
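Since curl reports "Could not resolve proxy", the failing step is the node's own DNS lookup of the proxy hostname, before any traffic reaches the proxy (consistent with the networking team seeing nothing). A small sketch for checking that lookup from a node, assuming Python is available there; resolve_proxy is a hypothetical helper:

```python
import socket
from typing import List
from urllib.parse import urlsplit


def resolve_proxy(proxy_url: str) -> List[str]:
    """Resolve the host part of a proxy URL with the local resolver.

    Raises socket.gaierror if the name cannot be resolved, which is the
    same condition curl reports as 'Could not resolve proxy'.
    """
    parts = urlsplit(proxy_url)
    if not parts.hostname:
        raise ValueError(f"no host in proxy URL: {proxy_url!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    infos = socket.getaddrinfo(parts.hostname, port, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); keep the IPs.
    return sorted({info[4][0] for info in infos})
```

Running it on a node with both the direct proxy address and the alias, and comparing against the nameservers in that node's /etc/resolv.conf, would show whether the node is really using the VNET's DNS servers.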

well, that at least rules out TLS issues I would think. I'd try pod logs and/or a tcpdump from the pod/node failing to reach through the proxy if the pod logs aren't obvious.

I'll continue to investigate and try to get direct access to the nodes. kubectl debug obviously doesn't work; I'd need to SSH directly into the nodes, which is currently not enabled.

you can override the system injected no_proxy pod env pod vars, we will respect the values if you set them yourself. would that solve your issue?

We can override, and have done so with a few addresses including svc.cluster.local as mentioned above. However, no matter what is added to the no_proxy override, you still would not be able to reference http://svc-name/. Adding overrides would not be a practical or scalable solution, as we do not know the names of all possible namespaces and services that we would be using in the future.

appreciate your help

alexeldeib commented 2 years ago

i'd need to ssh directly into the nodes - currently not enabled.

I think you could try Azure Bastion or else worst case public IP per node. Unfortunately I don't have much insight into the DNS failure beyond what you've already shared -- checking /etc/resolv.conf and manually trying to resolve DNS from the node while tcpdump'ing would probably be my next steps.

no matter what is added to the no_proxy override you still would not be able to reference http://svc-name/

err, can you elaborate? AKS itself no_proxies konnectivity for the in-cluster service and it seems to work fine.

Adding overrides would not be a practical or scaleable solution as we do not know the name of all possible namespaces and services that we would be using in the future.

right, but presumably at the time you create a new service, nothing is using it, and you can create the service and then add it to no_proxy overrides. in the future we could potentially make no_proxy updateable at the cluster level. I think the combo of those could solve this issue?

I don't know how we can generally avoid no_proxy'ing services...maybe a webhook or something which adds all in-cluster service names to no_proxy by default? but this could be tricky

alexeldeib commented 2 years ago

err, can you elaborate? AKS itself no_proxies konnectivity for the in-cluster service and it seems to work fine.

ah, you are saying there is no catch all override for the whole cluster like mentioned elsewhere. I follow now.

hakabo commented 2 years ago

Apologies if I'm not making myself clear; let me try again.

A service with the DNS record my-service in namespace my-ns should be accessible by pods in my-ns by referencing/looking up my-service. A pod in another namespace would need to reference my-service.my-ns. That's standard Kubernetes behaviour as I understand it.
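That short-name behaviour comes from the resolver's search list. For a pod in namespace my-ns with the default ClusterFirst DNS policy, Kubernetes writes the search domains my-ns.svc.cluster.local, svc.cluster.local and cluster.local plus options ndots:5 into the pod's /etc/resolv.conf. A simplified model of how a glibc-style resolver expands a name (node-level search domains omitted; candidate_names is a hypothetical helper):

```python
from typing import List


def candidate_names(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Order in which a glibc-style stub resolver tries names (simplified)."""
    if name.endswith("."):
        return [name]  # absolute name: no search-list expansion
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as-is first
    return expanded + [name]


# Search list for a pod in namespace my-ns, assuming cluster domain cluster.local.
SEARCH = ["my-ns.svc.cluster.local", "svc.cluster.local", "cluster.local"]

if __name__ == "__main__":
    for short in ["my-service", "my-service.my-ns"]:
        print(short, "->", candidate_names(short, SEARCH))
```

This expansion only happens when the client resolves the name itself. A request sent to the proxy carries the short name unchanged, and the proxy has no way to resolve it.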

Our cluster has been behaving as described above, until we enabled the http proxy feature. Now, anything that isn't caught by no_proxy is forwarded to the proxy. my-service, my-service.my-ns gets forwarded to the proxy. An attempt to reference the service by any name gets forwarded to the proxy. We have added svc.cluster.local to the cluster level no_proxy and so now, referencing the service by fqdn works, anything other than this fails.

I notice that the cluster is adding the service CIDR range and node CIDR range to the no_proxy variable; this does not seem to have an effect on excluding service addresses from the proxy.
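A likely reason the CIDR entries have no effect: in common implementations they are only compared against hosts that are already IP literals, and a hostname is not resolved just to test it against them. Go's proxy handling (golang.org/x/net/http/httpproxy) works this way. A simplified model, where 10.0.0.0/16 is only an example service CIDR and bypasses_proxy_by_cidr is a hypothetical helper:

```python
import ipaddress
from typing import Iterable


def bypasses_proxy_by_cidr(host: str, no_proxy: Iterable[str]) -> bool:
    """Model of CIDR handling in no_proxy: only IP-literal hosts are compared."""
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # a name like "svc-name" is never matched against CIDRs
    for entry in no_proxy:
        try:
            if addr in ipaddress.ip_network(entry.strip(), strict=False):
                return True
        except ValueError:
            continue  # not an IP/CIDR entry (e.g. a domain suffix)
    return False


NO_PROXY = ["localhost", "10.0.0.0/16", "svc.cluster.local"]

if __name__ == "__main__":
    for host in ["10.0.12.34", "svc-name"]:
        print(host, bypasses_proxy_by_cidr(host, NO_PROXY))
```

So a request to the service's ClusterIP would bypass the proxy, while a request to the same service by short name would not, even though both end up at the same address.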

This, I feel, breaks a fundamental aspect of service DNS behaviour. So, if I am correct, it could be highlighted in the upcoming documentation, and preferably MS could come up with some awesome enhancement so that the behaviour remains the same with the HTTP proxy enabled.

Hopefully you are able to at least reproduce this in your labs so I can be sure that this isn't just a local issue and my understanding of the situation is correct. -thanks.

alexeldeib commented 2 years ago

Yes, I follow the issue.

my-service, my-service.my-ns gets forwarded to the proxy

right, you can override this per service, but it's very manual. You could also write a webhook to automatically inject these into no_proxy. That might be the route we investigate for AKS, but it's going to get a little dicey since we don't know everything about users' services; we would need to filter out e.g. ExternalName services, etc.
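A sketch of just the name-generation part of such a webhook, under stated assumptions: the Service dataclass is a hypothetical stand-in for the Kubernetes object, the cluster domain is assumed to be cluster.local, and ExternalName services are skipped. The output is shaped like container env entries (name/value pairs) so it could feed the per-pod NO_PROXY override.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Service:
    # Hypothetical minimal view of a Kubernetes Service object.
    name: str
    namespace: str
    svc_type: str = "ClusterIP"


def service_no_proxy_entries(services: List[Service],
                             cluster_domain: str = "cluster.local") -> List[str]:
    """Every name form a client might use to reach each in-cluster Service."""
    entries: List[str] = []
    for svc in services:
        if svc.svc_type == "ExternalName":
            continue  # points outside the cluster; may legitimately need the proxy
        for name in (svc.name,
                     f"{svc.name}.{svc.namespace}",
                     f"{svc.name}.{svc.namespace}.svc",
                     f"{svc.name}.{svc.namespace}.svc.{cluster_domain}"):
            if name not in entries:
                entries.append(name)
    return entries


def no_proxy_env(base: List[str], services: List[Service]) -> List[Dict[str, str]]:
    """Container env entries overriding NO_PROXY with the generated names."""
    value = ",".join(base + service_no_proxy_entries(services))
    # Set both spellings; clients differ in which one they read.
    return [{"name": "NO_PROXY", "value": value},
            {"name": "no_proxy", "value": value}]
```

Keeping this list current as services come and go is the tricky part the comment alludes to; the sketch only shows what would be injected.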

agreed we need to document this case better either way

hakabo commented 2 years ago

Fantastic, I appreciate your responses on this. If I make headway on the proxyalias I'll come back and post. Thanks.