Azure / kubelogin

A Kubernetes credential (exec) plugin implementing azure authentication
https://azure.github.io/kubelogin/
MIT License
477 stars 88 forks source link

The kubectl commands become significant slower when using kubelogin #102

Closed JiayangZhou closed 1 year ago

JiayangZhou commented 2 years ago

In AKS, we started to get warnings such as below: image

So we adopted kubelogin by running the below commands: kubelogin convert-kubeconfig -l azurecli, which is non-interactively login to AKS. And then every time we run a kubectl command, we can feel a delayed response time. The issue is gone if do the normal way of refreshing AKS token via az aks get-credentials.

The kubeconfig file looks something like this when the issue is present:

- name: clusterUser_clustername
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - get-token
      - --server-id
      - XXXXX
      - --login
      - azurecli
      command: kubelogin
      env: null
      provideClusterInfo: false

This has been reported by many of our developers, but everybody uses the same AKS cluster, we will also initiate an Azure support ticket and see if it has something to do on cluster level, and also try in different clusters. But feel this is more related to kubelogin.

JiayangZhou commented 2 years ago

Ran a test with this bash script:

start_time=$(date +%s)
# perform a task
kubectl get ns
kubectl get pods
kubectl get pv
kubectl get ingress -A
kubectl config view
kubectl get secret
kubectl get jobs
kubectl get deployment

end_time=$(date +%s)

# elapsed time with second resolution
elapsed=$(( end_time - start_time ))
echo "==================================="
echo "The time elapsed is:" $elapsed"s"

and the result is:

dev aks + non-interactive | 12 | 33 | 11 | 10 -- | -- | -- | -- | -- dev aks + interactive | 9 | 7 | 9 | 8

tested in a different AKS cluster (1.22.4):

dev02 aks + non-interactive | 36 | 14 | 32 | 9 -- | -- | -- | -- | -- dev02 aks + interactive | 8 | 7 | 7 | 7

also in another aks cluster:

dev03 aks + non-interactive | 32 | 32 | 13 | 32 -- | -- | -- | -- | -- dev03 aks + interactive | 9 | 8 | 8 | 8
weinong commented 2 years ago

@JiayangZhou kubelogin caches token from azurecli, and the first invocation may be slower than the rest. In my testing below, it seems reasonable, albeit my cluster is very empty.

#!/bin/bash

function run () {
start_time=$(date +%s)
$1 > /dev/null 2>&1
end_time=$(date +%s)
elapsed=$(( end_time - start_time ))
echo $1
echo "The time elapsed is:" $elapsed"s"
echo "==================================="
}

# clean up the cache
kubelogin remove-tokens

# perform a task
run "kubectl get ns"
run "kubectl get pods"
run "kubectl get pv"
run "kubectl get ingress -A"
run "kubectl config view"
run "kubectl get secret"
run "kubectl get jobs"
run "kubectl get deployment"

kubectl get ns
The time elapsed is: 3s
===================================
kubectl get pods
The time elapsed is: 0s
===================================
kubectl get pv
The time elapsed is: 0s
===================================
kubectl get ingress -A
The time elapsed is: 1s
===================================
kubectl config view
The time elapsed is: 0s
===================================
kubectl get secret
The time elapsed is: 0s
===================================
kubectl get jobs
The time elapsed is: 0s
===================================
kubectl get deployment
The time elapsed is: 0s
===================================
JiayangZhou commented 2 years ago

Thank you for your responses! I have run your script in our cluster multiple times, and interesting enough the time elapse spikes randomly in some kubectl commands, and such delayed behavior is not consistent:

test0:

$ bash kubectl.sh
kubectl get ns
The time elapsed is: 5s
===================================
kubectl get pods
The time elapsed is: 4s
===================================
kubectl get pv
The time elapsed is: 2s
===================================
kubectl get ingress -A
The time elapsed is: 2s
===================================
kubectl config view
The time elapsed is: 1s
===================================
kubectl get secret
The time elapsed is: 1s
===================================
kubectl get jobs
The time elapsed is: 1s
===================================
kubectl get deployment
The time elapsed is: 2s

test1:

$ bash kubectl.sh
kubectl get ns
The time elapsed is: 5s
===================================

kubectl get pods
The time elapsed is: 24s
===================================
kubectl get pv
The time elapsed is: 1s
===================================
kubectl get ingress -A
The time elapsed is: 1s
===================================
kubectl config view
The time elapsed is: 1s
===================================
kubectl get secret
The time elapsed is: 1s
===================================
kubectl get jobs
The time elapsed is: 2s
===================================
kubectl get deployment
The time elapsed is: 1s
===================================

test2:

$ bash kubectl.sh
kubectl get ns
The time elapsed is: 4s
===================================
kubectl get pods
The time elapsed is: 2s
===================================
kubectl get pv
The time elapsed is: 1s
===================================
kubectl get ingress -A
The time elapsed is: 2s
===================================
kubectl config view
The time elapsed is: 1s
===================================

kubectl get secret
The time elapsed is: 24s
===================================
kubectl get jobs
The time elapsed is: 2s
===================================
kubectl get deployment
The time elapsed is: 1s
anmathew commented 2 years ago

Can you please time kubelogin invocations too?

$ cat wrapper 
#!/bin/bash 

start=$(date +%s)
kubelogin.inner "$@"
end=$(date +%s)
echo "$end Took $((end - start)) seconds for command: |$@|" >> /tmp/log &

replace wrapper as kubelogin:

$ cp `which kubelogin` `which kubelogin`.inner -v 
$ cp wrapper `which kubelogin` -v

Run your kubectl.sh:

$ bash kubectl.sh
kubectl get ns
The time elapsed is: 3s
===================================
kubectl get pods
The time elapsed is: 0s
===================================
kubectl get pv
The time elapsed is: 0s`
...

See report on kubelogin runs:

$ cat /tmp/log
1658168985 Took 0 seconds for command: |remove-tokens|
1658168985 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
1658168987 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
...
weinong commented 2 years ago

@JiayangZhou, you can add -v 5 to get the timestamp of each api call kubectl makes. I'm pretty sure the latency is from these api calls, instead from kubelogin.

JiayangZhou commented 2 years ago

Quoting: https://github.com/Azure/kubelogin/issues/102#issuecomment-1185297464 @weinong

I did another test on the first AKS cluster used in quoted reply, but from the windows subsystem, installed kubelogin via brew, here is some interesting result (running the same script from quoted reply)

dev aks + non-interactive (kubelogin convert-kubeconfig -l azurecli) | 6 | 5 | 5 | 3 | 3 | 4 | 5 | 4 -- | -- | -- | -- | -- | -- | -- | -- | -- dev aks + interactive | 4 | 4 | 4 | 5 | 4 | 4 | 4 | 4

kubelogin version on WSL: 

user@SEKINJZHOU101:~$ kubelogin --version
kubelogin version
git hash: v0.0.15/820e730d56613c14a68b5e831ade4f274b6d1bd2
Go version: go1.17.12
Build time: 2022-07-16T00:07:31Z

kubelogin version on Win11:

 PS C:\Users\jzhou1> kubelogin --version
kubelogin version
git hash: v0.0.13/52e83a071f39e9e039e95aa9a6fbea04855eae13
Go version: go1.17.9
Build time: 2022-04-23T00:49:14Z

seems that it has something to do with type of the OS it is on

JiayangZhou commented 2 years ago

Can you please time kubelogin invocations too?

$ cat wrapper 
#!/bin/bash 

start=$(date +%s)
kubelogin.inner "$@"
end=$(date +%s)
echo "$end Took $((end - start)) seconds for command: |$@|" >> /tmp/log &

replace wrapper as kubelogin:

$ cp `which kubelogin` `which kubelogin`.inner -v 
$ cp wrapper `which kubelogin` -v

Run your kubectl.sh:

$ bash kubectl.sh
kubectl get ns
The time elapsed is: 3s
===================================
kubectl get pods
The time elapsed is: 0s
===================================
kubectl get pv
The time elapsed is: 0s`
...

See report on kubelogin runs:

$ cat /tmp/log
1658168985 Took 0 seconds for command: |remove-tokens|
1658168985 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
1658168987 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
...

@anmathew

I followed your instructions under WSL, not sure how to do it from Windows, but here is the result:

user@SEKINJZHOU101:~$ bash kubectltest2.sh
kubectl get ns
The time elapsed is: 6s
===================================
kubectl get pods
The time elapsed is: 1s
===================================
kubectl get pv
The time elapsed is: 1s
===================================
kubectl get ingress -A
The time elapsed is: 1s
===================================
kubectl config view
The time elapsed is: 0s
===================================
kubectl get secret
The time elapsed is: 1s
===================================
kubectl get jobs
The time elapsed is: 1s
===================================
kubectl get deployment
The time elapsed is: 1s
===================================
user@SEKINJZHOU101:~$ cat /tmp/log
1658224070 Took 0 seconds for command: |remove-tokens|
1658224072 Took 2 seconds for command: |get-token --login azurecli --server-id 6dae42f8-4368-4678-94ff-3960e28e3630|
1658224076 Took 0 seconds for command: |get-token --login azurecli --server-id 6dae42f8-4368-4678-94ff-3960e28e3630|
1658224077 Took 0 seconds for command: |get-token --login azurecli --server-id 6dae42f8-4368-4678-94ff-3960e28e3630|
1658224078 Took 0 seconds for command: |get-token --login azurecli --server-id 6dae42f8-4368-4678-94ff-3960e28e3630|
1658224079 Took 0 seconds for command: |get-token --login azurecli --server-id 6dae42f8-4368-4678-94ff-3960e28e3630|
1658224081 Took 1 seconds for command: |get-token --login azurecli --server-id 6dae42f8-4368-4678-94ff-3960e28e3630|
1658224081 Took 0 seconds for command: |get-token --login azurecli --server-id 6dae42f8-4368-4678-94ff-3960e28e3630|

cannot replicate the latency issue from WSL

JiayangZhou commented 2 years ago

Quoting: https://github.com/Azure/kubelogin/issues/102#issuecomment-1185297464

I updated the kubelogin release version used in Win11 to the latest via here https://github.com/Azure/kubelogin/releases

and run some tests in the same dev cluster in quoted reply:

dev aks + non-interactive (kubelogin convert-kubeconfig -l azurecli) | 7 | 19 | 11 | 31 | 7 | 10 | 7 | 7 | 11 | 17 | 17 -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- dev aks + interactive | 6 | 6 | 6 | 6 | 13 | 10 | 9 | 8 | 7 | 7 | 8 ``` PS C:\Users\jzhou1> kubelogin --version kubelogin version git hash: v0.0.16/b61a6ef2b83d81e57b5ecc15d384b9da367f365e Go version: go1.17.12 Build time: 2022-07-18T20:32:10Z ``` non-kubelogin still performs better
anmathew commented 2 years ago

Can we try to tell how many seconds it took for kubectl and for kubelogin?

change your run() from https://github.com/Azure/kubelogin/issues/102#issuecomment-1186021812 to:

function run () {
start_time=$(date +%s)
echo "-- kubectl start: $start_time cmd: |$1|" >> /tmp/log &
$1 > /dev/null 2>&1 
end_time=$(date +%s)
elapsed=$(( end_time - start_time ))
echo $1
echo "-- kubectl end: $end_time duration: $elapsed cmd: |$1|" >> /tmp/log &
}

run with wrapper in place,

rm /tmp/log
./kubectl.sh

looking for an output like:

1658256006 Took 0 seconds for command: |remove-tokens|
-- kubectl start: 1658256006 cmd: |kubectl get ns|
1658256006 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 3 cmd: |kubectl get ns|
-- kubectl start: 1658256009 cmd: |kubectl get nodes -A|
1658256009 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 0 cmd: |kubectl get nodes -A|
-- kubectl start: 1658256009 cmd: |kubectl get pods|
1658256009 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 0 cmd: |kubectl get pods|
-- kubectl start: 1658256009 cmd: |kubectl get pv|
1658256010 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256010 duration: 1 cmd: |kubectl get pv|
-- kubectl start: 1658256006 cmd: |kubectl get ns|
1658256006 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 3 cmd: |kubectl get ns|

In the above example, most of the time (3 seconds) was from kubectl. I am hoping to see similar details (time elapsed for both kubectl and kubelogin) with your runs like:

kubectl get secret
The time elapsed is: 24s
===================================
weinong commented 2 years ago

@JiayangZhou, is slowness observed on WSL2, WSL1, or Windows 11?

JiayangZhou commented 2 years ago

@JiayangZhou, is slowness observed on WSL2, WSL1, or Windows 11?

Hi, @weinong it’s only observed on Win11 for us. On WSL2 the kubectl commands via kubelogin vs non-kubelogin took more or less the same time, see the test result here https://github.com/Azure/kubelogin/issues/102#issuecomment-1188835254, not tested on WSL1.

JiayangZhou commented 2 years ago

Can we try to tell how many seconds it took for kubectl and for kubelogin?

change your run() from #102 (comment) to:

function run () {
start_time=$(date +%s)
echo "-- kubectl start: $start_time cmd: |$1|" >> /tmp/log &
$1 > /dev/null 2>&1 
end_time=$(date +%s)
elapsed=$(( end_time - start_time ))
echo $1
echo "-- kubectl end: $end_time duration: $elapsed cmd: |$1|" >> /tmp/log &
}

run with wrapper in place,

rm /tmp/log
./kubectl.sh

looking for an output like:

1658256006 Took 0 seconds for command: |remove-tokens|
-- kubectl start: 1658256006 cmd: |kubectl get ns|
1658256006 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 3 cmd: |kubectl get ns|
-- kubectl start: 1658256009 cmd: |kubectl get nodes -A|
1658256009 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 0 cmd: |kubectl get nodes -A|
-- kubectl start: 1658256009 cmd: |kubectl get pods|
1658256009 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 0 cmd: |kubectl get pods|
-- kubectl start: 1658256009 cmd: |kubectl get pv|
1658256010 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256010 duration: 1 cmd: |kubectl get pv|
-- kubectl start: 1658256006 cmd: |kubectl get ns|
1658256006 Took 0 seconds for command: |get-token --login azurecli --server-id redacted|
-- kubectl end: 1658256009 duration: 3 cmd: |kubectl get ns|

In the above example, most of the time (3 seconds) was from kubectl. I am hoping to see similar details (time elapsed for both kubectl and kubelogin) with your runs like:

kubectl get secret
The time elapsed is: 24s
===================================

Hi @anmathew The delayed response is not presented on a Linux-like system, any way to tell how many seconds it took for kubectl and for kubelogin on Windows system?

anmathew commented 2 years ago

@JiayangZhou perhaps give https://github.com/anmathew/timeIt a whirl and see if it helps you to measure your execution times?

JiayangZhou commented 2 years ago

Thank you @anmathew tried timeit but encountered an issue on Windows https://github.com/anmathew/timeIt/issues/2

JiayangZhou commented 1 year ago

Hi, I have done some more expermients where I executed a series of kubectl commands in a script in each scenario down below and noted down the processing time.


OS: WIN11

  1. Without kubelogin, az aks get-credentials --name <cluster_name> --resource-group <rg_name>, around 7 seconds on average to finish.
  2. With kubelogin, kubelogin convert-kubeconfig -l azurecli about 30 seconds taken to finish the commands. Refer to the guide from this link
  3. With kubelogin, kubelogin convert-kubeconfig -l spn --tenant-id <tenant_id> --client-id <app_id> --client-secret <secret_content> about 35 seconds on average taken to finish the commands. Refer to the guide from this link

    OS: WSL2

  4. Without kubelogin, az aks get-credentials --name <cluster_name> --resource-group <rg_name>, around 4 seconds on average to finish.
  5. With kubelogin, kubelogin convert-kubeconfig -l azurecli about 4 seconds taken to finish the commands. Refer to the guide from this link
  6. With kubelogin, kubelogin convert-kubeconfig -l spn --tenant-id <tenant_id> --client-id <app_id> --client-secret <secret_content> about 9 seconds on average taken to finish the commands. Refer to the guide from this link
kpkool commented 1 year ago

@JiayangZhou

Can you please provide version for following please ?

  1. kubectl version --client
  2. kubelogin –version
  3. az version

-thanks, KP

JiayangZhou commented 1 year ago

Hi @kpkool

  1. kubectl version --client Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.2", GitCommit:"8b5a19147530eaac9476b0ab82980b4088bbc1b2", GitTreeState:"clean", BuildDate:"2021-09-15T21:38:50Z", GoVersion:"go1.16.8", Compiler:"gc", Platform:"windows/amd64"}
  2. kubelogin --version kubelogin version git hash: v0.0.16/b61a6ef2b83d81e57b5ecc15d384b9da367f365e Go version: go1.17.12 Build time: 2022-07-18T20:32:10Z
  3. az version { "azure-cli": "2.39.0", "azure-cli-core": "2.39.0", "azure-cli-telemetry": "1.0.6", "extensions": {} }

Thanks for following up!

kpkool commented 1 year ago

@JiayangZhou ,

I would like to apologize for the late response, please rest assured that we are still working from our end to get a proper fix for this issue. That being said, these are the recommendations we get from the engineering team:

1) Please do not use kubelogin v0.0.16 which is it has been reported to have multiple issues. Please upgrade kubelogin to the latest version. ([Releases · Azure/kubelogin (github.com)]

2) The token is cached in .kube\cache\kubelogin\ of the home directory. Can you please check whether the cached token exists?

I will be pending on your feedback.

-KP

JiayangZhou commented 1 year ago

Thank you @kpkool for following up.

We are using the latest kubelogin version and our team is still experiencing significant slowness on Windows. Plus, we have upgraded our AKS to 1.24.X from 1.22.X

There is a kubelogin folder under .kube\cache, some people don't have a kubelogin folder, and some people do have kubelogin folder with a AzurePublicCloud token.

We run a couple of kubelogin commands and capture the time elapsed, it varies between 35-37 secs for windows users, but only 3-4 secs for WSL user.

AndersClausen commented 1 year ago

Any update on this issue? We're on Windows 10 and use the latest versions of everything else and it's as slow as @JiayangZhou describes it.

Eneuman commented 1 year ago

We have the same problem

weinong commented 1 year ago

@Eneuman @AndersClausen what kind of login mode do you use? is it in WSL or not? what kind of Windows OS? most of all, can you confirm it's slow by running kubelogin get-token ...

AndersClausen commented 1 year ago

Hi @weinong I'm using the Azure CLI Windows 10 Enterprise (64-bit). I am not using WSL.

With regards to the kubelogin get-token command - I'm not familiar with it, so could you please provide all of the input it needs and I'll run it. Cheers

abhilash-pulluri commented 1 year ago

Hi @weinong, we too have the same issue.

AndersClausen commented 1 year ago

Just a quick update to say that kubelogin works well through Ubuntu in WSL2 (but still weird that it's not working great in normal Windows mode).

Tatsinnit commented 1 year ago

💡☕️❤️ Hiya All, and thank you for this discussion\thoughts and ideas, 🙏❤️ did some windows only test and sharing the code for timing it, (if it comes handy) - Sharing this as a building block if anyone find anything concrete and reproducible please.

I am not able to replicate any huge difference between Kubelogin non-interactive mode on my windows machine, I do wonder if users should try after running kubelogin remove-tokens 🤷🏽‍♂️ given my cluster is not busy I don't see massive difference but I wonder if there is any consistent repro anyone has? (Or is it case of just OAuth taking time to validate for some - just thinking out loud)

The Time code I used for Windows machine is as willows and below are some results which are not bad:

OS Info:

MicrosoftTeams-image (14)

Windows timing.bat file for help

Saved below as timecmd.bat and some output below:

@echo off
@setlocal

set start=%time%

:: Runs your command
cmd /c %*

set end=%time%
set options="tokens=1-4 delims=:.,"
for /f %options% %%a in ("%start%") do set start_h=%%a&set /a start_m=100%%b %% 100&set /a start_s=100%%c %% 100&set /a start_ms=100%%d %% 100
for /f %options% %%a in ("%end%") do set end_h=%%a&set /a end_m=100%%b %% 100&set /a end_s=100%%c %% 100&set /a end_ms=100%%d %% 100

set /a hours=%end_h%-%start_h%
set /a mins=%end_m%-%start_m%
set /a secs=%end_s%-%start_s%
set /a ms=%end_ms%-%start_ms%
if %ms% lss 0 set /a secs = %secs% - 1 & set /a ms = 100%ms%
if %secs% lss 0 set /a mins = %mins% - 1 & set /a secs = 60%secs%
if %mins% lss 0 set /a hours = %hours% - 1 & set /a mins = 60%mins%
if %hours% lss 0 set /a hours = 24%hours%
if 1%ms% lss 100 set ms=0%ms%

:: Mission accomplished
set /a totalsecs = %hours%*3600 + %mins%*60 + %secs%
echo command took %hours%:%mins%:%secs%.%ms% (%totalsecs%.%ms%s total)

outcome:

I used one cluster without AAD and one With AAD.

If I run it again the command time was faster (shave off second more).

image

On second run on same cluster:

MicrosoftTeams-image (17)

Thanks heaps,

RagingTonberry commented 1 year ago

I have observed this behavior consistently while using kubelogin since last year (after also noting the deprecation of the other plugin) on Windows 10/11. In my experience, it's always the kubelogin executable with some sort of expensive call or system interaction (maybe defender?) which occurs before it does any actual work of its own and has a periodicity of around 30s:

PS> while ($true) { (Measure-Command {kubelogin --version}).Seconds; Sleep -Seconds 1 }
7
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
7
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
7
0
0
0
PS>

And again with a sleep of 15:

PS> while ($true) { (Measure-Command {kubelogin --version}).Seconds; Sleep -Seconds 15 }
7
0
7
0
7
0
PS>

I haven't been able to figure out what is causing this every 30s or so and it's driving me nuts. Using 0.0.27 at the moment haven't gotten .28 but this behavior has been unchanged since last year and all subsequent versions.

Given others have reported no such issues under WSL or on linux it seems to suggest it could be related to a Windows interaction rather than the executable itself?

jeroenlandheer commented 1 year ago

I've got the same issues here on Win11. FWIW, adding an exclusion to defender does not help.

RagingTonberry commented 1 year ago

I've got the same issues here on Win11. FWIW, adding an exclusion to defender does not help.

Ah, I did fail to mention that I tried this too. That was my first guess, but yeah, no dice :(

peterbom commented 1 year ago

From this (and from testing on my machine), it looks as if .kube\cache is not used for caching tokens when using the azurecli login type.

Instead, it calls az account get-access-token -o json --resource 6dae42f8-4368-4678-94ff-3960e28e3630. If you run this command repeatedly, is it slow? It's not for me, but I'm using a newer az version (2.48.1).

The az CLI has its own caching (for me this seems to be .azure\msal_token_cache.bin), so it shouldn't be needing to do the whole authentication dance each time.

weinong commented 1 year ago

kubelogin does different things based on login mode. like @peterbom said, when using -l azurecli it invokes az cli. So when reporting slowness, please be specific about the login mode used

peterbom commented 1 year ago

please be specific about the login mode used

I think @JiayangZhou specified that the login mode is azurecli when opening the issue.

peterbom commented 1 year ago

Based on my speculation that it's something to do with the az CLI cache, not the kubelogin one, I tried the following command on Windows 11 using CLI v2.39.0 (which @JiayangZhou was using):

az account get-access-token -o json --resource 6dae42f8-4368-4678-94ff-3960e28e3630

I was hoping this would reproduce the slow behaviour, but it didn't for me (the kubelogin/kubectl commands all run reasonably fast too so there's something different with my setup).

@JiayangZhou, @AndersClausen, @Eneuman, @abhilash-pulluri, @jeroenlandheer, @RagingTonberry: would you be able to check whether the above command is also slow for you, and if so which az CLI version you're running?

stevehipwell commented 1 year ago

I'm seeing this too on Windows 11, usually with a kubectl command taking about 25s to run. Some commands see this issue more often than others but AFAIK it happens to all command.

Tatsinnit commented 1 year ago

Based on my speculation that it's something to do with the az CLI cache, not the kubelogin one, I tried the following command on Windows 11 using CLI v2.39.0 (which @JiayangZhou was using):

az account get-access-token -o json --resource 6dae42f8-4368-4678-94ff-3960e28e3630

I was hoping this would reproduce the slow behaviour, but it didn't for me (the kubelogin/kubectl commands all run reasonably fast too so there's something different with my setup).

@JiayangZhou, @AndersClausen, @Eneuman, @abhilash-pulluri, @jeroenlandheer, @RagingTonberry: would you be able to check whether the above command is also slow for you, and if so which az CLI version you're running?

💡 Thanks @peterbom , and thanks everyone who are involved, this is super interesting, so I wonder if this is kubelogin problem at all?

‼️ if anyone can tell us whether the slowness can be reproduced using this command (which kubelogin is calling to get the access token for the Kubernetes service AAD server), that would really help us narrow down where the issue is.

az account get-access-token -o json --resource 6dae42f8-4368-4678-94ff-3960e28e3630

💡 How can we consistently replicate this, if anyone has consistent replication for this please?

Prima facie, the more we dig into seems like login mode aka could be azcli is the issue but we cannot replicate this for sure at the moment. Looking forward for the folks here think 💭 thank you so much in advance, and fyi and cc: to @weinong + @peterbom

RagingTonberry commented 1 year ago

For me, I don't see any issues with az which i used with some frequency.

PS> Measure-Command { az account get-access-token -o json --resource 6dae42f8-4368-4678-94ff-3960e28e3630 }

*snip*
Seconds           : 2
Milliseconds      : 176
*snip*

But I still see periodic delays with kubelogin as I mentioned above:

PS> Measure-Command { kubelogin -v }
Error: flag needs an argument: 'v' in -v
*snip*
Seconds           : 7
Milliseconds      : 373
*snip*

PS> Measure-Command { kubelogin -v }
Error: flag needs an argument: 'v' in -v
*snip*
Seconds           : 0
Milliseconds      : 74
*snip*

Given that I didn't even give it a valid command line still seems to me like it's -something- in kubelogin land. I would imagine (having not looked at the code) that supplying an invalid commandline should lead to it dropping straight out and not doing any useful processing, like relying on a calls to the az cli, or azure itself.

Edit: I see the same behavior when I supply --version. A 7 second delay roughly every 30s. Still really feels like smart screen or something like that to me personally. Might be the case in my instance.

peterbom commented 1 year ago

Thanks @RagingTonberry - That helped me realize I can reproduce it here. I get a smaller (just under 3s) delay every 30s. I had a look using Process Monitor and I get a "bad network path" event trying to connect to <mydomain>\PIPE\samr. This happens every time I run it, even when it's not slow, so Windows appears to be caching the failure result for 30s.

This network path is apparently the Security Account Manager (SAM) endpoint for RPC over SMB. I don't really understand anything I just wrote, but it's apparently used for enumerating domain user information.

There are a couple of similar issues in Docker for Windows (#1936 and #2131), and a nasty workaround involving adding a hosts file entry for the domain name to 127.0.0.1 (which worked for me), but I still don't know why it would be trying to reach that endpoint.

@weinong - is there anything in the kubelogin code that might be calling a SAM library to retrieve user info, even when just running kubelogin --version?

RagingTonberry commented 1 year ago

That might go some way to explaining my presentation then, since I'm a remote worker with a domain joined machine and I'm rarely VPN'd (and even then there's some pretty decent delays because yay SMB).

weinong commented 1 year ago

@peterbom do you mean kubelogin --version in particular, or with any flag like kubelogin --foobar?

weinong commented 1 year ago

either way, kubelogin doesn't do anything besides providing the version string in the root command. So I guess it's coming from Cobra/spf13?

peterbom commented 1 year ago

@weinong: Not that specific command, no - it's anything. Actually it appears to be klog, not Cobra. In its initialization code it's calling user.Current(), which ends up hitting that endpoint on Windows. Looks like that's changed in v2 of that library. I'll create a PR.

weinong commented 1 year ago

excellent find!

weinong commented 1 year ago

https://github.com/Azure/kubelogin/releases/tag/v0.0.31 is out. please give it a try!

RagingTonberry commented 1 year ago

That seems to have solved my presentation;

PS > while ($true) { (Measure-Command { .\kubelogin.exe --version }).TotalSeconds ; Start-Sleep -Seconds 10 }
0.0415748
0.0390878
0.0352182
0.0364319
0.0398296
0.0449192

And since I first noticed it from kubectl:

PS > while ($true) { (Measure-Command { k get namespace }).TotalSeconds ; Start-Sleep -Seconds 10 }
1.4390334
0.2734661
0.3758186
0.2577303
0.2748371
0.2721687
0.2929428

Clearly a token/auth of some sort at the start, and then as expected.

Great work, everyone!

Tatsinnit commented 1 year ago

That seems to have solved my presentation;

PS > while ($true) { (Measure-Command { .\kubelogin.exe --version }).TotalSeconds ; Start-Sleep -Seconds 10 }
0.0415748
0.0390878
0.0352182
0.0364319
0.0398296
0.0449192

And since I first noticed it from kubectl:

PS > while ($true) { (Measure-Command { k get namespace }).TotalSeconds ; Start-Sleep -Seconds 10 }
1.4390334
0.2734661
0.3758186
0.2577303
0.2748371
0.2721687
0.2929428

Clearly a token/auth of some sort at the start, and then as expected.

Great work, everyone!

Beautiful and thank you so much for double checking!! Huge congrats to all who contributed and super awesomeness of everyone in this thread and @peterbom ❤️🎉 and @weinong ❤️🎉☕️

Shall we close this now.

JiayangZhou commented 1 year ago

Thank you for the finding and fix! It has resolved the slowness seen in my tests too. Did a comparison between the latest kubelogin release containing this fix and previous kubelogin without the fix, the result looks very promising and steady, (note it is just a simple script to calculate the timespan of multiple kubectl commands)

kubelogin version, Platform: Windows/amd64 | time spent in sec | time spent in sec | time spent  in sec | time spent  in sec | time spent  in sec | time spent in sec | time spent in sec -- | -- | -- | -- | -- | -- | -- | -- after GitHub fix(kubelogin version  v0.0.31) | 14 | 13 | 12 | 13 | 14 | 13 | 13 before GitHub fix(kubelogin version  v0.0.30) | 37 | 35 | 13 | 16 | 36 | 13 | 36