Azure / azure-cli

Azure Command-Line Interface
MIT License
4.02k stars 2.99k forks source link

[BUG] AzureCliCredential fails intermittently on devops pipelines with CredentialUnavailableError because of timed out after 10 seconds #20404

Open harneetvirk opened 2 years ago

harneetvirk commented 2 years ago

Describe the bug AzureCliCredential fails intermittently on devops pipelines with CredentialUnavailableError because of timed out after 10 seconds, and resulting in the build failures.

'['cmd', '/c', 'az account get-access-token --output json --resource https://management.azure.com']' timed out after 10 seconds

To Reproduce Intermittent CredentialUnavailableError in DevOps pipelines.

Expected behavior Some random Az CLI tasks should not fail with an intermittent CredentialUnavailableError

Environment summary [command]C:\Windows\system32\cmd.exe /D /S /C ""C:\Program Files (x86)\Microsoft SDKs\Azure\CLI2\wbin\az.cmd" --version"

azure-cli 2.30.0 Please let us know how we are doing: https://aka.ms/azureclihats

and let us know if you're interested in trying out our newest features: https://aka.ms/CLIUXstudy core 2.30.0 telemetry 1.0.6

Extensions: azure-devops 0.22.0

Additional context Logs from build agent: 2021-11-11T07:56:03.1173758Z 2021-11-11T07:56:03.1174605Z self = <azure.identity._credentials.chained.ChainedTokenCredential object at 0x0000023C52710BA8> 2021-11-11T07:56:03.1175442Z scopes = ('https://management.azure.com/.default',), kwargs = {} 2021-11-11T07:56:03.1176625Z history = [(<azure.identity._credentials.azure_cli.AzureCliCredential object at 0x0000023C5275C550>, 'Failed to invoke the Azure...veBrowserCredential object at 0x0000023C52732828>, 'Timed out after waiting 300 seconds for the user to authenticate')] 2021-11-11T07:56:03.1177864Z credential = <azure.identity._credentials.browser.InteractiveBrowserCredential object at 0x0000023C52732828> 2021-11-11T07:56:03.1179086Z attempts = '\nAttempted credentials:\n\tAzureCliCredential: Failed to invoke the Azure CLI\n\tInteractiveBrowserCredential: Timed out after waiting 300 seconds for the user to authenticate' 2021-11-11T07:56:03.1180830Z message = 'ChainedTokenCredential failed to retrieve a token from the included credentials.\nAttempted credentials:\n\tAzureCliC...er to the troubleshooting guidelines here at https://aka.ms/azsdk/python/identity/defaultazurecredential/troubleshoot.'

2021-11-11T07:56:03.1426711Z ------------------------------ Captured log call ------------------------------ 2021-11-11T07:56:03.1427773Z WARNING  azure.identity._internal.decorators:decorators.py:41 AzureCliCredential.get_token failed: Failed to invoke the Azure CLI 2021-11-11T07:56:03.1428786Z Traceback (most recent call last): 2021-11-11T07:56:03.1429550Z File "c:\miniconda\envs\51333213\lib\subprocess.py", line 425, in run 2021-11-11T07:56:03.1430721Z stdout, stderr = process.communicate(input, timeout=timeout) 2021-11-11T07:56:03.1431643Z File "c:\miniconda\envs\51333213\lib\subprocess.py", line 863, in communicate 2021-11-11T07:56:03.1432446Z stdout, stderr = self._communicate(input, endtime, timeout) 2021-11-11T07:56:03.1433238Z File "c:\miniconda\envs\51333213\lib\subprocess.py", line 1114, in _communicate 2021-11-11T07:56:03.1434006Z raise TimeoutExpired(self.args, orig_timeout) 2021-11-11T07:56:03.1435013Z subprocess.TimeoutExpired: Command '['cmd', '/c', 'az account get-access-token --output json --resource https://management.azure.com']' timed out after 10 seconds 2021-11-11T07:56:03.1436106Z During handling of the above exception, another exception occurred: 2021-11-11T07:56:03.1436870Z Traceback (most recent call last): 2021-11-11T07:56:03.1437705Z File "c:\miniconda\envs\51333213\lib\site-packages\azure\identity\_credentials\azure_cli.py", line 141, in _run_command 2021-11-11T07:56:03.1438575Z return subprocess.check_output(args, **kwargs) 2021-11-11T07:56:03.1439366Z File "c:\miniconda\envs\51333213\lib\subprocess.py", line 356, in check_output 2021-11-11T07:56:03.1440070Z **kwargs).stdout 2021-11-11T07:56:03.1440895Z File "c:\miniconda\envs\51333213\lib\subprocess.py", line 430, in run 2021-11-11T07:56:03.1441595Z stderr=stderr) 2021-11-11T07:56:03.1442515Z subprocess.TimeoutExpired: Command '['cmd', '/c', 'az account get-access-token --output json --resource https://management.azure.com']' timed out after 10 seconds 2021-11-11T07:56:03.1443608Z The above exception was the direct cause of the following exception: 2021-11-11T07:56:03.1444363Z Traceback (most recent call last): 2021-11-11T07:56:03.1445190Z File "c:\miniconda\envs\51333213\lib\site-packages\azure\identity\_internal\decorators.py", line 30, in wrapper 2021-11-11T07:56:03.1445991Z token = fn(*args, **kwargs) 2021-11-11T07:56:03.1446820Z File "c:\miniconda\envs\51333213\lib\site-packages\azure\identity\_credentials\azure_cli.py", line 73, in get_token 2021-11-11T07:56:03.1447630Z output = _run_command(command) 2021-11-11T07:56:03.1448621Z File "c:\miniconda\envs\51333213\lib\site-packages\azure\identity\_credentials\azure_cli.py", line 162, in _run_command 2021-11-11T07:56:03.1449429Z six.raise_from(error, ex) 2021-11-11T07:56:03.1450057Z File "", line 3, in raise_from 2021-11-11T07:56:03.1450965Z azure.identity._exceptions.CredentialUnavailableError: Failed to invoke the Azure CLI 2021-11-11T07:56:03.1452689Z INFO  azure.core.pipeline.policies.http_logging_policy:_universal.py:475 Request URL: 'https://login.microsoftonline.com/organizations/v2.0/.well-known/openid-configuration'/nRequest method: 'GET'/nRequest headers:/n 'User-Agent': 'azsdk-python-identity/1.7.1 Python/3.6.8 (Windows-10-10.0.17763-SP0) VSTS_8b119ea1-2e2a-4839-8db7-8c9e8d50f6fa_build_13339_0'/nNo body was attached to the request 2021-11-11T07:56:03.1454520Z DEBUG  urllib3.connectionpool:connectionpool.py:975 Starting new HTTPS connection (1): login.microsoftonline.com:443 2021-11-11T07:56:03.1456088Z DEBUG  urllib3.connectionpool:connectionpool.py:461 https://login.microsoftonline.com:443 "GET /organizations/v2.0/.well-known/openid-configuration HTTP/1.1" 200 1589 2021-11-11T07:56:03.1458835Z INFO  azure.core.pipeline.policies.http_logging_policy:_universal.py:503 Response status: 200/nResponse headers:/n 'Cache-Control': 'max-age=86400, private'/n 'Content-Type': 'application/json; charset=utf-8'/n 'Strict-Transport-Security': 'REDACTED'/n 'X-Content-Type-Options': 'REDACTED'/n 'Access-Control-Allow-Origin': 'REDACTED'/n 'Access-Control-Allow-Methods': 'REDACTED'/n 'P3P': 'REDACTED'/n 'x-ms-request-id': 'cef116c9-1073-4ad3-861c-b643b8245d00'/n 'x-ms-ests-server': 'REDACTED'/n 'Set-Cookie': 'REDACTED'/n 'Date': 'Thu, 11 Nov 2021 07:43:14 GMT'/n 'Content-Leng
yonzhan commented 2 years ago

@jiasli for awareness

jiasli commented 2 years ago

The log seems to be trimmed. Could you share the full log please?

How long does it usually take to call

az account get-access-token --output json --resource https://management.azure.com

?

harneetvirk commented 2 years ago

@jiasli : I have shared the link of the logs to you on Teams.

yatamlite commented 2 years ago

We are running into this everyday since we've upgraded to 2.30.0. Like the submitter pointed out, this is wrecking our CI/CD pipeline jobs.

Environment : RHEL 7/8.

$ time az account get-access-token --output json --resource https://management.azure.com/
{
 ...
}

real    0m13.078s

For temporary relief, is there a way to increase the timeout for getting the token?

katekimani commented 2 years ago

@yatamlite I got around this by adding an override to the installed package code on the Docker image I use it in.

  1. First confirmed that the file only has 1 instance of the timeout value in the code (annoying its not set as a variable in v2.29.2) cat /usr/local/lib/python3.7/dist-packages/azure/identity/_credentials/azure_cli.py | grep -c 10

  2. Used sed to replace it with my desired timeout of 30secs RUN sed -i 's/10/30' /usr/local/lib/python3.7/dist-packages/azure/identity/_credentials/azure_cli.py

YingChen1996 commented 1 year ago

@jiasli We're running into the issue in our runner and file ticket. Is there ETA to fix this bug? {'azure-ai-ml': '1.4.0', 'mldesigner': '0.1.0b11', 'az': '{\n  \"azure-cli\": \"2.45.0\",\n  \"azure-cli-core\": \"2.45.0\",\n  \"azure-cli-telemetry\": \"1.0.8\",\n  \"extensions\": {\n    \"databricks\": \"0.9.0\",\n    \"ml\": \"2.14.0\"\n  }\n}\n', 'python': '3.8.13'}"

subprocess.TimeoutExpired: Command '['/bin/sh', '-c', 'az account get-access-token --output json --resource https://management.azure.com']' timed out after 10 seconds

The above exception was the direct cause of the following exception: Traceback (most recent call last): File "/tests/scenarios/sdk-cli-v2/runners/sdk/utils.py", line 45, in log_wrapper res = fun(*args, kwargs) File "/tests/scenarios/sdk-cli-v2/runners/sdk/test_sdk_runner.py", line 157, in test_command_component self.pipeline_submit(pipeline) File "/tests/scenarios/sdk-cli-v2/runners/sdk/test_sdk_runner.py", line 84, in pipeline_submit pipeline_job = self.client.jobs.create_or_update( File "/opt/conda/lib/python3.8/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer return func(*args, *kwargs) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_job_operations.py", line 562, in create_or_update raise ex File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_job_operations.py", line 498, in create_or_update self._resolve_arm_id_or_upload_dependencies(job) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_job_operations.py", line 842, in _resolve_arm_id_or_upload_dependencies self._resolve_arm_id_or_azureml_id(job, self._orchestrators.get_asset_arm_id) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_job_operations.py", line 1064, in _resolve_arm_id_or_azureml_id job = self._resolve_arm_id_for_pipeline_job(job, resolver) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_job_operations.py", line 1179, in _resolve_arm_id_for_pipeline_job self._component_operations._resolve_dependencies_for_pipeline_component_jobs( File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_component_operations.py", line 719, in _resolve_dependencies_for_pipeline_component_jobs layers = self._divide_nodes_to_resolve_into_layers( File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_component_operations.py", line 674, in _divide_nodes_to_resolve_into_layers extra_operation(job_instance, key) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_component_operations.py", line 480, in _try_resolve_environment_for_component component.environment = resolver( File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_operation_orchestrator.py", line 218, in get_asset_arm_id result = self._get_environment_arm_id(asset, register_asset=register_asset) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_operation_orchestrator.py", line 287, in _get_environment_arm_id env_response = self._environments.create_or_update(environment) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_environment_operations.py", line 144, in create_or_update raise ex File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/operations/_environment_operations.py", line 128, in create_or_update else self._version_operations.create_or_update( File "/opt/conda/lib/python3.8/site-packages/azure/core/tracing/decorator.py", line 78, in wrapper_use_tracer return func(args, kwargs) File "/opt/conda/lib/python3.8/site-packages/azure/ai/ml/_restclient/v2022_05_01/operations/_environment_versions_operations.py", line 510, in create_or_update pipeline_response = self._client._pipeline.run(request, stream=False, kwargs) File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 205, in run return first_node.send(pipeline_request) # type: ignore File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 69, in send response = self.next.send(request) File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 69, in send response = self.next.send(request) File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/_base.py", line 69, in send response = self.next.send(request) [Previous line repeated 2 more times] File "/opt/conda/lib/python3.8/site-packages/azure/mgmt/core/policies/_base.py", line 47, in send response = self.next.send(request) File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/policies/_redirect.py", line 160, in send response = self.next.send(request) File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/policies/_retry.py", line 474, in send response = self.next.send(request) File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/policies/_authentication.py", line 115, in send self.on_request(request) File "/opt/conda/lib/python3.8/site-packages/azure/core/pipeline/policies/_authentication.py", line 92, in on_request self._token = self._credential.get_token(self._scopes) File "/opt/conda/lib/python3.8/site-packages/azure/identity/_internal/decorators.py", line 32, in wrapper token = fn(args, kwargs) File "/opt/conda/lib/python3.8/site-packages/azure/identity/_credentials/azure_cli.py", line 80, in get_token output = _run_command(command) File "/opt/conda/lib/python3.8/site-packages/azure/identity/_credentials/azure_cli.py", line 169, in _run_command six.raise_from(error, ex) File "", line 3, in raise_from azure.identity._exceptions.CredentialUnavailableError: Failed to invoke the Azure CLI

alexpovel commented 1 year ago

I am also affected by this. Excuse the Python-specific stuff, I realise we're in the CLI repo!

Consider this example snippet:

import os

from azure.identity.aio import DefaultAzureCredential
from azure.storage.blob.aio import BlobServiceClient

class AzureBlobStorage:
    def __init__(
        self,
        credential: DefaultAzureCredential | None = None,
    ) -> None:
        if credential is None:
            credential = DefaultAzureCredential()
        self.credential = credential

    async def do_things(self) -> None:
        async with self.credential as credential:
            async with BlobServiceClient(
                account_url=os.environ["AZURE_STORAGE_ACCOUNT_URL"],
                credential=credential,
            ) as bsc:
                async for container in bsc.list_containers():
                    print(container)

It's a dummy example for the concept of some class with associated credentials (DefaultAzureCredentials is instantiated only once!), doing some operation. Doesn't really matter. I followed roughly this guide.

But it turns out, instantiating DefaultAzureCredentials doesn't really... do anything, if I see right? It mainly sets up the whole credential chain with all required data, but doesn't call any operation just yet.

As soon as BlobServiceClient requires a token, get_token is called. The class remembers which credential is/was "successful":

https://github.com/Azure/azure-sdk-for-python/blob/cdb4cad5d561c61a569279d51f070c969f8fa23f/sdk/identity/azure-identity/azure/identity/_credentials/default.py#L218-L223

Running this locally, however, the az CLI is used, as is intended, but getting a token from it seems to involve spawning a whole new Python interpreter per get_token invocation? (didn't dig into this too much) It absolutely eats away CPU for what is fundamentally a simple operation, you'd think (if anything, I/O-bound, not CPU heavy):

image

And remembering the credential is helpful, but it doesn't cache the token value. Just the token provider (credential).

I am calling do_things highly concurrently. Nothing in the docs indicate that's a bad idea, but I am starting to think otherwise... I haven't researched this properly, but I can only suspect some of those Python interpreters are put on a diet by the overburdened CPU, timing them out, leading to the issue people already mentioned in this thread. The eventual stack trace is very similar, anyway, ending in:

|   File "/home/alex/project/.venv/lib/python3.11/site-packages/azure/identity/aio/_credentials/azure_cli.py", line 148, in _run_command
|     raise error from ex
| azure.identity._exceptions.CredentialUnavailableError: Failed to execute '/bin/sh'

So, failing to spawn a shell process. It's not an OOM issue, as can be seen from the screenshot, there's plenty left.

All this is just conjecture!

EDIT: I would be happy to cache the token value, but the design around get_token doesn't make that seem like an intended thing for developers to do.

preetesh33 commented 11 months ago

Even I faced similar issue, in long running python script.

error: AzureCliCredential.get_token failed: Failed to invoke the Azure CLI Failed to authenticate, VM test.example.com, error Failed to invoke the Azure CLI