Azure / azure-cli

Azure Command-Line Interface
MIT License

Corruption of `~/.azure` files #9427

Open jlpedrosa opened 5 years ago

jlpedrosa commented 5 years ago

Describe the bug The files inside the .azure folder get corrupted.

To Reproduce It's a race condition, so it's really difficult to reproduce. I was using Terraform, which uses the Azure Go SDK. At the same time I was running az aks list to see when the resource would come online. The az CLI failed with the following message.

az aks list
Failed to load token files. If you have a repro, please log an issue at https://github.com/Azure/azure-cli/issues. At the same time, you can clean up by running 'az account clear' and then 'az login'. (Inner Error: Failed to parse /Users/jopedros/.azure/accessTokens.json with exception:
    Extra data: line 1 column 16946 (char 16945))

Opening the file manually, I could see that the token array was followed by an extra set of closing braces:

}]

Manually correcting the file solved the issue.
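For anyone who wants to script that manual correction, here is a minimal sketch in Python. It assumes the file starts with one complete JSON value and that everything after it is garbage, as in the reports in this thread. The helper name `split_valid_prefix` is made up for illustration and is not part of Azure CLI.

```python
import json


def split_valid_prefix(text):
    """Return (first complete JSON value, whatever text follows it)."""
    text = text.lstrip()
    # raw_decode parses one JSON value from the start of the string and
    # reports where it stopped, instead of failing with "Extra data".
    value, end = json.JSONDecoder().raw_decode(text)
    return value, text[end:]


# Shortened stand-in for a corrupted accessTokens.json (assumed shape).
corrupted = '[{"tokenType": "Bearer"}]7d"}]'
tokens, leftover = split_valid_prefix(corrupted)
```

Writing `json.dumps(tokens)` back to the file would drop the trailing characters. Running 'az account clear' and then 'az login', as the error message suggests, remains the simpler route.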

Environment summary az --version azure-cli 2.0.63 *

osX, bash.

yugangw-msft commented 5 years ago

This is a known issue. In the meantime, the workaround is to use AZURE_CONFIG_DIR to isolate the CLI into its own sandbox.

dominik-lekse commented 5 years ago

I observe the same problem when using Terraform with fetching credentials from Azure CLI on macOS (https://www.terraform.io/docs/providers/azurerm/auth/azure_cli.html).

I do not think this is a problem within the Terraform Azure provider.

yonzhan commented 4 years ago

add to S164.

jpmsilva commented 4 years ago

I think we are also facing this issue. We are using Apache Airflow to orchestrate environments in Azure, and we resort to the az cli to interface with Azure. Because of the nature of the workflows we run, multiple az instances may be triggered, which is desirable, as we want to parallelize as much as possible. Occasionally we will get this error, and the accessTokens.json will become corrupt with extra characters. I think it's because there are multiple writers on the same file, and the last writer writes fewer bytes than the previous one, leaving the tail of the earlier, longer write at the end of the file.
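The suspected mechanism can be reproduced in a few lines of Python. This is only a sketch of the hypothesis, not a trace of what Azure CLI actually does: it assumes both writers open (and truncate) the file before either one writes, and the cache contents are made-up stand-ins.

```python
import json
import os
import tempfile


def simulate_overlapping_writes(path, first, second):
    """Two writers truncate the same file, then write one after the other."""
    writer_a = open(path, "w")
    writer_b = open(path, "w")
    writer_a.write(first)
    writer_a.flush()
    # writer_b still starts at offset 0, so it overwrites only the first
    # len(second) characters and leaves the tail of `first` in place.
    writer_b.write(second)
    writer_b.flush()
    writer_a.close()
    writer_b.close()
    with open(path) as f:
        return f.read()


long_cache = json.dumps([{"resource": "a"}, {"resource": "b"}])
short_cache = json.dumps([{"resource": "a"}])

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "accessTokens.json")
    result = simulate_overlapping_writes(target, long_cache, short_cache)

try:
    json.loads(result)
    error = None
except json.JSONDecodeError as exc:
    error = exc.msg  # same "Extra data" message as in the reports
```

The resulting file is the short cache followed by the leftover tail of the long one, which has the same shape as the corrupted files people have described here.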

For the time being we are working around that by wrapping the az command with the following bash script:

#!/bin/bash
cleanup() {
  EXIT=$?; rm -rf "${WORKING}"; exit ${EXIT}
}
trap cleanup SIGHUP SIGINT SIGQUIT SIGABRT SIGPIPE SIGTERM
WORKING=$(mktemp -d --tmpdir azwrap.XXXXXX) || exit 1
if [ -z "${WORKING}" ]; then exit 1; fi
AZURE_CONFIG_DIR_ORIG=${AZURE_CONFIG_DIR:-~/.azure}
cp -rp "${AZURE_CONFIG_DIR_ORIG}/"* "${WORKING}"
AZURE_CONFIG_DIR="${WORKING}" az "$@"; EXIT=$?
[ -f "${WORKING}/accessTokens.json" ] && mv "${WORKING}/accessTokens.json" "${AZURE_CONFIG_DIR_ORIG}"
rm -rf "${WORKING}"
exit ${EXIT}

Hope it helps others

jkruis commented 4 years ago

I have encountered the same issue twice when running a .NET Core project (targeting netcoreapp3.1) from Visual Studio 2019 (16.4.2) on win10. No Azure orchestration was involved. Maybe this sheds some new light on the issue.

I am authenticating with "az login" (azure-cli 2.0.77) and accessing a keyvault from the .NET Core project. At some seemingly arbitrary point my code stops working (in a manner that suggests expiration of a token), I try to "az login" again and it complains about extra data in (user dir)/.azure/accessTokens.json, same as in the original report. Removing the extra chars fixes the issue.

My extra data started at column 17991 and was 5 characters long, which suggests a pattern of the corrupt files being 8n+4 characters long.

invidian commented 4 years ago

I hit this issue many times as well. It's very annoying.

jiasli commented 4 years ago

Sorry for the inconvenience caused. Terraform azurerm provider also uses az account get-access-token internally (src):

err := jsonUnmarshalAzCmd(&token, "account", "get-access-token", "--resource", endpoint, "--subscription", subscriptionId, "-o=json")

According to https://stackoverflow.com/a/186464/2199657, there is no cross-platform way for file locking in Python built-in libraries. We will evaluate portalocker and see if we can incorporate it.
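As a side note, a different technique than file locking, writing to a temporary file and renaming it over the target, can be done with the standard library alone. The sketch below is not portalocker and not what Azure CLI does; `write_json_atomically` is a hypothetical helper. It only ensures a reader never sees a mix of two writes. The last writer still wins, so it does not replace a lock for read-modify-write updates.

```python
import json
import os
import tempfile


def write_json_atomically(path, data):
    """Write `data` as JSON to `path` without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the same directory so the rename stays on
    # one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp_file:
            json.dump(data, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        # os.replace swaps the whole file in a single step.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

On Windows, `os.replace` can raise `PermissionError` while another process holds the target open, so a real implementation would likely need a retry or a lock on top of this.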

For now you may refer to https://docs.microsoft.com/en-us/cli/azure/use-cli-effectively#concurrent-builds for concurrent executions of Azure CLI.

panmanphil commented 4 years ago

I am seeing this now with the latest Visual Studio 2019, 16.6.2 and .NET Core 3.1. I notice it every day the first time I try to launch a project in Visual Studio, though if I clear the account with az account clear and then use a device code login, I'll be OK for the day. If I remove the characters from the position shown in the error to the end of the file, the tokens are still valid.

jiasli commented 4 years ago

Hi @panmanphil, to help us better understand your issue,

  1. Could you elaborate how you are using Azure CLI in Visual Studio?
  2. If you are seeing a corrupted file, could you share the name and skeleton of the file? Please remove sensitive information.

panmanphil commented 4 years ago

Part one of the answer is that I first notice that the file is corrupted from within Visual Studio. When I start a debug session, the normal startup, which gets a token for keyvault with my identity instead of the managed service identity in Azure, is unable to load the file from ~/.azure/accessTokens.json due to a newtonsoft.json error. Then I go to the command line, type az login, and get the message this issue refers to.

I haven't gotten the error today, I'll upload a redacted copy of the json file the next time I see it happen.

jiasli commented 4 years ago

I am sorry, but I still don't quite get how you are using Azure CLI. Are you directly reading ~/.azure/accessTokens.json or sub-processing with az account get-access-token?

Anyway, both cases may trigger this issue. We will try to use portalocker to avoid it.

panmanphil commented 4 years ago

az login uses these tokens directly, no? I have to use az account clear to be able to log in with either az login or Visual Studio. I don't know what tool caused the corruption, as I use both the command line and Visual Studio freely during the day. Here is an example of a corrupted file. The last few bytes are the problem.

[{"tokenType": "Bearer", "expiresIn": 3599, "expiresOn": "2020-06-18 16:49:25.952665", "resource": "https://management.core.windows.net/", "accessToken": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "refreshToken": "xxxxxxxxxxxxxxxxxxxxxxxxxxx", "oid": "xxxxxxx", "userId": "xxxxxx@xxxxxx.onmicrosoft.com", "isMRRT": true, "_clientId": "xxxxxxx", "_authority": "https://login.microsoftonline.com/common"}, {"tokenType": "Bearer", "expiresIn": 3599, "expiresOn": "2020-06-19 14:38:14.368807", "resource": "https://management.core.windows.net/", "accessToken": "xxxxxxxxxxxxxxxxx", "refreshToken": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "oid": "xxxxxxxxx", "userId": "xxxxxxxx@xxxxxxxx.onmicrosoft.com", "isMRRT": true, "_clientId": "xxxxxxxxxxxxxxxxxxxxxx", "_authority": "https://login.microsoftonline.com/xxxxxxxxxxxxxxxxxxxxx"}, {"tokenType": "Bearer", "expiresIn": 3599, "expiresOn": "2020-06-22 17:42:47.852873", "resource": "https://database.windows.net/", "accessToken": "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx", "refreshToken": "xxxxxxxxxxxxxxxxxxxxxxxxxx", "oid": "xxxxxxx", "userId": "xxxxxx@xxxx.onmicrosoft.com", "isMRRT": true, "_clientId": "xxxxxxxx", "_authority": "https://login.microsoftonline.com/xxxxxxxxxxxxxxxxx"}, {"tokenType": "Bearer", "expiresIn": 3599, "expiresOn": "2020-06-22 21:49:06.397180", "resource": "https://vault.azure.net", "accessToken": "xxxxxxxxxxxx", "refreshToken": "xxxxxxx", "userId": "xxxxxx@xxxxx.onmicrosoft.com", "isMRRT": true, "_clientId": "xxxxxxxxx", "_authority": "https://login.microsoftonline.com/xxx"}]7d"}]

jiasli commented 4 years ago

Visual Studio doesn't have any direct integration with Azure CLI, and doesn't share the credential cache with Azure CLI. I guess you are using some plugin to make it work?

az login does update those tokens. If 2 instances of Azure CLI are running concurrently, this issue will happen. Please make sure there is no background Azure CLI running. If there is, please set the AZURE_CONFIG_DIR environment variable following https://docs.microsoft.com/en-us/cli/azure/use-cli-effectively#concurrent-builds to isolate each instance for now.
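A rough Python sketch of that isolation, for callers that spawn the CLI as a subprocess. It assumes Python 3.8+ (for `dirs_exist_ok`); `run_with_isolated_config` is a hypothetical helper, and the demo runs a stand-in command rather than `az` so that it works without Azure CLI installed.

```python
import os
import shutil
import subprocess
import sys
import tempfile


def run_with_isolated_config(command, source_dir=None):
    """Run `command` with AZURE_CONFIG_DIR pointing at a private copy."""
    source_dir = (
        source_dir
        or os.environ.get("AZURE_CONFIG_DIR")
        or os.path.expanduser("~/.azure")
    )
    with tempfile.TemporaryDirectory(prefix="azconfig.") as sandbox:
        if os.path.isdir(source_dir):
            # Copy the existing login state so the child is already signed in.
            shutil.copytree(source_dir, sandbox, dirs_exist_ok=True)
        env = dict(os.environ, AZURE_CONFIG_DIR=sandbox)
        return subprocess.run(command, env=env, capture_output=True, text=True)


# Stand-in command that only prints the directory it was given.
with tempfile.TemporaryDirectory() as empty_source:
    demo = run_with_isolated_config(
        [sys.executable, "-c", "import os; print(os.environ['AZURE_CONFIG_DIR'])"],
        source_dir=empty_source,
    )
```

With Azure CLI installed, the call would look like `run_with_isolated_config(["az", "account", "get-access-token", "-o", "json"])`. Any token refreshed inside the sandbox is thrown away when the temporary directory is removed, so this trades cache reuse for isolation.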

panmanphil commented 4 years ago

We are using the azure cli to login with this library to aid in local development with managed service identities. Typically the login is once per day, not concurrent. https://docs.microsoft.com/en-us/azure/key-vault/general/service-to-service-authentication#local-development-authentication

jiasli commented 4 years ago

@panmanphil, thanks for the detailed information. Not exactly az login, but the command az account get-access-token is run concurrently by AzureCliAccessTokenProvider:

https://github.com/Azure/azure-sdk-for-net/blob/5d331813d381c133cb50a4f9214b4c901bd133a4/sdk/mgmtcommon/AppAuthentication/Azure.Services.AppAuthentication/TokenProviders/AzureCliAccessTokenProvider.cs#L25

        private const string GetTokenCommand = "az account get-access-token -o json";

As you can see from the token, it usually expires in 1 hour, so your issue should happen 1 hour after the login, instead of 1 day.

"expiresIn": 3599,

When the token is going to expire, ADAL, used by Azure CLI, will refresh the cache:

https://github.com/AzureAD/azure-activedirectory-library-for-python/blob/6f0c4755658fbbacf50de684c16eb378d1dbfb92/adal/cache_driver.py#L166-L189

        if is_resource_specific and now_plus_buffer > expiry_date:
            if TokenResponseFields.REFRESH_TOKEN in entry:
                self._log.info('Cached token is expired at %(date)s.  Refreshing',
                               {"date": expiry_date})
                return self._refresh_expired_entry(entry)

Then CLI will save the ADAL cache to a JSON file. When there are two concurrent invocations of Azure CLI, conflicts will occur for the token cache.

Could you check if your client application is multi-threaded and two threads are calling AzureServiceTokenProvider concurrently?

panmanphil commented 4 years ago

Starting to make more sense now. Though the mechanics of threading in a .NET Core web app are a little vague to me, I'd say yes, it's multithreaded. Probably a bigger factor is that we run multiple web apps and functions locally at the same time (a service-oriented architecture), and any of them could trigger the refresh token logic you found.

jiasli commented 4 years ago

We currently have a beta release for Windows which uses MSAL as the authentication library. It already contains the file locking mechanism. But we haven't fully tested it yet. (Will test it in #14070.)

You may want to give it a try: https://docs.microsoft.com/en-us/cli/azure/install-azure-cli-windows?view=azure-cli-latest&tabs=azure-cli

qwordy commented 4 years ago

I keep getting this error.

Failed to load token files. If you have a repro, please log an issue at https://github.com/Azure/azure-cli/issues. At the same time, you can clean up by running 'az account clear' and then 'az login'. (Inner Error: Failed to parse C:\Users\fey\.azure\accessTokens.json with exception:
    Extra data: line 1 column 15818 (char 15817))

qwordy commented 4 years ago

Any plan to fix it?

nfx commented 3 years ago

subscribing to updates

invidian commented 3 years ago

@nfx there is a subscribe button in the right column in the web UI, which you can use to avoid notifying existing subscribers 😉