Azure / mlops-v2

Azure MLOps (v2) solution accelerators. Enterprise-ready templates to deploy your machine learning models on the Azure platform.
https://learn.microsoft.com/en-us/azure/machine-learning/concept-model-management-and-deployment
MIT License

Authorisation problem when deploying training pipeline #66

Closed andrewblance closed 1 year ago

andrewblance commented 1 year ago

Hello,

I have been following your quick start guide and have reached the stage where I need to deploy the training pipeline deploy-model-training-pipeline.yml on Azure DevOps.

When I run this, it gets as far as the "Run pipeline in AML" step in DevOps, and then I get this error:

If there is an Authorization error, check your Azure KeyVault secret named kvmonitoringspkey. Terraform might put single quotation marks around the secret. Remove the single quotes and the secret should work.
.create table mlmonitoring (['Sno']: int, ['Age']: int, ['Sex']: string, ['Job']: int, ['Housing']: string, ['Saving accounts']: string, ['Checking account']: string, ['Credit amount']: int, ['Duration']: int, ['Purpose']: string, ['Risk']: string, ['timestamp']: datetime)
Cleaning up all outstanding Run operations, waiting 300.0 seconds
2 items cleaning up...
Cleanup took 0.08368873596191406 seconds
Traceback (most recent call last):
  File "/azureml-envs/XXXXXX/lib/python3.7/site-packages/azure/kusto/data/security.py", line 68, in acquire_authorization_header
    return _get_header_from_dict(self.token_provider.get_token())
  File "/azureml-envs/XXXXXX/lib/python3.7/site-packages/azure/kusto/data/_token_providers.py", line 123, in get_token
    token = self._get_token_impl()
  File "/azureml-envs/XXXXXX/lib/python3.7/site-packages/azure/kusto/data/_token_providers.py", line 554, in _get_token_impl
    return self._valid_token_or_throw(token)
  File "/azureml-envs/XXXXXXX/lib/python3.7/site-packages/azure/kusto/data/_token_providers.py", line 201, in _valid_token_or_throw
    raise KustoClientError(message)
azure.kusto.data.exceptions.KustoClientError: ApplicationKeyTokenProvider - failed to obtain a token. 
invalid_client
AADSTS7000215: Invalid client secret provided. Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app 'XXXXX'.
Trace ID: XXXXXX
Correlation ID: XXXXXXX
Timestamp: 2022-11-01 15:46:31Z

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "prep.py", line 83, in <module>
    main()
  File "prep.py", line 80, in main
    log_training_data(df, args.table_name)
  File "prep.py", line 35, in log_training_data
    collector.batch_collect(df)
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/obs/collector.py", line 158, in batch_collect
    self.create_table_and_mapping()
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/obs/collector.py", line 132, in create_table_and_mapping
    self.kusto_client.execute_mgmt(self.database_name, CREATE_TABLE_COMMAND)
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/azure/kusto/data/client.py", line 891, in execute_mgmt
    return self._execute(self._mgmt_endpoint, database, query, None, self._mgmt_default_timeout, properties)
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/azure/kusto/data/client.py", line 959, in _execute
    request_headers["Authorization"] = self._aad_helper.acquire_authorization_header()
  File "/azureml-envs/XXXXX/lib/python3.7/site-packages/azure/kusto/data/security.py", line 72, in acquire_authorization_header
    raise KustoAuthenticationError(self.token_provider.name(), error, **kwargs)
azure.kusto.data.exceptions.KustoAuthenticationError: KustoAuthenticationError('ApplicationKeyTokenProvider', 'KustoClientError("ApplicationKeyTokenProvider - failed to obtain a token. \ninvalid_client\nAADSTS7000215: Invalid client secret provided. Ensure the secret being sent in the request is the client secret value, not the client secret ID, for a secret added to app 'XXXXXXX'.\r\nTrace ID: XXXXXXXX\r\nTimestamp: 2022-11-01 15:46:31Z")', '{'authority': 'XXXXXX', 'client_id': 'XXXXX', 'kusto_uri': 'https://adxmlopsv286309prod.uksouth.kusto.windows.net'}')

I have done some investigating, but I have not been able to find the cause.

Do you have any advice on how I can fix this error?

setuc commented 1 year ago

@nicoleserafino Could you help us on this issue?

andrewblance commented 1 year ago

Have I potentially missed a step (or is the documentation currently missing a step?) where this variable has to be declared?

Nothing is declared in the data-explorer or key-vault modules, and secrets aren't discussed in either run-terraform-apply.yml or tf-ado-deploy-infra.yml. Also, there aren't any variables in the pipeline:

(screenshot: the pipeline's variables list is empty)

Am I missing something obvious maybe? How is this secret being passed to $(CLIENT_SECRET)?


setuc commented 1 year ago

@andrewblance Let me review this. You are right that this issue is related to the monitoring via the Data Explorer. If you don't need the monitoring or data drift detection, please disable the setup of Azure Data Explorer. If you want to make it work, you will need to set up the client_secret. We haven't documented that step correctly either. Let us review the steps and document them, so we can show how to correctly pass the secret to the config file.

cindyweng commented 1 year ago

@andrewblance You can find the client secret being created in this template: https://github.com/Azure/mlops-templates/blob/main/templates/infra/create-sp-variables.yml

$(CLIENT_SECRET) becomes an environment variable in the ADO agent when you run the infra pipeline (see line 50 here: https://github.com/Azure/mlops-project-template/blob/main/infrastructure/terraform/pipelines/tf-ado-deploy-infra.yml)
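For context, here is a minimal sketch of how a secret pipeline variable reaches a script step in Azure DevOps (the step name and the ARM_CLIENT_SECRET env name are hypothetical; only $(CLIENT_SECRET) comes from the templates linked above):

```yaml
steps:
  - script: |
      # Never echo the secret itself; its length is enough to sanity-check it.
      echo "Client secret length: ${#ARM_CLIENT_SECRET}"
    displayName: Check client secret is populated
    env:
      # Secret variables are not exposed to script steps automatically;
      # they must be mapped into the environment explicitly.
      ARM_CLIENT_SECRET: $(CLIENT_SECRET)
```

If the length printed is 0, the variable was never created, which points back at the create-sp-variables template not being invoked.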

However, I am not able to reproduce your problem :( When I run the pipeline and then look in key vault, I'm able to see a secret that's 56 characters long...

There is a separate issue that we are working to solve. It turns out that when you use Terraform to inject a secret into Key Vault, it puts single quotes around the secret value, and I thought that might be why your authentication is failing. We are going to migrate the key management to use the AZ CLI instead because of this, but you'll have to bear with us while we push out the changes.
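The quoting bug is easy to illustrate and work around. A minimal sketch (the secret value is made up; only the secret name kvmonitoringspkey comes from the error message above, and the vault name is a placeholder):

```shell
# What Key Vault can end up holding when Terraform adds literal quotes:
raw="'abc123xyz'"

# Strip one leading and one trailing single quote, if present:
clean="${raw#\'}"
clean="${clean%\'}"
echo "$clean"   # prints abc123xyz

# Against the real vault, the check-and-fix would look something like this
# (hypothetical vault name; the az CLI subcommands themselves do exist):
#   az keyvault secret show --vault-name <your-vault> --name kvmonitoringspkey --query value -o tsv
#   az keyvault secret set  --vault-name <your-vault> --name kvmonitoringspkey --value "$clean"
```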

Can you try running the sparse checkout again with the aml-cli-v2 classical example and rerunning the Terraform pipeline? We've made some changes to the IaC recently. Please check that "enable_monitoring" is set to true in the config-infra-prod.yml file, as we have it off by default in our main branch.
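For reference, the flag lives in config-infra-prod.yml; a trimmed sketch (other keys omitted, layout assumed from the repo's variable files):

```yaml
variables:
  # Set to true so Terraform provisions Azure Data Explorer for monitoring / drift.
  # It is false by default on the main branch.
  enable_monitoring: true
```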

Finally, delete your unused ADX resources :) They've been known to cost a pretty penny...

andrewblance commented 1 year ago

@cindyweng Thank you so much! That was incredibly useful!

I think I found the cause: the version of tf-ado-deploy-infra.yml I had did not contain that same line 50, so the create-sp-variables template was never called. The file must have been a little different when I originally cloned (the git history suggests this may be the case; apologies). I added it in and reran the sparse checkout. It works now!

I bumped into the single-quote issue you mentioned, but I created a new version of the secret and everything works now. Thank you for helping me here!

I note that the AZ CLI and the Python SDK versions differ slightly in features (there is no Python SDK pipeline to create an endpoint). I think I will try to build this myself once I work through the drift monitoring pipelines, but I was curious why the CLI version was created before the Python one. Is the CLI Microsoft's suggested method of creating pipelines and interacting with AML in scenarios like this?

setuc commented 1 year ago

@andrewblance Regarding your question on why the CLI was done first: it came down to the choice we had to make. We decided to focus on the CLI first and get it ready so that a larger number of folks can use it, especially since it doesn't require you to learn Python. Once that is ready, we can work on the Python SDK example to get similar examples into the repo.

We are also working out the logistics to allow contributions of various ML methods and ideas for improving the workflow. We don't want the repo to get so complicated that it loses its appeal.

Hope that helps. Happy to have a broader conversation with you in case you are interested in contributing.