Azure / azure-sdk-for-python


ClientAuthenticationError: Server failed to authenticate the request - When submitting a training job #31263

Closed waqassiddiqi closed 2 weeks ago

waqassiddiqi commented 1 year ago

Describe the bug Submitting a training job fails when a service principal is used for authentication. The service principal was given Contributor access to the Azure ML workspace, plus Storage Blob Data Contributor and Storage File Data SMB Share Contributor access to the default storage account associated with the workspace. Below is the code we use to set the service principal as the authentication method and submit the job:

import os

from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

os.environ["AZURE_CLIENT_ID"] = service_principal_id
os.environ["AZURE_TENANT_ID"] = tenant_id
os.environ["AZURE_CLIENT_SECRET"] = service_principal_secret

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="subscription-id",
    resource_group_name="rg",
    workspace_name="workspace-dev",
)

job = command(
    experiment_name='nostra test',
    code="./",
    command="python main.py",
    environment="sklearn-1.0@latest",
    compute="local"
)

submitted_job = ml_client.create_or_update(job)
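
Worth noting: DefaultAzureCredential picks up the service principal from those environment variables, so a misspelled or empty variable makes it silently fall through to a different credential instead of failing. A minimal stdlib-only sketch of a fail-fast check (the helper name is our own, not part of the SDK):

```python
import os

# Variables that DefaultAzureCredential's EnvironmentCredential reads
# for a service principal with a client secret.
REQUIRED = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")

def missing_sp_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]

# Stand-in mapping for illustration; real code would call missing_sp_settings()
# with no argument to inspect os.environ before constructing the credential.
demo = {"AZURE_CLIENT_ID": "app-id", "AZURE_TENANT_ID": "tenant-id"}
print(missing_sp_settings(demo))  # → ['AZURE_CLIENT_SECRET']
```

Raising on a non-empty result before building MLClient turns a confusing downstream auth failure into an immediate, actionable error.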

To Reproduce Steps to reproduce the behavior:

  1. Create a service principal via Azure portal
  2. Give the service principal Storage Blob Data Contributor and Storage File Data SMB Share Contributor access to the blob storage associated with the ML workspace, and Contributor access to the ML workspace
  3. Submit job using Azure ML Python SDK v2

Expected behavior The training job should be submitted without error.

Screenshots Below is the complete stack trace:

Uploading container_1690160823017_0002_01_000001 (3.21 MBs):  23%|██▎       | 727760/3206409 [00:02<00:09, 259538.07it/s] 
---------------------------------------------------------------------------
ClientAuthenticationError                 Traceback (most recent call last)
Cell In [95], line 9
      1 job = command(
      2         experiment_name='nostra test',
      3         code="./",
   (...)
      6         compute="local"
      7     )
----> 9 submitted_job = ml_client.create_or_update(job)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_ml_client.py:903, in MLClient.create_or_update(self, entity, **kwargs)
    887 def create_or_update(
    888     self,
    889     entity: T,
    890     **kwargs,
    891 ) -> T:
    892     """Creates or updates an Azure ML resource.
    893 
    894     :param entity: The resource to create or update.
   (...)
    900         , ~azure.ai.ml.entities.Environment, ~azure.ai.ml.entities.Component, ~azure.ai.ml.entities.Datastore]
    901     """
--> 903     return _create_or_update(entity, self._operation_container.all_operations, **kwargs)

File ~/cluster-env/clonedenv/lib/python3.10/functools.py:889, in singledispatch.<locals>.wrapper(*args, **kw)
    885 if not args:
    886     raise TypeError(f'{funcname} requires at least '
    887                     '1 positional argument')
--> 889 return dispatch(args[0].__class__)(*args, **kw)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_ml_client.py:961, in _(entity, operations, **kwargs)
    958 @_create_or_update.register(Job)
    959 def _(entity: Job, operations, **kwargs):
    960     module_logger.debug("Creating or updating job")
--> 961     return operations[AzureMLResourceType.JOB].create_or_update(entity, **kwargs)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/core/tracing/decorator.py:76, in distributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
     74 span_impl_type = settings.tracing_implementation()
     75 if span_impl_type is None:
---> 76     return func(*args, **kwargs)
     78 # Merge span is parameter is set, but only if no explicit parent are passed
     79 if merge_span and not passed_in_parent:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_telemetry/activity.py:337, in monitor_with_telemetry_mixin.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
    335 dimensions = {**parameter_dimensions, **(custom_dimensions or {})}
    336 with log_activity(logger, activity_name or f.__name__, activity_type, dimensions) as activityLogger:
--> 337     return_value = f(*args, **kwargs)
    338     if not parameter_dimensions:
    339         # collect from return if no dimensions from parameter
    340         activityLogger.activity_info.update(_collect_from_return_value(return_value))

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_job_operations.py:609, in JobOperations.create_or_update(self, job, description, compute, tags, experiment_name, skip_validation, **kwargs)
    607     log_and_raise_error(ex)
    608 else:
--> 609     raise ex

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_job_operations.py:541, in JobOperations.create_or_update(self, job, description, compute, tags, experiment_name, skip_validation, **kwargs)
    538     self._validate(job, raise_on_failure=True)
    540 # Create all dependent resources
--> 541 self._resolve_arm_id_or_upload_dependencies(job)
    543 git_props = get_git_properties()
    544 # Do not add git props if they already exist in job properties.
    545 # This is for update specifically-- if the user switches branches and tries to update
    546 # their job, the request will fail since the git props will be repopulated.
    547 # MFE does not allow existing properties to be updated, only for new props to be added

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_job_operations.py:890, in JobOperations._resolve_arm_id_or_upload_dependencies(self, job)
    880 def _resolve_arm_id_or_upload_dependencies(self, job: Job) -> None:
    881     """This method converts name or name:version to ARM id. Or it
    882     registers/uploads nested dependencies.
    883 
   (...)
    887     :rtype: Job
    888     """
--> 890     self._resolve_arm_id_or_azureml_id(job, self._orchestrators.get_asset_arm_id)
    892     if isinstance(job, PipelineJob):
    893         # Resolve top-level inputs
    894         self._resolve_pipeline_job_inputs(job, job._base_path)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_job_operations.py:1103, in JobOperations._resolve_arm_id_or_azureml_id(self, job, resolver)
   1101     job.compute = self._resolve_compute_id(resolver, job.compute)
   1102 elif isinstance(job, Command):
-> 1103     job = self._resolve_arm_id_for_command_job(job, resolver)
   1104 elif isinstance(job, ImportJob):
   1105     job = self._resolve_arm_id_for_import_job(job, resolver)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_job_operations.py:1140, in JobOperations._resolve_arm_id_for_command_job(self, job, resolver)
   1131     raise ValidationException(
   1132         message=msg.format(job.code),
   1133         target=ErrorTarget.JOB,
   (...)
   1136         error_type=ValidationErrorType.INVALID_VALUE,
   1137     )
   1139 if job.code is not None and not is_ARM_id_for_resource(job.code, AzureMLResourceType.CODE):
-> 1140     job.code = resolver(
   1141         Code(base_path=job._base_path, path=job.code),
   1142         azureml_type=AzureMLResourceType.CODE,
   1143     )
   1144 job.environment = resolver(job.environment, azureml_type=AzureMLResourceType.ENVIRONMENT)
   1145 job.compute = self._resolve_compute_id(resolver, job.compute)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_operation_orchestrator.py:232, in OperationOrchestrator.get_asset_arm_id(self, asset, azureml_type, register_asset, sub_workspace_resource)
    229 try:
    230     # TODO: once the asset redesign is finished, this logic can be replaced with unified API
    231     if azureml_type == AzureMLResourceType.CODE and isinstance(asset, Code):
--> 232         result = self._get_code_asset_arm_id(asset, register_asset=register_asset)
    233     elif azureml_type == AzureMLResourceType.ENVIRONMENT and isinstance(asset, Environment):
    234         result = self._get_environment_arm_id(asset, register_asset=register_asset)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_operation_orchestrator.py:298, in OperationOrchestrator._get_code_asset_arm_id(self, code_asset, register_asset)
    296     return uploaded_code_asset
    297 except (MlException, HttpResponseError) as e:
--> 298     raise e
    299 except Exception as e:
    300     raise AssetException(
    301         message=f"Error with code: {e}",
    302         target=ErrorTarget.ASSET,
   (...)
    305         error_category=ErrorCategory.SYSTEM_ERROR,
    306     )

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_operation_orchestrator.py:273, in OperationOrchestrator._get_code_asset_arm_id(self, code_asset, register_asset)
    271 self._validate_datastore_name(code_asset.path)
    272 if register_asset:
--> 273     code_asset = self._code_assets.create_or_update(code_asset)
    274     return code_asset.id
    275 sas_info = get_storage_info_for_non_registry_asset(
    276     service_client=self._code_assets._service_client,
    277     workspace_name=self._operation_scope.workspace_name,
   (...)
    280     resource_group=self._operation_scope.resource_group_name,
    281 )

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_telemetry/activity.py:263, in monitor_with_activity.<locals>.monitor.<locals>.wrapper(*args, **kwargs)
    260 @functools.wraps(f)
    261 def wrapper(*args, **kwargs):
    262     with log_activity(logger, activity_name or f.__name__, activity_type, custom_dimensions):
--> 263         return f(*args, **kwargs)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_code_operations.py:180, in CodeOperations.create_or_update(self, code)
    173     if str(ex) == ASSET_PATH_ERROR:
    174         raise AssetPathException(
    175             message=CHANGED_ASSET_PATH_MSG,
    176             target=ErrorTarget.CODE,
    177             no_personal_data_message=CHANGED_ASSET_PATH_MSG_NO_PERSONAL_DATA,
    178             error_category=ErrorCategory.USER_ERROR,
    179         )
--> 180 raise ex

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/operations/_code_operations.py:128, in CodeOperations.create_or_update(self, code)
    125         sas_uri = sas_info["sas_uri"]
    126         blob_uri = sas_info["blob_uri"]
--> 128 code, _ = _check_and_upload_path(
    129     artifact=code,
    130     asset_operations=self,
    131     sas_uri=sas_uri,
    132     artifact_type=ErrorTarget.CODE,
    133     show_progress=self._show_progress,
    134     blob_uri=blob_uri,
    135 )
    137 # For anonymous code, if the code already exists in storage, we reuse the name,
    138 # version stored in the storage metadata so the same anonymous code won't be created again.
    139 if code._is_anonymous:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_artifacts/_artifact_utilities.py:404, in _check_and_upload_path(artifact, asset_operations, artifact_type, datastore_name, sas_uri, show_progress, blob_uri)
    402 if not path.is_absolute():
    403     path = Path(artifact.base_path, path).resolve()
--> 404 uploaded_artifact = _upload_to_datastore(
    405     asset_operations._operation_scope,
    406     asset_operations._datastore_operation,
    407     path,
    408     datastore_name=datastore_name,
    409     asset_name=artifact.name,
    410     asset_version=str(artifact.version),
    411     asset_hash=getattr(artifact, "_upload_hash", None),
    412     sas_uri=sas_uri,
    413     artifact_type=artifact_type,
    414     show_progress=show_progress,
    415     ignore_file=getattr(artifact, "_ignore_file", None),
    416     blob_uri=blob_uri,
    417 )
    418 indicator_file = uploaded_artifact.indicator_file  # reference to storage contents
    419 if artifact._is_anonymous:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_artifacts/_artifact_utilities.py:300, in _upload_to_datastore(operation_scope, datastore_operation, path, artifact_type, datastore_name, show_progress, asset_name, asset_version, asset_hash, ignore_file, sas_uri, blob_uri)
    298 if not asset_hash:
    299     asset_hash = get_object_hash(path, ignore_file)
--> 300 artifact = upload_artifact(
    301     str(path),
    302     datastore_operation,
    303     operation_scope,
    304     datastore_name,
    305     show_progress=show_progress,
    306     asset_hash=asset_hash,
    307     asset_name=asset_name,
    308     asset_version=asset_version,
    309     ignore_file=ignore_file,
    310     sas_uri=sas_uri,
    311 )
    312 if blob_uri:
    313     artifact.storage_account_url = blob_uri

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_artifacts/_artifact_utilities.py:180, in upload_artifact(local_path, datastore_operation, operation_scope, datastore_name, asset_hash, show_progress, asset_name, asset_version, ignore_file, sas_uri)
    177     datastore_info = get_datastore_info(datastore_operation, datastore_name)
    178     storage_client = get_storage_client(**datastore_info)
--> 180 artifact_info = storage_client.upload(
    181     local_path,
    182     asset_hash=asset_hash,
    183     show_progress=show_progress,
    184     name=asset_name,
    185     version=asset_version,
    186     ignore_file=ignore_file,
    187 )
    189 artifact = ArtifactStorageInfo(
    190     name=artifact_info["name"],
    191     version=artifact_info["version"],
   (...)
    199     is_file=Path(local_path).is_file(),
    200 )
    201 return artifact

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_artifacts/_blob_storage_helper.py:103, in BlobStorageClient.upload(self, source, name, version, ignore_file, asset_hash, show_progress)
    101 # start upload
    102 if os.path.isdir(source):
--> 103     upload_directory(
    104         storage_client=self,
    105         source=source,
    106         dest=asset_id,
    107         msg=msg,
    108         show_progress=show_progress,
    109         ignore_file=ignore_file,
    110     )
    111 else:
    112     self.indicator_file = dest

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_utils/_asset_utils.py:645, in upload_directory(storage_client, source, dest, msg, show_progress, ignore_file)
    643 with tqdm(total=total_size, desc=msg, ascii=is_windows) as pbar:
    644     for future in as_completed(futures_dict):
--> 645         future.result()  # access result to propagate any exceptions
    646         file_path_name = futures_dict[future][0]
    647         pbar.update(size_dict.get(file_path_name) or 0)

File ~/cluster-env/clonedenv/lib/python3.10/concurrent/futures/_base.py:451, in Future.result(self, timeout)
    449     raise CancelledError()
    450 elif self._state == FINISHED:
--> 451     return self.__get_result()
    453 self._condition.wait(timeout)
    455 if self._state in [CANCELLED, CANCELLED_AND_NOTIFIED]:

File ~/cluster-env/clonedenv/lib/python3.10/concurrent/futures/_base.py:403, in Future.__get_result(self)
    401 if self._exception:
    402     try:
--> 403         raise self._exception
    404     finally:
    405         # Break a reference cycle with the exception in self._exception
    406         self = None

File ~/cluster-env/clonedenv/lib/python3.10/concurrent/futures/thread.py:58, in _WorkItem.run(self)
     55     return
     57 try:
---> 58     result = self.fn(*self.args, **self.kwargs)
     59 except BaseException as exc:
     60     self.future.set_exception(exc)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/ai/ml/_utils/_asset_utils.py:536, in upload_file(storage_client, source, dest, msg, size, show_progress, in_directory, callback)
    528             storage_client.file_client.upload_data(
    529                 data=data.read(),
    530                 overwrite=True,
   (...)
    533                 max_concurrency=MAX_CONCURRENCY,
    534             )
    535         elif type(storage_client).__name__ == BLOB_STORAGE_CLIENT_NAME:
--> 536             storage_client.container_client.upload_blob(
    537                 name=dest,
    538                 data=data,
    539                 validate_content=validate_content,
    540                 overwrite=storage_client.overwrite,
    541                 raw_response_hook=callback,
    542                 max_concurrency=MAX_CONCURRENCY,
    543                 connection_timeout=DEFAULT_CONNECTION_TIMEOUT,
    544             )
    546 storage_client.uploaded_file_count += 1

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/core/tracing/decorator.py:76, in distributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
     74 span_impl_type = settings.tracing_implementation()
     75 if span_impl_type is None:
---> 76     return func(*args, **kwargs)
     78 # Merge span is parameter is set, but only if no explicit parent are passed
     79 if merge_span and not passed_in_parent:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/storage/blob/_container_client.py:979, in ContainerClient.upload_blob(self, name, data, blob_type, length, metadata, **kwargs)
    977 timeout = kwargs.pop('timeout', None)
    978 encoding = kwargs.pop('encoding', 'UTF-8')
--> 979 blob.upload_blob(
    980     data,
    981     blob_type=blob_type,
    982     length=length,
    983     metadata=metadata,
    984     timeout=timeout,
    985     encoding=encoding,
    986     **kwargs
    987 )
    988 return blob

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/core/tracing/decorator.py:76, in distributed_trace.<locals>.decorator.<locals>.wrapper_use_tracer(*args, **kwargs)
     74 span_impl_type = settings.tracing_implementation()
     75 if span_impl_type is None:
---> 76     return func(*args, **kwargs)
     78 # Merge span is parameter is set, but only if no explicit parent are passed
     79 if merge_span and not passed_in_parent:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/storage/blob/_blob_client.py:731, in BlobClient.upload_blob(self, data, blob_type, length, metadata, **kwargs)
    724 options = self._upload_blob_options(
    725     data,
    726     blob_type=blob_type,
    727     length=length,
    728     metadata=metadata,
    729     **kwargs)
    730 if blob_type == BlobType.BlockBlob:
--> 731     return upload_block_blob(**options)
    732 if blob_type == BlobType.PageBlob:
    733     return upload_page_blob(**options)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/storage/blob/_upload_helpers.py:197, in upload_block_blob(client, data, stream, length, overwrite, headers, validate_content, max_concurrency, blob_settings, encryption_options, **kwargs)
    195 except HttpResponseError as error:
    196     try:
--> 197         process_storage_error(error)
    198     except ResourceModifiedError as mod_error:
    199         if not overwrite:

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/azure/storage/blob/_shared/response_handlers.py:181, in process_storage_error(storage_error)
    178 error.args = (error.message,)
    179 try:
    180     # `from None` prevents us from double printing the exception (suppresses generated layer error context)
--> 181     exec("raise error from None")   # pylint: disable=exec-used # nosec
    182 except SyntaxError:
    183     raise error

File <string>:1

ClientAuthenticationError: Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:0c4d5c9a-101e-0026-0ddd-bd256e000000
Time:2023-07-24T03:17:43.5053201Z
ErrorCode:AuthenticationFailed
authenticationerrordetail:Signature did not match. String to sign used was rcwl
2023-07-24T03:07:40Z
2023-07-24T11:17:40Z
/blob/mlwsdev/filecache
313df90b-deec-4491-b8b0-1cc503a9819a
eb2e9916-19a4-4566-84f3-5a6da1f55beb
2023-07-24T01:43:59Z
2023-07-25T09:53:59Z
b
2019-07-07

2021-10-04
c

Content: <?xml version="1.0" encoding="utf-8"?><Error><Code>AuthenticationFailed</Code><Message>Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:0c4d5c9a-101e-0026-0ddd-bd256e000000
Time:2023-07-24T03:17:43.5053201Z</Message><AuthenticationErrorDetail>Signature did not match. String to sign used was rcwl
2023-07-24T03:07:40Z
2023-07-24T11:17:40Z
/blob/mlwsdev/filecache
313df90b-deec-4491-b8b0-1cc503a9819a
eb2e9916-19a4-4566-84f3-5a6da1f55beb
2023-07-24T01:43:59Z
2023-07-25T09:53:59Z
b
2019-07-07

2021-10-04
c

</AuthenticationErrorDetail></Error>
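
For context on the AuthenticationErrorDetail above: the storage service recomputes the SAS signature from the newline-separated "string to sign" and rejects the request when it differs from the signature in the URL, which typically points to a rotated account key or a mismatched/stale token rather than a missing RBAC role. As a hedged illustration (a sketch of the signing scheme, not the service's exact implementation, with a made-up key), a SAS signature is Base64(HMAC-SHA256(base64-decoded account key, UTF-8 string-to-sign)):

```python
import base64
import hashlib
import hmac

def sas_signature(string_to_sign: str, account_key_b64: str) -> str:
    """Base64(HMAC-SHA256(decoded account key, UTF-8 string-to-sign))."""
    key = base64.b64decode(account_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode()

# Illustrative only: a fake key and a few of the fields from the error above.
demo_key = base64.b64encode(b"not-a-real-account-key").decode()
string_to_sign = "\n".join([
    "rcwl",                     # permissions
    "2023-07-24T03:07:40Z",     # start time
    "2023-07-24T11:17:40Z",     # expiry time
    "/blob/mlwsdev/filecache",  # canonicalized resource (truncated here)
])
print(sas_signature(string_to_sign, demo_key))
```

If the token was minted with a key that has since been rotated, or any field of the string-to-sign differs between client and server, the recomputed signature will not match and the request fails exactly as in this trace.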
xiangyan99 commented 1 year ago

Thanks for the feedback, we’ll investigate asap.

github-actions[bot] commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @simorenoh @gahl-levy @bambriz @azureml-github @Azure/azure-ml-sdk.

waqassiddiqi commented 1 year ago

Any update on this?

shivmistry605 commented 1 year ago

Faced the same issue when submitting a training job from a Synapse notebook: the upload to the container begins but then fails with an authentication error. Is there any update on this yet?

achauhan-scc commented 2 weeks ago

Closing as obsolete.