Azure / AML-Kubernetes

AzureML customer managed k8s compute samples
MIT License

"az ml online-deployment" with non defaultinstancetype fails with Orleans error message #249

Closed joaocc closed 2 years ago

joaocc commented 2 years ago

We are trying to deploy a new online endpoint (kubernetes) and deployment, using YAML files and scripts that worked well in the recent past. Now, creating or updating the deployment fails with the following error:

(InferencingClientCreateDeploymentFailed) InferencingClient HttpRequest error, error detail: Object reference not set to an instance of an object., internal stacktrace [ at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.ActionGrains.Validation.ComputeTargetValidationGrain.ValidateInstanceTypeAsync(String instanceTypeName, ResourceRequirements requirements, ComputeTargetData computeTargetData, CancellationToken token) in /mnt/vss/_work/1/s/src/Silo/Grains/ActionGrains/Validation/ComputeTargetValidationGrain.cs:line 86

Deployment config file

---
$schema: https://azuremlschemas.azureedge.net/latest/kubernetesOnlineDeployment.schema.json

endpoint_name: MY_AZML_ENDPOINT_NAME
type: kubernetes
name: MY_AZML_ONLINE_EP_DEP_NAME

instance_type: INST_TYPE_X11
# instance_type: defaultinstancetype
instance_count: 1

scale_settings:
  type: default

app_insights_enabled: false

model: "azureml:MY_AZML_MODEL_NAME:1"

environment:
  image: MY_ACR_NAME.azurecr.io/MY_IMAGE_NAME:MY_TAG

  inference_config:
    liveness_route:
      port: 5001
      path: /score
    readiness_route:
      port: 5001
      path: /score
    scoring_route:
      port: 5001
      path: /score

liveness_probe:
  period: 15

readiness_probe:
  period: 15

environment_variables:

  ENV_1: "var_1"

Log of the failed execution

# az ml online-deployment create -f ./azml.deployment.yml --all-traffic -o json --resource-group MY_RG_NAME --subscription MY_SUBS_ID --workspace-name AZML_WKS_NAME --verbose
All traffic will be set to deployment MY_AZML_ONLINE_EP_DEP_NAME once it has been provisioned.
If you interrupt this command or it times out while waiting for the provisioning, you can try to set all the traffic to this deployment later once its has been provisioned.
Check: endpoint MY_AZML_ENDPOINT_NAME exists
Request URL: 'https://management.azure.com/subscriptions/MY_SUBS_ID/resourceGroups/MY_RG_NAME/providers/Microsoft.MachineLearningServices/workspaces/AZML_WKS_NAME/onlineEndpoints/MY_AZML_ENDPOINT_NAME?api-version=REDACTED'
Request method: 'GET'
Request headers:
    'Accept': 'application/json'
    'x-ms-client-request-id': '2bce357c-ff0c-11ec-b617-1e008a0a5052'
    'User-Agent': 'azureml-cli-v2/2.5.0 azure-ai-ml/2.5.0 azsdk-python-mgmt-machinelearningservices/0.1.0 Python/3.10.5 (macOS-11.6.7-x86_64-i386-64bit)'
    'Authorization': 'REDACTED'
No body was attached to the request
Response status: 200
Response headers:
    'Cache-Control': 'no-cache'
    'Pragma': 'no-cache'
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json; charset=utf-8'
    'Content-Encoding': 'REDACTED'
    'Expires': '-1'
    'Vary': 'REDACTED'
    'x-ms-request-id': '3886eeb2-61eb-4d3d-90bd-df83f0dd8f45'
    'x-ms-ratelimit-remaining-subscription-reads': '11998'
    'Request-Context': 'REDACTED'
    'x-ms-response-type': 'REDACTED'
    'Strict-Transport-Security': 'REDACTED'
    'X-Content-Type-Options': 'REDACTED'
    'x-aml-cluster': 'REDACTED'
    'x-request-time': 'REDACTED'
    'Server-Timing': 'REDACTED'
    'x-ms-correlation-request-id': 'REDACTED'
    'x-ms-routing-request-id': 'REDACTED'
    'Date': 'Fri, 08 Jul 2022 22:20:30 GMT'
Request URL: 'https://management.azure.com/subscriptions/MY_SUBS_ID/resourceGroups/MY_RG_NAME/providers/Microsoft.MachineLearningServices/workspaces/AZML_WKS_NAME/environments/CliV2AnonymousEnvironment/versions/aeaea3e44f953b970b25ef88604b52c3?api-version=REDACTED'
Request method: 'PUT'
Request headers:
    'Content-Type': 'application/json'
    'Content-Length': '334'
    'Accept': 'application/json'
    'x-ms-client-request-id': '2bce357c-ff0c-11ec-b617-1e008a0a5052'
    'User-Agent': 'azureml-cli-v2/2.5.0 azure-ai-ml/2.5.0 azsdk-python-mgmt-machinelearningservices/0.1.0 Python/3.10.5 (macOS-11.6.7-x86_64-i386-64bit)'
    'Authorization': 'REDACTED'
A body is sent with the request
Response status: 201
Response headers:
    'Cache-Control': 'no-cache'
    'Pragma': 'no-cache'
    'Content-Length': '1231'
    'Content-Type': 'application/json; charset=utf-8'
    'Expires': '-1'
    'Location': 'REDACTED'
    'x-ms-ratelimit-remaining-subscription-writes': '1199'
    'Request-Context': 'REDACTED'
    'x-ms-response-type': 'REDACTED'
    'Strict-Transport-Security': 'REDACTED'
    'X-Content-Type-Options': 'REDACTED'
    'x-aml-cluster': 'REDACTED'
    'x-request-time': 'REDACTED'
    'Server-Timing': 'REDACTED'
    'x-ms-request-id': 'd67aebad-e93c-475b-acfb-17c4230ab48e'
    'x-ms-correlation-request-id': 'REDACTED'
    'x-ms-routing-request-id': 'REDACTED'
    'Date': 'Fri, 08 Jul 2022 22:20:32 GMT'
Request URL: 'https://management.azure.com/subscriptions/MY_SUBS_ID/resourceGroups/MY_RG_NAME/providers/Microsoft.MachineLearningServices/workspaces/AZML_WKS_NAME?api-version=REDACTED'
Request method: 'GET'
Request headers:
    'Accept': 'application/json'
    'x-ms-client-request-id': '2bce357c-ff0c-11ec-b617-1e008a0a5052'
    'User-Agent': 'azureml-cli-v2/2.5.0 azure-ai-ml/2.5.0 azsdk-python-mgmt-machinelearningservices/0.1.0 Python/3.10.5 (macOS-11.6.7-x86_64-i386-64bit)'
    'Authorization': 'REDACTED'
No body was attached to the request
Response status: 200
Response headers:
    'Cache-Control': 'no-cache'
    'Pragma': 'no-cache'
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json; charset=utf-8'
    'Content-Encoding': 'REDACTED'
    'Expires': '-1'
    'Vary': 'REDACTED'
    'x-ms-request-id': '8c9afa5f-e83b-4224-a9aa-519e12bf3e55'
    'x-ms-ratelimit-remaining-subscription-reads': '11999'
    'Request-Context': 'REDACTED'
    'x-ms-response-type': 'REDACTED'
    'Strict-Transport-Security': 'REDACTED'
    'X-Content-Type-Options': 'REDACTED'
    'x-aml-cluster': 'REDACTED'
    'x-request-time': 'REDACTED'
    'Server-Timing': 'REDACTED'
    'x-ms-correlation-request-id': 'REDACTED'
    'x-ms-routing-request-id': 'REDACTED'
    'Date': 'Fri, 08 Jul 2022 22:20:33 GMT'
Request URL: 'https://management.azure.com/subscriptions/MY_SUBS_ID/resourceGroups/MY_RG_NAME/providers/Microsoft.MachineLearningServices/workspaces/AZML_WKS_NAME/onlineEndpoints/MY_AZML_ENDPOINT_NAME/deployments/MY_AZML_ONLINE_EP_DEP_NAME?api-version=REDACTED'
Request method: 'PUT'
Request headers:
    'Content-Type': 'application/json'
    'Content-Length': '2057'
    'Accept': 'application/json'
    'x-ms-client-request-id': '2bce357c-ff0c-11ec-b617-1e008a0a5052'
    'User-Agent': 'azureml-cli-v2/2.5.0 azure-ai-ml/2.5.0 azsdk-python-mgmt-machinelearningservices/0.1.0 Python/3.10.5 (macOS-11.6.7-x86_64-i386-64bit)'
    'Authorization': 'REDACTED'
A body is sent with the request
Response status: 201
Response headers:
    'Cache-Control': 'no-cache'
    'Pragma': 'no-cache'
    'Content-Length': '4028'
    'Content-Type': 'application/json; charset=utf-8'
    'Expires': '-1'
    'Location': 'REDACTED'
    'x-ms-ratelimit-remaining-subscription-resource-requests': '24'
    'Request-Context': 'REDACTED'
    'x-ms-response-type': 'REDACTED'
    'Azure-AsyncOperation': 'REDACTED'
    'x-ms-async-operation-timeout': 'REDACTED'
    'Strict-Transport-Security': 'REDACTED'
    'X-Content-Type-Options': 'REDACTED'
    'x-aml-cluster': 'REDACTED'
    'x-request-time': 'REDACTED'
    'Server-Timing': 'REDACTED'
    'x-ms-request-id': '14f056ac-4c82-4570-a4d7-29b544a6bfe7'
    'x-ms-correlation-request-id': 'REDACTED'
    'x-ms-routing-request-id': 'REDACTED'
    'Date': 'Fri, 08 Jul 2022 22:20:36 GMT'
Creating/updating online deployment MY_AZML_ONLINE_EP_DEP_NAME Request URL: 'https://management.azure.com/subscriptions/MY_SUBS_ID/providers/Microsoft.MachineLearningServices/locations/westeurope/mfeOperationsStatus/od:3d462882-ab47-404e-9cc6-7331d898843c:8fa170f0-c961-4c65-8937-d54300000000?api-version=REDACTED'
Request method: 'GET'
Request headers:
    'x-ms-client-request-id': '2bce357c-ff0c-11ec-b617-1e008a0a5052'
    'User-Agent': 'azureml-cli-v2/2.5.0 azure-ai-ml/2.5.0 azsdk-python-mgmt-machinelearningservices/0.1.0 Python/3.10.5 (macOS-11.6.7-x86_64-i386-64bit)'
    'Authorization': 'REDACTED'
No body was attached to the request
Response status: 200
Response headers:
    'Cache-Control': 'no-cache'
    'Pragma': 'no-cache'
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json; charset=utf-8'
    'Content-Encoding': 'REDACTED'
    'Expires': '-1'
    'Vary': 'REDACTED'
    'x-ms-request-id': '3929776a-e6a9-451a-b31c-3e3bd5ea2485'
    'x-ms-ratelimit-remaining-subscription-reads': '11997'
    'Request-Context': 'REDACTED'
    'x-ms-response-type': 'REDACTED'
    'Strict-Transport-Security': 'REDACTED'
    'X-Content-Type-Options': 'REDACTED'
    'x-aml-cluster': 'REDACTED'
    'x-request-time': 'REDACTED'
    'Server-Timing': 'REDACTED'
    'x-ms-correlation-request-id': 'REDACTED'
    'x-ms-routing-request-id': 'REDACTED'
    'Date': 'Fri, 08 Jul 2022 22:20:41 GMT'
.Request URL: 'https://management.azure.com/subscriptions/MY_SUBS_ID/providers/Microsoft.MachineLearningServices/locations/westeurope/mfeOperationsStatus/od:3d462882-ab47-404e-9cc6-7331d898843c:8fa170f0-c961-4c65-8937-d54300000000?api-version=REDACTED'
Request method: 'GET'
Request headers:
    'x-ms-client-request-id': '2bce357c-ff0c-11ec-b617-1e008a0a5052'
    'User-Agent': 'azureml-cli-v2/2.5.0 azure-ai-ml/2.5.0 azsdk-python-mgmt-machinelearningservices/0.1.0 Python/3.10.5 (macOS-11.6.7-x86_64-i386-64bit)'
    'Authorization': 'REDACTED'
No body was attached to the request
Response status: 200
Response headers:
    'Cache-Control': 'no-cache'
    'Pragma': 'no-cache'
    'Transfer-Encoding': 'chunked'
    'Content-Type': 'application/json; charset=utf-8'
    'Content-Encoding': 'REDACTED'
    'Expires': '-1'
    'Vary': 'REDACTED'
    'x-ms-request-id': '75592e60-3955-4c0c-836e-4d0877be92f0'
    'x-ms-ratelimit-remaining-subscription-reads': '11996'
    'Request-Context': 'REDACTED'
    'x-ms-response-type': 'REDACTED'
    'Strict-Transport-Security': 'REDACTED'
    'X-Content-Type-Options': 'REDACTED'
    'x-aml-cluster': 'REDACTED'
    'x-request-time': 'REDACTED'
    'Server-Timing': 'REDACTED'
    'x-ms-correlation-request-id': 'REDACTED'
    'x-ms-routing-request-id': 'REDACTED'
    'Date': 'Fri, 08 Jul 2022 22:20:47 GMT'
.(InferencingClientCreateDeploymentFailed) InferencingClient HttpRequest error, error detail: Object reference not set to an instance of an object., internal stacktrace [   at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.ActionGrains.Validation.ComputeTargetValidationGrain.ValidateInstanceTypeAsync(String instanceTypeName, ResourceRequirements requirements, ComputeTargetData computeTargetData, CancellationToken token) in /mnt/vss/_work/1/s/src/Silo/Grains/ActionGrains/Validation/ComputeTargetValidationGrain.cs:line 86
   at Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.OrleansCodeGenComputeTargetValidationGrainMethodInvoker.Invoke(IAddressable grain, InvokeMethodRequest request) in /mnt/vss/_work/1/s/src/Silo/Interfaces/obj/Release/net6.0/Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.orleans.g.cs:line 1029
   at Orleans.Runtime.GrainMethodInvoker.Invoke()
   at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.FlowGrains.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /mnt/vss/_work/1/s/src/Silo/Grains/FlowGrains/ExceptionConversionFilter.cs:line 71]
Code: InferencingClientCreateDeploymentFailed
Message: InferencingClient HttpRequest error, error detail: Object reference not set to an instance of an object., internal stacktrace [   at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.ActionGrains.Validation.ComputeTargetValidationGrain.ValidateInstanceTypeAsync(String instanceTypeName, ResourceRequirements requirements, ComputeTargetData computeTargetData, CancellationToken token) in /mnt/vss/_work/1/s/src/Silo/Grains/ActionGrains/Validation/ComputeTargetValidationGrain.cs:line 86
   at Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.OrleansCodeGenComputeTargetValidationGrainMethodInvoker.Invoke(IAddressable grain, InvokeMethodRequest request) in /mnt/vss/_work/1/s/src/Silo/Interfaces/obj/Release/net6.0/Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.orleans.g.cs:line 1029
   at Orleans.Runtime.GrainMethodInvoker.Invoke()
   at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.FlowGrains.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /mnt/vss/_work/1/s/src/Silo/Grains/FlowGrains/ExceptionConversionFilter.cs:line 71]
Exception Details:      (InferencingClientCreateDeploymentFailed) InferencingClient HttpRequest error, error detail: Object reference not set to an instance of an object., internal stacktrace [   at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.ActionGrains.Validation.ComputeTargetValidationGrain.ValidateInstanceTypeAsync(String instanceTypeName, ResourceRequirements requirements, ComputeTargetData computeTargetData, CancellationToken token) in /mnt/vss/_work/1/s/src/Silo/Grains/ActionGrains/Validation/ComputeTargetValidationGrain.cs:line 86
           at Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.OrleansCodeGenComputeTargetValidationGrainMethodInvoker.Invoke(IAddressable grain, InvokeMethodRequest request) in /mnt/vss/_work/1/s/src/Silo/Interfaces/obj/Release/net6.0/Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.orleans.g.cs:line 1029
           at Orleans.Runtime.GrainMethodInvoker.Invoke()
           at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.FlowGrains.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /mnt/vss/_work/1/s/src/Silo/Grains/FlowGrains/ExceptionConversionFilter.cs:line 71]
        The build log is available in the workspace blob store "MY_AZML_WKS_STOR_LOG" under the path "/azureml/ImageLogs/8fa170f0-c961-4c65-8937-d54300000000/build.log"
        Code: InferencingClientCreateDeploymentFailed
        Message: InferencingClient HttpRequest error, error detail: Object reference not set to an instance of an object., internal stacktrace [   at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.ActionGrains.Validation.ComputeTargetValidationGrain.ValidateInstanceTypeAsync(String instanceTypeName, ResourceRequirements requirements, ComputeTargetData computeTargetData, CancellationToken token) in /mnt/vss/_work/1/s/src/Silo/Grains/ActionGrains/Validation/ComputeTargetValidationGrain.cs:line 86
           at Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.OrleansCodeGenComputeTargetValidationGrainMethodInvoker.Invoke(IAddressable grain, InvokeMethodRequest request) in /mnt/vss/_work/1/s/src/Silo/Interfaces/obj/Release/net6.0/Microsoft.MachineLearning.InferenceDeployment.Grains.Interfaces.orleans.g.cs:line 1029
           at Orleans.Runtime.GrainMethodInvoker.Invoke()
           at Microsoft.MachineLearning.InferenceDeployment.Silo.Grains.FlowGrains.ExceptionConversionFilter.Invoke(IIncomingGrainCallContext context) in /mnt/vss/_work/1/s/src/Silo/Grains/FlowGrains/ExceptionConversionFilter.cs:line 71]
        The build log is available in the workspace blob store "MY_AZML_WKS_STOR_LOG" under the path "/azureml/ImageLogs/8fa170f0-c961-4c65-8937-d54300000000/build.log"
Command ran in 19.810 seconds (init: 0.205, invoke: 19.605)
make: *** [x--do-deploy-endpoint--3sta--new] Error 1

joaocc commented 2 years ago

Setting the instance type to defaultinstancetype no longer throws an error...

joaocc commented 2 years ago

We are using the following instance type definition:

apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: MY_INST_TYPE_NAME
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
    nvidia.com/gpu.product: Tesla-T4

Deployment fails with the following MY_INST_TYPE_NAME values:

All of these names appear in the output of kubectl get instancetypes.

joaocc commented 2 years ago

As discussed with @Zhong-J, this was caused by the instance type being defined with only a nodeSelector, while it should also define resource requests. Thanks for the prompt response.
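
For anyone hitting the same error, here is a minimal sketch of what the corrected definition might look like. It assumes the `resources` block with `requests` and `limits` described in the AzureML Kubernetes instance type documentation. The nodeSelector labels are the ones from the definition earlier in this thread; the CPU, memory and GPU quantities are illustrative placeholders, not values from the actual cluster.

```yaml
apiVersion: amlarc.azureml.com/v1alpha1
kind: InstanceType
metadata:
  name: MY_INST_TYPE_NAME
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
    nvidia.com/gpu.product: Tesla-T4
  # The original definition had no resources section at all.
  # All quantities below are assumed example values; size them for your nodes.
  resources:
    requests:
      cpu: "1"
      memory: "4Gi"
    limits:
      cpu: "2"
      memory: "8Gi"
      nvidia.com/gpu: 1
```

After applying the updated definition with kubectl apply -f, the deployment can keep referencing it through instance_type in the deployment YAML.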