Azure / azure-cli-extensions

Public Repository for Extensions of Azure CLI.
https://docs.microsoft.com/en-us/cli/azure
MIT License

Unclear error in AzureML Compute job #3574

Open buildgreatthings opened 3 years ago

buildgreatthings commented 3 years ago

Extension name (the extension in question)

I am using the Azure ML 2.0 CLI.

Description of issue (in as much detail as possible)

I built a container, with my code and artifacts in place, and uploaded it to ACR. I can run the container locally and get it to start successfully. The entry point starts from /opt/main.py. However, when I run it in Azure ML, it abruptly fails with the status below.

AzureMLCompute job failed.
JobFailed: Submitted script failed with a non-zero exit code; see the driver log file for details.
Reason: Job failed with non-zero exit Code

My YAML that I am launching the CLI with is as follows:

    $schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
    command: >-
      python /opt/main.py --set
      training-data={inputs.training}
      validation-data={inputs.validation}
      testing-data={inputs.testing}
    inputs:
      training:
        data:
          path: https://****.blob.core.windows.net/modeldata/data1.csv
        mode: mount
      validation:
        data:
          path: https://****.blob.core.windows.net/modeldata/data2.csv
        mode: mount
      testing:
        data:
          path: https://****.blob.core.windows.net/modeldata/data3.csv
        mode: mount
    environment:
      name: sklearn-train
      version: 1
      docker:
        image: ****.azurecr.io/****/train-image:latest
    compute:
      target: azureml:/subscriptions/c24881db-d5d4-482e-b582-d17f74d863ef/resourceGroups/****/providers/Microsoft.MachineLearningServices/workspaces/****/computes/****-test
    experiment_name: sk_learn
    description: Train a sklearn model
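For context, the command in the YAML above passes the three data paths as key=value pairs after a single --set flag. The contents of /opt/main.py are not shown in this issue, so the following is only a minimal sketch of how such arguments could be parsed with argparse; the function name parse_set_args and the example paths are hypothetical.

```python
import argparse


def parse_set_args(argv):
    """Parse `--set key=value [key=value ...]` into a dict.

    Hypothetical helper: the real /opt/main.py is not shown in this issue.
    """
    parser = argparse.ArgumentParser(description="Sketch of a --set KEY=VALUE parser")
    parser.add_argument(
        "--set",
        dest="settings",
        nargs="+",
        default=[],
        metavar="KEY=VALUE",
        help="one or more key=value pairs, e.g. training-data=/path/to/file.csv",
    )
    args = parser.parse_args(argv)

    settings = {}
    for pair in args.settings:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = pair.partition("=")
        if not sep or not key:
            parser.error(f"expected KEY=VALUE, got {pair!r}")
        settings[key] = value
    return settings


if __name__ == "__main__":
    import sys

    print(parse_set_args(sys.argv[1:]))
```

With a parser like this, any unexpected positional argument makes argparse exit with a non-zero code, which is one way a script can fail before it writes anything useful to the logs.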

When I go to the driver logs, this is all that's there.

2021/07/01 23:00:04 Starting App Insight Logger for task: runTaskLet
2021/07/01 23:00:04 Version: 3.0.01632.0003 Branch: .SourceBranch Commit: 4b96fb0
2021/07/01 23:00:04 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/info
2021/07/01 23:00:04 Attempt 1 of http call to http://10.0.0.5:16384/sendlogstoartifacts/status
[2021-07-01T23:00:04.789779] Entering context manager injector.


ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

Issue Details
Author: awcchungster
Assignees: -
Labels: `Machine Learning`, `Service Attention`, `extension/ml`
Milestone: -
yonzhan commented 3 years ago

route to service team

banibrata commented 3 years ago

@awcchungster, besides the driver logs, could you please share all the other logs?

buildgreatthings commented 3 years ago

I found a workaround, which is to remove the entry point designation when I build my container. Is there any reason why Azure ML runtime breaks when an entry point is set?

buildgreatthings commented 3 years ago

Previously, I had this line in the Dockerfile. When I remove this line and rebuild, the error goes away.

ENTRYPOINT ["python", "/opt/main.py"]
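One possible explanation, not confirmed anywhere in this thread: with an exec-form ENTRYPOINT, Docker appends whatever command the container is started with to the entrypoint instead of replacing it. If the job runtime starts the container with its own command (an assumption about Azure ML's behavior), the script would receive that command as extra arguments. The sketch below only models Docker's documented composition rule; effective_argv is a hypothetical helper, not part of any Docker or Azure ML API.

```python
def effective_argv(entrypoint, cmd, run_args=None):
    """Model how Docker builds the process argv for exec-form instructions.

    entrypoint: the image's ENTRYPOINT list (empty list if none is set)
    cmd:        the image's CMD list (empty list if none is set)
    run_args:   command supplied when the container is started; it
                overrides CMD but is still appended to ENTRYPOINT
    """
    args = list(run_args) if run_args else list(cmd)
    return list(entrypoint) + args


# Command taken from the job YAML, with an illustrative (made-up) mount path.
job_command = ["python", "/opt/main.py", "--set", "training-data=/mnt/data1.csv"]

# Image built with ENTRYPOINT ["python", "/opt/main.py"]:
with_entrypoint = effective_argv(["python", "/opt/main.py"], [], job_command)
print(with_entrypoint)

# Image built without an ENTRYPOINT (the workaround described above):
without_entrypoint = effective_argv([], [], job_command)
print(without_entrypoint)
```

In the first case the script would be invoked as python /opt/main.py python /opt/main.py --set ..., so it would see "python" and "/opt/main.py" as unexpected arguments. That would be consistent with a non-zero exit code, but only the full logs could confirm it.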

banibrata commented 3 years ago

Regarding "Is there any reason why Azure ML runtime breaks when an entry point is set?": that needs a bit of analysis, so I need all the log files from the previous failed run.

buildgreatthings commented 3 years ago

What's your @microsoft email? I can forward them to you.

banibrata commented 3 years ago

banibrata.de@microsoft.com

ghost commented 3 years ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @lostmygithubaccount.

Issue Details
Author: awcchungster
Assignees: -
Labels: `ADO`, `ML-MLOps`, `Machine Learning`, `Service Attention`, `bug`, `extension/ml`
Milestone: -