Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Example for training on local compute target does not work - run stuck on "Starting" #1554

Open ishouldbedany opened 3 years ago

ishouldbedany commented 3 years ago

Environment

Steps

  1. Followed the configuration notebook successfully to configure access to my AML workspace.
  2. Followed the train-on-local notebook and submitted the simplest run possible, using a user-managed environment (section 6.A, although the behaviour is similar on system and Docker based environments).
  3. The experiment starts successfully and no error is reported. The experiment is visible in the web UI.
  4. On inspection, the experiment stays permanently in a "Starting..." status. No outputs or logs are streamed, but the snapshot of the source directory is uploaded correctly.

  5. When attaching to the experiment using the CLI client in debug mode (az ml job stream --debug etc etc), no errors are reported and the output is as shown below:
urllib3.connectionpool: Starting new HTTPS connection (1): westeurope.experiments.azureml.net:443
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None
urllib3.connectionpool: https://westeurope.experiments.azureml.net:443 "GET /history/v1.0/subscriptions/[MY_SUBSCRIPTION_ID]/resourceGroups/[MY_WORKSPACE]/providers/Microsoft.MachineLearningServices/workspaces/[MY_WORKSPACE]/experiments/train-on-local/runs/train-on-local_1627036359_71cdae8a/details HTTP/1.1" 200 None

And it continues like this indefinitely. Every now and then there is a urllib3.connectionpool: Resetting dropped connection: westeurope.experiments.azureml.net entry in the log. Is this a problem?

Additional information

I wonder if there is a connection setting or firewall permission I am missing. I did not find any such information in the docs, and I can submit jobs to remote compute targets without problems. The behaviour is exactly the same when submitting jobs defined via a .yml file to a local compute target using the CLI (az ml job -f job.yml etc etc).

This seems like a very standard workflow (and a great advantage of AML) but it is completely broken for me.

Thanks for any help or pointers in the right direction.

diondrapeck commented 3 years ago

Hi @ishouldbedany - Have you tried to run the example again since creating this issue? Oftentimes these sorts of logs indicate a transient network issue.

ishouldbedany commented 3 years ago

Hi @diondrapeck, thanks for dropping by. I have tried to reproduce this again and the behaviour is totally different - it seems to work now. When I ran the original tests, I checked the Azure status page and everything was green for my region at the time, which is why I ruled out a problem on Microsoft's side.

Any other place I should check when a situation like this arises? Maybe a more specific error message?
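For reference, one way to notice a stuck run early is to poll its status with a timeout rather than waiting indefinitely. The sketch below is only an illustration: `wait_until_started` is a hypothetical helper, and the list of "pending" status strings is an assumption about what the service reports. With azureml-core, the bound method `run.get_status` could be passed as `get_status`.

```python
import time
from typing import Callable, Iterable


def wait_until_started(
    get_status: Callable[[], str],
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    # Assumed set of "not yet running" states; adjust to what your runs report.
    pending: Iterable[str] = ("NotStarted", "Starting", "Queued", "Preparing"),
) -> str:
    """Poll get_status() until the run leaves a pending state or the timeout expires."""
    pending = set(pending)
    deadline = time.monotonic() + timeout_s
    status = get_status()
    while status in pending:
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still in {status!r} after {timeout_s} s")
        time.sleep(poll_s)
        status = get_status()
    return status
```

A `TimeoutError` here does not say why the run is stuck, but it gives a clear point at which to start looking at the network or the snapshot size instead of watching the polling log.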

diondrapeck commented 3 years ago

@ishouldbedany - I'm glad you were able to get the sample working.

I've seen that error happen for a few reasons. One is if too many requests are being sent over the connection, but as you were using a simple example notebook, I think we can rule that out. Another possibility is your local connection; if you're on a wired connection trying to run locally as opposed to connecting via wifi, this can happen.

Unfortunately, urllib3 isn't very descriptive in its log messages, so there's no foolproof way to determine which of these (if either) is the cause; it comes down to the usual debugging practices.
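As one such debugging practice, when running from Python rather than the CLI, the urllib3 logger can be switched to DEBUG with the standard `logging` module and written to a file with timestamps, which makes it easier to see how often connections are reset. This is a minimal sketch; the helper name and the default log file name are placeholders, not part of any SDK.

```python
import logging


def enable_urllib3_debug_logging(log_path: str = "urllib3_debug.log") -> logging.Logger:
    """Send urllib3 debug messages (including connection resets) to a timestamped log file."""
    logger = logging.getLogger("urllib3")
    logger.setLevel(logging.DEBUG)
    handler = logging.FileHandler(log_path)
    # Timestamps let you correlate "Resetting dropped connection" lines with other events.
    handler.setFormatter(logging.Formatter("%(asctime)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    return logger
```

Calling `enable_urllib3_debug_logging()` before submitting the run should capture the same kind of `urllib3.connectionpool` lines shown earlier in this thread, with a time attached to each one.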

diondrapeck commented 3 years ago

@v-strudm-msft please close

ishouldbedany commented 3 years ago

@diondrapeck In case it helps, I was indeed on a wired connection and trying to run local processing. I should have included that detail in my initial bug report, my bad.

diondrapeck commented 3 years ago

@ishouldbedany - great! I'm glad we were able to narrow it down.

potipot commented 3 years ago

I ran into the same issue, with urllib debug info similar to yours. I figured I might be trying to send and then receive too much data. Modifying my .gitignore file to exclude wandb logs and checkpoints seems to have solved the issue.
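To check whether the source directory snapshot is unexpectedly large, its size can be estimated with and without the ignored entries. The sketch below is a rough approximation: `snapshot_size_bytes` is a hypothetical helper, and it matches patterns against file and directory names only, which is much simpler than real .gitignore semantics.

```python
import fnmatch
import os


def snapshot_size_bytes(root, ignore_patterns=()):
    """Sum file sizes under root, skipping names that match any ignore pattern."""

    def ignored(name):
        return any(fnmatch.fnmatch(name, pattern) for pattern in ignore_patterns)

    total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if not ignored(d)]
        for name in filenames:
            if not ignored(name):
                total += os.path.getsize(os.path.join(dirpath, name))
    return total
```

For example, comparing `snapshot_size_bytes(".")` with `snapshot_size_bytes(".", ["wandb", "*.ckpt"])` gives a rough idea of how much data the ignore entries keep out of the upload (the patterns here are illustrative).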