MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

NSG rules prevent ML pipelines from failing gracefully #63850

Closed: jamesbannan closed this issue 3 years ago

jamesbannan commented 4 years ago

When configuring NSG rules on the subnet in which the AML compute targets are provisioned, if an Azure ML experiment has an error, the run cannot fail gracefully.

The run completes up to the error, and then gets stuck repeating the following log output:

Attempt 1 of http call to http://<IPADDRESS>:16384/sendlogstoartifacts/info
Sending http request failed with error: Post http://<IPADDRESS>:16384/sendlogstoartifacts/info: dial tcp <IPADDRESS>:16384: connect: connection refused

There are 10 attempts and then the count restarts at 1. <IPADDRESS> refers to a private IP address within the AML compute subnet. The run gets stuck and has to be cancelled.

We resolved this by adding a new Inbound NSG rule with a SourceServiceTag of AzureMachineLearning and a destination port of 16384. Re-running an experiment with a deliberate error allowed the run to fail gracefully and release the assigned AML compute instance.
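For reference, the inbound rule described above can be sketched as an ARM-style security rule object. This is a minimal illustration, not the exact rule from the report: the rule name, the priority, and the `*` destination address are assumptions, and only the `AzureMachineLearning` source service tag, the TCP protocol, and port 16384 come from this thread. Note that a later comment in this thread reports that this inbound rule turned out to be unnecessary.

```python
import json


def make_inbound_rule(name, priority, source_tag, dest_port):
    """Build an NSG security rule as a plain dict using ARM-style property names."""
    return {
        "name": name,
        "properties": {
            "direction": "Inbound",
            "access": "Allow",
            "protocol": "Tcp",
            "priority": priority,
            # A service tag is used in place of an IP prefix.
            "sourceAddressPrefix": source_tag,
            "sourcePortRange": "*",
            # Assumption: any destination address inside the subnet.
            "destinationAddressPrefix": "*",
            "destinationPortRange": str(dest_port),
        },
    }


rule = make_inbound_rule(
    name="AllowAzureMLInbound16384",  # hypothetical rule name
    priority=200,                     # hypothetical priority
    source_tag="AzureMachineLearning",
    dest_port=16384,
)
print(json.dumps(rule, indent=2))
```

The dict mirrors the shape of a `securityRules` entry in an NSG template, so it can be compared against an exported NSG definition when checking which rules are actually in place.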

AnuragSharma-MSFT commented 4 years ago

@jamesbannan Thank you for the feedback. We are actively investigating and will get back to you soon.

RohitMungi-MSFT commented 4 years ago

@PeterCLu Could we document this rule to be added to the NSG?

jamesbannan commented 3 years ago

An update on this: the inbound port was a red herring. It turns out that the AML compute instances were attempting to talk to Azure Monitor. Adding this outbound rule, and deleting the inbound rule for TCP port 16384, resolved the issue:

[Image: screenshot of the outbound NSG rule]
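Since the screenshot's settings are not preserved in this thread, the following is only a hedged sketch of how an outbound rule of this kind might be expressed as an Azure CLI call. The resource group, NSG name, rule name, priority, the `VirtualNetwork` source, the `AzureMonitor` destination service tag, and port 443 are all assumptions for illustration and should be checked against the actual rule before use.

```python
import shlex


def outbound_rule_command(resource_group, nsg_name, rule_name,
                          priority, dest_tag, dest_port):
    """Return the argument list for `az network nsg rule create` (outbound allow rule)."""
    return [
        "az", "network", "nsg", "rule", "create",
        "--resource-group", resource_group,
        "--nsg-name", nsg_name,
        "--name", rule_name,
        "--priority", str(priority),
        "--direction", "Outbound",
        "--access", "Allow",
        "--protocol", "Tcp",
        "--source-address-prefixes", "VirtualNetwork",  # assumption
        "--source-port-ranges", "*",
        "--destination-address-prefixes", dest_tag,
        "--destination-port-ranges", str(dest_port),
    ]


cmd = outbound_rule_command(
    resource_group="my-resource-group",  # hypothetical
    nsg_name="my-aml-nsg",               # hypothetical
    rule_name="AllowAzureMonitorOutbound",  # hypothetical
    priority=210,                        # hypothetical
    dest_tag="AzureMonitor",             # assumption: service tag for Azure Monitor
    dest_port=443,                       # assumption: not stated in this thread
)
# Print a shell-safe command line for review; nothing is executed here.
print(shlex.join(cmd))
```

Building the command as a list and only printing it keeps the sketch free of side effects; the printed line can be reviewed and then run manually.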

RohitMungi-MSFT commented 3 years ago

@jamesbannan Thanks for updating us. @PeterCLu It would be great to review the suggestion and add the rule shown in the screenshot in a future update to this document.

bbertoni commented 3 years ago

@jamesbannan, I just ran into this same issue. The outbound rule removed this error for me.

I've also noticed that my ML pipelines start up and finalize much more slowly than before. It seems like this issue is also related to Application Insights. I am now seeing this log line:

2020/10/13 01:32:14 appinsightlogger.go:42: Time Out after 20 second retries for flushing the logs, doing another retry before exiting

show up in my 55_azureml-execution-tvmps_....txt log. What can I do to fix this?

PeterCLu commented 3 years ago

@bbertoni and @jamesbannan, thanks for your report.

@jhirono @aashishb, can you take a look at this? Are the proposed rules valid for official documentation? Could the fix be the cause of bbertoni's timeout issues above?

Thanks

PeterCLu commented 3 years ago

@bbertoni and @jamesbannan, the product team is investigating this issue. We're working on understanding the issue and possible downstream effects, like those @bbertoni reported, before we provide a doc fix.

Thanks for your patience

PeterCLu commented 3 years ago

@jamesbannan, thanks for your patience. Our product team would like to reach out directly to figure out your use case and settings to get to the bottom of this before we move forward with product or doc changes. Could you reach out to me at peterlu@microsoft.com?

Also, if you could provide a screenshot of the exact error you were seeing, that would be very helpful.

Thanks so much!

PeterCLu commented 3 years ago

@jamesbannan, since we haven't heard from you in a week, I'll proceed to #please-close this issue.

However, we still want to hear from you! Send me a message at the email above and I'll get you in contact with the product group to work through these issues and see if there is a documentation or product resolution. Thanks so much for the report. I hope to hear back from you.