Closed jamesbannan closed 3 years ago
@jamesbannan Thank you for the feedback. We are actively investigating and will get back to you soon.
@PeterCLu Could we document this rule to be added to the NSG?
An update to this - the inbound port was a red herring. Turns out that the AML compute instances were attempting to talk to Azure Monitor. Adding this outbound rule resolved the issue (and deleting the inbound rule for TCP port 16384):
@jamesbannan Thanks for updating us. @PeterCLu It would be great to review the suggestion and update this rule in the screenshot for future updates to this document.
@jamesbannan, I just ran into this same issue. The outbound rule removed this error for me.
I've also noticed that my ML pipelines start up and finalize much more slowly than before. It seems like this issue is also related to Application Insights. I now am seeing this line of code:
2020/10/13 01:32:14 appinsightlogger.go:42: Time Out after 20 second retries for flushing the logs, doing another retry before exiting
show up in my 55_azureml-execution-tvmps_....txt
log. What can I do to fix this?
@bbertoni and @jamesbannan , thanks for your report.
@jhirono @aashishb, can you take a look at this? Are the proposed rules valid for official documentation? Can the fix be the cause of bbertoni's timeout issues above?
Thanks
@bbertoni and @jamesbannan, the product team is investigating this issue. We're working on understanding the issue and possible downstream effects like @bbertoni reported before we provide a doc fix.
Thanks for your patience
@jamesbannan, thanks for your patience. Our product team would like to reach out directly to figure out your use case and settings to get to the bottom of this before we move forward with product or doc changes. Could you reach out to me at peterlu@microsoft.com?
Also, if you could provide a screenshot of the exact error you were seeing, that would be very helpful.
Thanks so much!
@jamesbannan, since we haven't heard from you in a week, I'll proceed to #please-close this issue.
However, we still want to hear from you! Send me a message at the email above and I'll get you in contact with the product group to work through these issues and see if there a documentation or product resolution. Thanks so much for the report. I hope to hear back from you.
When configuring NSG rules on the subnet in which the AML compute targets are provisioned, if an Azure ML experiment has an error, the run cannot fail gracefully.
The run completes up to the error, and then gets stuck on the following command:
There are 10 attempts and then the count restarts at 1.
<IPADDRESS>
refers to a private IP address within the AML compute subnet. The run gets stuck and has to be cancelled.We resolved this by adding a new Inbound NSG rule with a
SourceServiceTag
ofAzureMachineLearning
and a destination port of16384
. Re-running an experiment with a deliberate error allowed the run to fail gracefully and release the assigned AML compute instance.Document Details
⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.