Azure / azure-powershell

Microsoft Azure PowerShell

Scripts hang due to reporting usage data #8714

Closed zakimaksyutov closed 5 years ago

zakimaksyutov commented 5 years ago

Description

A few days ago I noticed that my deployment scripts (many Az commands) started hanging at random places. Using Fiddler, I learned that every Az command tries to upload usage data and sometimes hangs for 2 minutes (and gets a 500):

[Image: Fiddler capture]

It is not consistent.

Expected behavior

I'm fine with uploading usage data and helping the Azure PowerShell team improve the Az package. But uploading this information MUST NOT break the main flow.

Please use the -AsJob parameter for uploading usage data. Please also consider sending batches instead of making a call per request.

Workaround

Disable-AzDataCollection

After running this command, the deployment scripts stopped hanging.

cormacpayne commented 5 years ago

@zakimaksyutov thanks for filing this issue -- we will take a look and let you know if there's any additional information needed from your side

zakimaksyutov commented 5 years ago

I think you should use a fire-and-forget approach for this telemetry: send it over the network and return. Reporting usage data must not interfere with the main workload.

Right now usage reporting adds overhead to every command. And the usage telemetry endpoint might have quite a different SLA than the Azure ARM RP (all SDKs report to it asynchronously).
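
To make the fire-and-forget idea concrete, here is a minimal sketch in Python (chosen only for brevity; this is not how Azure PowerShell is implemented). FireAndForgetReporter, report, and slow_send are hypothetical names made up for illustration. The command path only enqueues an event and returns, while a background thread performs the upload and swallows any failure.

```python
import queue
import threading
import time


class FireAndForgetReporter:
    """Hypothetical reporter: queues usage events and uploads them off the main path."""

    def __init__(self, send, max_pending=1000):
        self._send = send  # callable that performs the (possibly slow) upload
        self._events = queue.Queue(maxsize=max_pending)
        # A daemon thread never keeps the process alive at exit.
        threading.Thread(target=self._drain, daemon=True).start()

    def report(self, event):
        """Enqueue and return immediately; drop the event if the queue is full."""
        try:
            self._events.put_nowait(event)
            return True
        except queue.Full:
            return False

    def _drain(self):
        while True:
            event = self._events.get()
            try:
                self._send(event)
            except Exception:
                pass  # telemetry failures must never reach the caller


def slow_send(event):
    """Stand-in for a slow ingestion endpoint that ends in an HTTP 500."""
    time.sleep(0.5)
    raise RuntimeError("HTTP 500")


reporter = FireAndForgetReporter(slow_send)
start = time.perf_counter()
reporter.report({"cmdlet": "Get-AzResourceGroup"})
elapsed = time.perf_counter() - start
print(f"report() returned in {elapsed:.4f}s")
```

With a daemon worker, events still queued when the process exits are lost. That is the usual trade-off for telemetry that must never delay the command.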

zakimaksyutov commented 5 years ago

Here is the ingestion latency for usage data for one of the healthy clusters:

[Image: ingestion latency chart]

Note that by reporting synchronously you easily add 1-2 seconds of latency on a regular basis. Plus, in many cases cmdlets will hang for much longer, up to 2 minutes.
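
As a rough back-of-the-envelope illustration of how this adds up over a script, the sketch below multiplies the 1-2 seconds per command by a command count. The count of 50 commands is a made-up script size, not a measurement, and added_latency_seconds is a hypothetical helper.

```python
def added_latency_seconds(commands, per_call_low=1.0, per_call_high=2.0):
    """Return the (low, high) extra wall-clock seconds from synchronous reporting."""
    return commands * per_call_low, commands * per_call_high


low, high = added_latency_seconds(50)  # 50 commands is an assumed script size
print(f"{low:.0f}-{high:.0f} s of added latency")  # prints "50-100 s of added latency"
```

This excludes the occasional hang of up to 2 minutes, which would come on top.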

The interesting part is that the reported usage telemetry makes it look like everything is good, when in fact the usage telemetry itself significantly degrades the customer experience.

The usage ingestion endpoint does NOT have an SLA for latency. This doesn't affect customers because all SDKs report data asynchronously, off the customer's main flow.

Azure PowerShell uses client.Flush() - this is clearly against the guideline that Flush must not be used on the main path.

Please consider sending data in fire and forget mode - either by implementing your own Telemetry Channel or by sending data directly using REST API without SDK.
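
One possible shape for such a channel, sketched in Python purely for illustration (it is not the Application Insights SDK API and not the Azure PowerShell implementation): events are buffered in memory, uploaded in batches from a background thread, and the only wait on the main path is an optional, time-bounded close. BatchingChannel, track, close, and all parameters are hypothetical names.

```python
import threading


class BatchingChannel:
    """Hypothetical channel: buffers events and uploads them in batches off the main path."""

    def __init__(self, send_batch, batch_size=10, interval=1.0):
        self._send_batch = send_batch  # callable that uploads a list of events
        self._batch_size = batch_size
        self._interval = interval
        self._buffer = []
        self._lock = threading.Lock()
        self._wake = threading.Event()
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def track(self, event):
        """Main-path call: appends to memory only, never touches the network."""
        with self._lock:
            self._buffer.append(event)
            full = len(self._buffer) >= self._batch_size
        if full:
            self._wake.set()  # ask the worker to upload early

    def _run(self):
        while not self._stop.is_set():
            self._wake.wait(self._interval)  # wake on a full batch or on the timer
            self._wake.clear()
            self._upload()
        self._upload()  # final attempt once close() has been requested

    def _upload(self):
        with self._lock:
            batch, self._buffer = self._buffer, []
        if batch:
            try:
                self._send_batch(batch)
            except Exception:
                pass  # a failed upload is dropped, never raised to the caller

    def close(self, timeout=0.5):
        """Best-effort drain: wait at most `timeout` seconds, then give up."""
        self._stop.set()
        self._wake.set()
        self._worker.join(timeout)


sent = []
channel = BatchingChannel(sent.append, batch_size=3, interval=60.0)
for name in ("Connect-AzAccount", "New-AzResourceGroup", "New-AzResourceGroupDeployment"):
    channel.track({"cmdlet": name})
channel.close(timeout=2.0)
print(len(sent), "upload(s) carrying", sum(len(b) for b in sent), "events")
```

The design choice here is that close() bounds the wait instead of blocking until the endpoint answers, so a slow or failing ingestion service can cost at most the timeout.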

markcowl commented 5 years ago

Closing as a duplicate of #9095, where we will track the new telemetry mechanism.