Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License
4.37k stars 2.71k forks source link

configure_azure_monitor() takes abnormally long time #34902

Open greatvovan opened 3 months ago

greatvovan commented 3 months ago

Describe the bug

With default configuration (absence of configuration), time of execution of configure_azure_monitor() takes abnormally long time: ~10 seconds.

To Reproduce

long.py:

import time
from azure.monitor.opentelemetry import configure_azure_monitor

start = time.monotonic()
configure_azure_monitor()
print(time.monotonic() - start)

APPLICATIONINSIGHTS_CONNECTION_STRING="..." python long.py

Expected behavior

Reasonable time to configure (< 1 s).

Additional context

After running in debugger I discovered two main code places contributing to the delay, and both are related to checking the fact of running in an Azure VM.

  1. Resource detection.

Location: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/37aba928d45713842941c7efc992726a79ea7d8a/resource/opentelemetry-resource-detector-azure/src/opentelemetry/resource/detector/azure/vm.py#L77

The way code gets there:

image

Then in https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/resource/opentelemetry-resource-detector-azure/src/opentelemetry/resource/detector/azure/vm.py

image



  1. Statsbeat metrics

Location: https://github.com/Azure/azure-sdk-for-python/blob/a9b8513bb26ce0e9dc4eac15dcf897127bd2e954/sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/statsbeat/_statsbeat_metrics.py#L212-L215

Call stack:

image



In both cases the delay is related to requests to this endpoint:

http://169.254.169.254/metadata/instance/compute
though, to different API versions. The first place has request timeout of 4 seconds, and the second place has 5 seconds, which together constitute almost the entire time of the startup delay.

Workarounds

  1. Exclude Azure resource detectors with help of setting OTEL_EXPERIMENTAL_RESOURCE_DETECTORS=otel environment variable. If not set, the library sets the default value, that includes App Service and Azure VM.
  2. Disable sending statsbeat using APPLICATIONINSIGHTS_STATSBEAT_DISABLED_ALL=TRUE

The above tweaks bring the configuration time down to ~0.8 s (and with OTEL_PYTHON_DISABLED_INSTRUMENTATIONS set to azure_sdk,django,fastapi,flask,psycopg2,requests,urllib,urllib3 it completes under 30 ms).

It took me hours to find the above options for fixing the startup time without touching the code. I think we need to make the library friendlier to running in non-Azure environments.

github-actions[bot] commented 3 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jeremydvoss @lzchen.

lzchen commented 3 months ago

@greatvovan

Thanks a bunch for taking a look into this and providing insights. The azure monitor distro is not designed particular with performance in mind. It is a one-stop-shop telemetry solution for users to get all telemetry pillars and a select number of default instrumentations bundled together to make an uber telemetry solution, so we focus more on simplicity and functionality over performance. With that being said, we deliberately did not expose the OTEL_EXPERIMENTAL_RESOURCE_DETECTORS or APPLICATIONINSIGHTS_STATSBEAT_DISABLED_ALL because those are meant to either be a default value or not set to collect valuable diagnostic information in exchange for cpu processing in the background. As well, you may have gotten your config time down to ~0.8s by setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS but what is the purpose of a distro that does not automatically collect anything, even if it is fast during startup?

I think we need to be more specific on the problem statement, as in, what is the problem we are trying to solve and for what cohort of customers? I think it should be reworded as: for those customers not in Azure environments, we should not have wait for the azure meta data service that you have pointed out causes the application to wait for almost 10s. We probably should find a way to run this in parallel or during runtime in the background.

greatvovan commented 3 months ago

@lzchen thank you for your response.

I agree, there is no point to disable all that in the prod, after all, the startup time is usually not so critical (though controversial in many cases), those are rather dirty workarounds to remove annoying delays on developer's machine and possibly on other non-cloud deployments (note that disabling azure_vm resource detection seemingly has zero impact on functionality when run not on an Azure VM, but still saves 5 s of time).

Regarding statsbeat, the problem seems not in the statsbeat itself, but in the way it is initialized, because it is again trying to reach that metadata endpoint, after which it works without delays.

As for the cohort of customers, I think, in the first place those are Azure application developers, because before instrumentation makes its way to the cloud it must be coded on the developers's machine first (and this is my case). Having 10 s delay on startup is quite annoying tbh. Secondly, a category of customers could be those who are running the software on-prem but have other ties to Azure, and are willing to send logs to the cloud. Probably having a parameter disabling all Azure-bound tests would be enough.

lzchen commented 3 months ago

@greatvovan

I agree those are great points. We will investigate a way to optimize this so users will have a better experience.

lzchen commented 3 months ago

https://github.com/Azure/azure-sdk-for-python/pull/35039 and https://github.com/open-telemetry/opentelemetry-python-contrib/pull/2387 reduces the timeout to 200ms. This should alleviate some of the start up time in cases where users are not in an Azure vm.