Open greatvovan opened 3 months ago
Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jeremydvoss @lzchen.
@greatvovan
Thanks a bunch for taking a look into this and providing insights. The azure monitor distro is not designed particular with performance in mind. It is a one-stop-shop telemetry solution for users to get all telemetry pillars and a select number of default instrumentations bundled together to make an uber telemetry solution, so we focus more on simplicity and functionality over performance. With that being said, we deliberately did not expose the OTEL_EXPERIMENTAL_RESOURCE_DETECTORS
or APPLICATIONINSIGHTS_STATSBEAT_DISABLED_ALL
because those are meant to either be a default value or not set to collect valuable diagnostic information in exchange for cpu processing in the background. As well, you may have gotten your config time down to ~0.8s by setting OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
but what is the purpose of a distro that does not automatically collect anything, even if it is fast during startup?
I think we need to be more specific on the problem statement, as in, what is the problem we are trying to solve and for what cohort of customers? I think it should be reworded as: for those customers not in Azure environments, we should not have wait for the azure meta data service that you have pointed out causes the application to wait for almost 10s. We probably should find a way to run this in parallel or during runtime in the background.
@lzchen thank you for your response.
I agree, there is no point to disable all that in the prod, after all, the startup time is usually not so critical (though controversial in many cases), those are rather dirty workarounds to remove annoying delays on developer's machine and possibly on other non-cloud deployments (note that disabling azure_vm
resource detection seemingly has zero impact on functionality when run not on an Azure VM, but still saves 5 s of time).
Regarding statsbeat, the problem seems not in the statsbeat itself, but in the way it is initialized, because it is again trying to reach that metadata endpoint, after which it works without delays.
As for the cohort of customers, I think, in the first place those are Azure application developers, because before instrumentation makes its way to the cloud it must be coded on the developers's machine first (and this is my case). Having 10 s delay on startup is quite annoying tbh. Secondly, a category of customers could be those who are running the software on-prem but have other ties to Azure, and are willing to send logs to the cloud. Probably having a parameter disabling all Azure-bound tests would be enough.
@greatvovan
I agree those are great points. We will investigate a way to optimize this so users will have a better experience.
https://github.com/Azure/azure-sdk-for-python/pull/35039 and https://github.com/open-telemetry/opentelemetry-python-contrib/pull/2387 reduces the timeout to 200ms. This should alleviate some of the start up time in cases where users are not in an Azure vm.
Describe the bug
With default configuration (absence of configuration), time of execution of
configure_azure_monitor()
takes abnormally long time: ~10 seconds.To Reproduce
long.py:
APPLICATIONINSIGHTS_CONNECTION_STRING="..." python long.py
Expected behavior
Reasonable time to configure (< 1 s).
Additional context
After running in debugger I discovered two main code places contributing to the delay, and both are related to checking the fact of running in an Azure VM.
Location: https://github.com/open-telemetry/opentelemetry-python-contrib/blob/37aba928d45713842941c7efc992726a79ea7d8a/resource/opentelemetry-resource-detector-azure/src/opentelemetry/resource/detector/azure/vm.py#L77
The way code gets there:
Then in https://github.com/open-telemetry/opentelemetry-python-contrib/blob/main/resource/opentelemetry-resource-detector-azure/src/opentelemetry/resource/detector/azure/vm.py
Location: https://github.com/Azure/azure-sdk-for-python/blob/a9b8513bb26ce0e9dc4eac15dcf897127bd2e954/sdk/monitor/azure-monitor-opentelemetry-exporter/azure/monitor/opentelemetry/exporter/statsbeat/_statsbeat_metrics.py#L212-L215
Call stack:
In both cases the delay is related to requests to this endpoint:
http://169.254.169.254/metadata/instance/compute
though, to different API versions. The first place has request timeout of 4 seconds, and the second place has 5 seconds, which together constitute almost the entire time of the startup delay.
Workarounds
OTEL_EXPERIMENTAL_RESOURCE_DETECTORS=otel
environment variable. If not set, the library sets the default value, that includes App Service and Azure VM.APPLICATIONINSIGHTS_STATSBEAT_DISABLED_ALL=TRUE
The above tweaks bring the configuration time down to ~0.8 s (and with
OTEL_PYTHON_DISABLED_INSTRUMENTATIONS
set toazure_sdk,django,fastapi,flask,psycopg2,requests,urllib,urllib3
it completes under 30 ms).It took me hours to find the above options for fixing the startup time without touching the code. I think we need to make the library friendlier to running in non-Azure environments.