Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.

azure-monitor-opentelemetry configure_azure_monitor function is extremely slow #33671

Closed. RobertoPrevato closed this issue 8 months ago.

RobertoPrevato commented 8 months ago

Describe the bug The configure_azure_monitor function, which configures OpenTelemetry for an Azure Application Insights instance, is surprisingly slow. I am trying to upgrade code that still uses the old OpenCensus libraries: the old OpenCensus libraries for Python and Azure App Insights still work perfectly fine, but the new OpenTelemetry package adds a ~16-second delay to application startup.
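For comparison, the OpenCensus-based setup being migrated away from looks roughly like this (a minimal sketch using opencensus-ext-azure's AzureLogHandler; the connection string is a placeholder and this snippet is not from the issue itself):

# old_main.py - minimal OpenCensus setup, shown only for comparison
import logging

from opencensus.ext.azure.log_exporter import AzureLogHandler

logger = logging.getLogger(__name__)

# AzureLogHandler ships log records to Application Insights in the background.
logger.addHandler(
    AzureLogHandler(connection_string="<YOUR_APP_INSIGHTS_CONN_STRING>")
)

logger.warning("Hello, World!")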

To Reproduce Steps to reproduce the behavior:

  1. Use the example from learn.microsoft.com for using OpenTelemetry with Python, with a valid connection string for an Application Insights resource
  2. Run the example
  3. Profile the example with python -m cProfile main.py
# main.py
CONN_STRING = "<YOUR_APP_INSIGHTS_CONN_STRING>"

# Import the `configure_azure_monitor()` function from the
# `azure.monitor.opentelemetry` package.
from azure.monitor.opentelemetry import configure_azure_monitor

# Import the tracing api from the `opentelemetry` package.
from opentelemetry import trace

# Configure OpenTelemetry to use Azure Monitor with the specified connection
# string.
configure_azure_monitor(
    connection_string=CONN_STRING,
)

# Get a tracer for the current module.
tracer = trace.get_tracer(__name__)

# Start a new span with the name "hello". This also sets this created span as the current span in this context. This span will be exported to Azure Monitor as part of the trace.
with tracer.start_as_current_span("hello"):
    print("Hello, World!")

Expected behavior Configuring OpenTelemetry logging should not add a delay of several seconds to startup; it should be as fast as the older OpenCensus library (opencensus-ext-azure==1.1.12).

Additional context These are the first lines of cProfile output:

         1181612 function calls (1131971 primitive calls) in 16.248 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    697/1    0.018    0.000   16.251   16.251 {built-in method builtins.exec}
        1    0.000    0.000   16.251   16.251 main.py:1(<module>)
        1    0.000    0.000   15.735   15.735 _configure.py:63(configure_azure_monitor)
        4    0.001    0.000   11.300    2.825 __init__.py:158(create)
        4    0.000    0.000   11.104    2.776 __init__.py:372(get_aggregated_resources)
      149   11.102    0.075   11.102    0.075 {method 'acquire' of '_thread.lock' objects}
       28    0.000    0.000   11.101    0.396 threading.py:288(wait)
       12    0.000    0.000   11.096    0.925 _base.py:428(result)
        1    0.000    0.000   10.262   10.262 _configure.py:113(_setup_tracing)
      4/3    0.001    0.000    7.151    2.384 _base.py:74(__init__)
        3    0.000    0.000    7.126    2.375 _statsbeat.py:41(collect_statsbeat_metrics)
        2    0.000    0.000    6.117    3.059 __init__.py:339(__init__)
        1    0.000    0.000    4.069    4.069 __init__.py:384(force_flush)
        1    0.000    0.000    4.069    4.069 __init__.py:551(force_flush)
        1    0.000    0.000    4.069    4.069 __init__.py:357(force_flush)
        1    0.000    0.000    4.069    4.069 __init__.py:306(collect)
        2    0.000    0.000    4.042    2.021 sessions.py:502(request)
        2    0.000    0.000    4.029    2.015 sessions.py:673(send)
        2    0.000    0.000    4.028    2.014 adapters.py:434(send)
        2    0.000    0.000    4.025    2.013 connectionpool.py:595(urlopen)
        2    0.000    0.000    4.023    2.012 connectionpool.py:380(_make_request)
        2    0.000    0.000    3.303    1.652 connection.py:197(_new_conn)
        2    0.000    0.000    3.303    1.652 connection.py:27(create_connection)
        2    3.281    1.641    3.281    1.641 {method 'connect' of '_socket.socket' objects}
        1    0.000    0.000    3.115    3.115 __init__.py:1129(__init__)
        1    0.000    0.000    3.068    3.068 _configure.py:141(_setup_metrics)
        1    0.000    0.000    3.068    3.068 measurement_consumer.py:93(collect)
        4    0.000    0.000    3.066    0.767 instrument.py:127(callback)
        1    0.000    0.000    3.066    3.066 _statsbeat_metrics.py:144(_get_attach_metric)
        1    0.000    0.000    3.066    3.066 _statsbeat_metrics.py:183(_get_azure_compute_metadata)
        1    0.000    0.000    3.066    3.066 api.py:62(get)
        1    0.000    0.000    3.066    3.066 api.py:14(request)
        2    0.000    0.000    3.052    1.526 connection.py:319(request)
        3    0.000    0.000    3.050    1.017 client.py:967(send)
        2    0.000    0.000    3.050    1.525 client.py:1266(endheaders)
        2    0.000    0.000    3.050    1.525 client.py:1028(_send_output)
        1    0.000    0.000    3.050    3.050 connection.py:242(connect)
      110    0.003    0.000    2.250    0.020 __init__.py:1(<module>)
        1    0.000    0.000    2.072    2.072 _configure.py:128(_setup_logging)
        1    0.000    0.000    2.069    2.069 __init__.py:583(__init__)
        1    0.000    0.000    1.001    1.001 __init__.py:516(_receive_metrics)
        1    0.000    0.000    0.989    0.989 _exporter.py:70(export)
        1    0.000    0.000    0.988    0.988 _base.py:168(_transmit)
        1    0.000    0.000    0.988    0.988 _azure_monitor_client_operations.py:57(track)
        1    0.000    0.000    0.978    0.978 _base.py:218(run)
      8/1    0.000    0.000    0.978    0.978 _base.py:76(send)
        1    0.000    0.000    0.977    0.977 _redirect.py:183(send)
        1    0.000    0.000    0.977    0.977 _retry.py:502(send)
        1    0.000    0.000    0.977    0.977 _base.py:108(send)
        1    0.000    0.000    0.977    0.977 _requests_basic.py:316(send)
        1    0.000    0.000    0.778    0.778 connectionpool.py:1088(_validate_conn)
        1    0.000    0.000    0.778    0.778 connection.py:609(connect)
   645/20    0.004    0.000    0.537    0.027 <frozen importlib._bootstrap>:1167(_find_and_load)
   634/10    0.003    0.000    0.536    0.054 <frozen importlib._bootstrap>:1122(_find_and_load_unlocked)
   595/15    0.003    0.000    0.533    0.036 <frozen importlib._bootstrap>:666(_load_unlocked)
   542/10    0.001    0.000    0.532    0.053 <frozen importlib._bootstrap_external>:934(exec_module)
  1335/22    0.001    0.000    0.531    0.024 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
        1    0.000    0.000    0.524    0.524 connection.py:708(_ssl_wrap_socket_and_match_hostname)

Calls to lock.acquire alone account for 11 seconds of the delay, across 149 calls:

149 11.102 0.075 11.102 0.075 {method 'acquire' of '_thread.lock' objects}

  File "/usr/local/lib/python3.11/threading.py", line 1132, in _wait_for_tstate_lock
    if lock.acquire(block, timeout):
RobertoPrevato commented 8 months ago

Note: I created the issue in this repository because the issue template of https://github.com/microsoft/ApplicationInsights-Python says to create issues here, with the tag monitor - exporter.

xiangyan99 commented 8 months ago

Thanks for the feedback, we’ll investigate asap.

lzchen commented 8 months ago

@RobertoPrevato

Could you set your environment variable OTEL_EXPERIMENTAL_RESOURCE_DETECTORS to azure_app_service and see if that helps with the startup time?
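For reference, the variable can also be set from Python before calling configure_azure_monitor; a minimal sketch, assuming the detector list is read from the environment at configuration time:

# sketch: restrict OpenTelemetry resource detection before configuring Azure Monitor
import os

# Must be set before configure_azure_monitor() runs, because resource detection
# happens during configuration.
os.environ["OTEL_EXPERIMENTAL_RESOURCE_DETECTORS"] = "azure_app_service"

from azure.monitor.opentelemetry import configure_azure_monitor

configure_azure_monitor(connection_string="<YOUR_APP_INSIGHTS_CONN_STRING>")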

RobertoPrevato commented 8 months ago

Hi @lzchen thank you for following this issue.

I tried what you recommended three times, and the situation improves considerably, although there is still a ~5-second delay. By the way, since I see that web requests are happening: I can confirm that I have a good internet connection.

Running in Bash with: OTEL_EXPERIMENTAL_RESOURCE_DETECTORS="azure_app_service" python -m cProfile main.py

(Only the first lines are pasted below.)

Hello, World!
         1122587 function calls (1073027 primitive calls) in 4.908 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    696/1    0.017    0.000    4.910    4.910 {built-in method builtins.exec}
        1    0.000    0.000    4.910    4.910 main.py:1(<module>)
        1    0.000    0.000    4.416    4.416 _configure.py:63(configure_azure_monitor)
        1    0.000    0.000    4.065    4.065 _configure.py:113(_setup_tracing)
      4/3    0.000    0.000    4.046    1.349 _base.py:74(__init__)
        3    0.000    0.000    4.041    1.347 _statsbeat.py:41(collect_statsbeat_metrics)
        1    0.000    0.000    4.023    4.023 __init__.py:384(force_flush)
        1    0.000    0.000    4.023    4.023 __init__.py:551(force_flush)
        1    0.000    0.000    4.023    4.023 __init__.py:357(force_flush)
        1    0.000    0.000    4.023    4.023 __init__.py:306(collect)
        2    0.000    0.000    3.995    1.997 sessions.py:502(request)
        2    0.000    0.000    3.990    1.995 sessions.py:673(send)
        2    0.000    0.000    3.989    1.995 adapters.py:434(send)
        2    0.000    0.000    3.988    1.994 connectionpool.py:595(urlopen)
        2    0.000    0.000    3.987    1.993 connectionpool.py:380(_make_request)
        2    0.000    0.000    3.278    1.639 connection.py:197(_new_conn)
        2    0.000    0.000    3.275    1.637 connection.py:27(create_connection)
        2    3.260    1.630    3.260    1.630 {method 'connect' of '_socket.socket' objects}
        1    0.000    0.000    3.072    3.072 measurement_consumer.py:93(collect)
        3    0.000    0.000    3.071    1.024 instrument.py:127(callback)
        1    0.000    0.000    3.071    3.071 _statsbeat_metrics.py:144(_get_attach_metric)
        1    0.000    0.000    3.071    3.071 _statsbeat_metrics.py:183(_get_azure_compute_metadata)
        1    0.000    0.000    3.071    3.071 api.py:62(get)
        1    0.000    0.000    3.071    3.071 api.py:14(request)
        2    0.000    0.000    3.067    1.534 connection.py:319(request)
        2    0.000    0.000    3.067    1.533 client.py:1266(endheaders)
        3    0.000    0.000    3.067    1.022 client.py:967(send)
        2    0.000    0.000    3.067    1.533 client.py:1028(_send_output)
        1    0.000    0.000    3.067    3.067 connection.py:242(connect)
      110    0.003    0.000    2.162    0.020 __init__.py:1(<module>)
        1    0.000    0.000    0.951    0.951 __init__.py:516(_receive_metrics)
        1    0.000    0.000    0.937    0.937 _exporter.py:70(export)
        1    0.000    0.000    0.937    0.937 _base.py:168(_transmit)
        1    0.000    0.000    0.937    0.937 _azure_monitor_client_operations.py:57(track)
        1    0.000    0.000    0.925    0.925 _base.py:218(run)
      8/1    0.000    0.000    0.925    0.925 _base.py:76(send)
        1    0.000    0.000    0.925    0.925 _redirect.py:183(send)
        1    0.000    0.000    0.925    0.925 _retry.py:502(send)
        1    0.000    0.000    0.924    0.924 _base.py:108(send)
        1    0.000    0.000    0.924    0.924 _requests_basic.py:316(send)
        1    0.000    0.000    0.692    0.692 connectionpool.py:1088(_validate_conn)
        1    0.000    0.000    0.692    0.692 connection.py:609(connect)
   641/16    0.004    0.000    0.503    0.031 <frozen importlib._bootstrap>:1167(_find_and_load)
    633/9    0.003    0.000    0.503    0.056 <frozen importlib._bootstrap>:1122(_find_and_load_unlocked)
   594/14    0.003    0.000    0.500    0.036 <frozen importlib._bootstrap>:666(_load_unlocked)
  1333/20    0.001    0.000    0.500    0.025 <frozen importlib._bootstrap>:233(_call_with_frames_removed)
    541/9    0.001    0.000    0.500    0.056 <frozen importlib._bootstrap_external>:934(exec_module)
        1    0.000    0.000    0.492    0.492 _configure.py:1(<module>)
        1    0.000    0.000    0.481    0.481 connection.py:708(_ssl_wrap_socket_and_match_hostname)

Could you please try this as well, to confirm the issue with the startup time?

lzchen commented 8 months ago

@RobertoPrevato

Couple of things to note here. The distro package you are using, azure-monitor-opentelemetry, is a one-stop-shop telemetry solution that provides automatic collection of logs, distributed traces, and metrics for Application Insights. It is meant to be comprehensive rather than a highly configurable, "specific set of features" SDK: it covers the most common telemetry needs, and it may run components that you do not need or care about in your specific use case, which in turn may make the startup time feel excessive. If you want a more piece-wise approach in which you configure exactly which components you use, I recommend instrumenting with azure-monitor-opentelemetry-exporter, which the distro is built on top of and which gives you more fine-grained control over components.

On a side note, the components that both SDKs are built on top of are designed to be non-blocking. Telemetry export happens on a separate thread from the main thread, so it is interesting that the force_flush calls are stopping the application from starting up. Are you able to run the same diagnostics on a standalone Python application, not hosted in App Service, and see if the behavior is the same?
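For illustration, a minimal tracing-only setup built directly on the exporter package could look roughly like this (a sketch, assuming the AzureMonitorTraceExporter class from azure-monitor-opentelemetry-exporter; it skips the distro's logging and metrics pipelines entirely):

# sketch: tracing-only configuration with the exporter package instead of the distro
import os

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter

exporter = AzureMonitorTraceExporter(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

# Spans are batched and exported on a background thread.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("hello"):
    print("Hello, World!")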

RobertoPrevato commented 8 months ago

@lzchen Thank you for the additional context. I am running that code from my local development machine, Ubuntu 22.04. If you run the same example from this learn.microsoft.com page, you should get the same results.

I tried the second approach you recommended and ran the Export Hello World Log example here. Unfortunately I get the same result: ~4.6 seconds of overhead at startup.

I had to slightly modify the example's imports because of this, so it ends up looking like this:

"""
An example to show an application using Opentelemetry logging sdk. Logging calls to the
standard Python logging library are tracked and telemetry is exported to application
insights with the AzureMonitorLogExporter.
"""
import os
import logging

from opentelemetry.sdk._logs import (
    LoggerProvider,
    LoggingHandler,
)
from opentelemetry._logs import set_logger_provider
from opentelemetry.sdk._logs.export import BatchLogRecordProcessor

from azure.monitor.opentelemetry.exporter import AzureMonitorLogExporter

logger_provider = LoggerProvider()
set_logger_provider(logger_provider)

exporter = AzureMonitorLogExporter(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

logger_provider.add_log_record_processor(BatchLogRecordProcessor(exporter))
handler = LoggingHandler()

# Attach LoggingHandler to root logger
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.NOTSET)

logger = logging.getLogger(__name__)

logger.warning("Hello World!")

# Telemetry records are flushed automatically upon application exit
# If you would like to flush records manually yourself, you can call force_flush()
logger_provider.force_flush()

I will try to investigate a bit more.

RobertoPrevato commented 8 months ago

I verified that this (also) depends on the Azure datacenter location. I am located close to Warsaw, Poland. For some reason, I get a ~5-second delay when I use a connection string for an Azure Application Insights instance provisioned in the new datacenter in Poland. The delay is ~3.5 seconds when I use an Azure Application Insights instance provisioned in the West Europe datacenter (in the Netherlands).

I also verified that the same happens when I use the older library. I erroneously thought the bigger delay was caused by using a different Python library; it was actually related to using a different Azure datacenter (which is, paradoxically, geographically much closer to me).


I still think something is wrong in the design of the library, though, as I initially got a 16-second delay with default settings, without using experimental features, while following the instructions on learn.microsoft.com.
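For anyone who wants to check whether datacenter network latency is the dominant factor, a quick diagnostic sketch (not from the thread; it assumes the connection string contains the standard IngestionEndpoint field) is to time a raw HTTPS request to the ingestion endpoint:

# sketch: measure the round-trip time to the ingestion endpoint from the connection string
import time

import requests

CONN_STRING = "<YOUR_APP_INSIGHTS_CONN_STRING>"

# Connection strings are semicolon-separated key=value pairs,
# e.g. "InstrumentationKey=...;IngestionEndpoint=https://...".
parts = dict(pair.split("=", 1) for pair in CONN_STRING.split(";") if "=" in pair)
endpoint = parts["IngestionEndpoint"]

start = time.perf_counter()
response = requests.get(endpoint, timeout=10)
print(f"{endpoint}: HTTP {response.status_code} in {time.perf_counter() - start:.2f}s")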

stratoukos commented 7 months ago

@lzchen I am facing the same issue of very slow initialization.

I have tried both the "one-stop-shop" solution in azure-monitor-opentelemetry and the lower level azure-monitor-opentelemetry-exporter.

In both cases, the delay is unrelated to force_flush and occurs at exporter initialization, i.e.:

For reference, the timings I'm getting (from Greece to West Europe region) are:

Are these delays normal and expected?