Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

[Tracing] Way of excluding specific spans from sending to Application Insights #31292

Open macieyng opened 11 months ago

macieyng commented 11 months ago

Is your feature request related to a problem? Please describe. We've implemented tracing in our application, both distributed and internal. We now want to reduce the cost of data ingested by Application Insights. The most common dependencies being sent to AI are requests to *.documents.azure.com:443 and the Container.query_items dependency; together they make up about 70% of everything we send to AI. We have been running this for 6 months now, we see no anomalies, we have learned what we wanted to learn, and we no longer want to trace Cosmos-related dependencies.

So far I haven't found any article describing how to suppress those spans. I was able to find where Container.query_items is created, but there's no conditional there that could do the job. At the same time, I couldn't find where the *.documents.azure.com:443 dependency is created. At first I assumed it was produced by RequestsInstrumentor, but the OTEL_PYTHON_REQUESTS_EXCLUDED_URLS variable doesn't seem to work for this dependency (I tried two regexes, '[\w\d-]*\.documents\.azure\.com[:443]*.*' and '.*.documents.azure.com:443.*'; both seem to be valid).

Describe the solution you'd like

Describe alternatives you've considered I haven't.

Additional context I could work on this issue once it's decided what should be done. Mentioning @lzchen and @jeremydvoss as they know a lot about tracing in Azure.

lzchen commented 11 months ago

Just to get more context: you are using RequestsInstrumentor in conjunction with some Azure SDKs (like Cosmos DB), correct? Could you paste a code snippet of the relevant components and SDK calls?

github-actions[bot] commented 11 months ago

Hi @macieyng. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

macieyng commented 11 months ago

Sure.

So first of all, we have FastAPI app and Azure Functions app.

Dependencies

We use more or less similar SDKs, like:

azure-core = "<2.0"
azure-core-tracing-opentelemetry = "*"
azure-cosmos = "<5.0"
azure-functions = "<2.0"
azure-monitor-opentelemetry-exporter = "*"
azure-servicebus = "<8.0"
azure-storage-blob = "<13.0"
azure-storage-queue = "<13.0"

and OTEL deps currently at 1.19.0/0.40b

"opentelemetry-api",
"opentelemetry-instrumentation-asgi",
"opentelemetry-instrumentation-fastapi",
"opentelemetry-instrumentation-httpx",
"opentelemetry-instrumentation-redis",
"opentelemetry-instrumentation-requests",
"opentelemetry-instrumentation-sqlalchemy",
"opentelemetry-instrumentation",
"opentelemetry-sdk",

Setup

Our FastAPI setup:

import os

from azure.monitor.opentelemetry.exporter import (
    ApplicationInsightsSampler,
    AzureMonitorTraceExporter,
)
from opentelemetry import trace as otel_trace
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor
from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor
from opentelemetry.instrumentation.redis import RedisInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

def api_tracing(app) -> None:
    resource = Resource.create(
        {
            "service.name": os.getenv("WEBSITE_SITE_NAME", ""),
            "service.instance.id": os.getenv("WEBSITE_INSTANCE_ID", ""),
        }
    )

    # Configure tracing
    trace_exporter = AzureMonitorTraceExporter(disable_offline_storage=True)
    processor = BatchSpanProcessor(trace_exporter)
    sampler = ApplicationInsightsSampler(1.0)
    provider = TracerProvider(resource=resource, sampler=sampler)
    provider.add_span_processor(processor)
    otel_trace.set_tracer_provider(provider)

    FastAPIInstrumentor.instrument_app(app, tracer_provider=provider)
    RequestsInstrumentor().instrument()
    RedisInstrumentor().instrument()
    SQLAlchemyInstrumentor().instrument(enable_commenter=True, commenter_options={})
    HTTPXClientInstrumentor().instrument()

Azure Functions setup looks more or less like in https://github.com/Azure/azure-sdk-for-python/issues/29672, except for instrumentations:

cls._instrumentors: List[BaseInstrumentor] = [
    HTTPXClientInstrumentor,
    SQLAlchemyInstrumentor,
    RedisInstrumentor,
    RequestsInstrumentor,
]

Env variables

We set all required environment variables

AZURE_SDK_TRACING_IMPLEMENTATION=opentelemetry
AZURE_TRACING_ENABLED=true

AZURE_TRACING_ENABLED as per https://github.com/Azure/azure-sdk-for-python/blob/main/sdk/core/azure-core/azure/core/settings.py#L452-L457

Currently, we set excluded urls as follows

OTEL_PYTHON_REQUESTS_EXCLUDED_URLS=".*.in.applicationinsights.azure.com/.*,.*.documents.azure.com/.*"
OTEL_PYTHON_FASTAPI_EXCLUDED_URLS="health"

SDK Usage

When it comes to how we use Cosmos SDK it's more or less:

...
cosmos_client = CosmosClient.from_connection_string("")
db: DatabaseProxy = cosmos_client.get_database_client("")
container: ContainerProxy = db.get_container_client("")

items = container.query_items("select * from c")
...

so I think that's not something unusual.

Examples

And here are examples of how we see this in AI:

I hope this includes all the relevant information you expected.

pvaneck commented 10 months ago

Hey, having a way to exclude specific spans from being sent to App Insights does sound like something that would be useful. However, I'll let @lzchen and @jeremydvoss be the judge.

To provide more info: since it appears that the azure-core-tracing-opentelemetry plugin is being used, the spans that you are getting from it come from any SDK client methods that are decorated with the distributed_trace decorator. And Azure SDK clients generally use the DistributedTracingPolicy for tracing HTTP calls. These seem to be the source of the ContainerProxy.query_items and the *.documents.azure.com* spans. Currently there is no way to suppress these at a granular level.

One thing to note is that when inside one of the spans created by azure-core, automatic HTTP instrumentation provided by other libraries (such as opentelemetry-instrumentation-requests) is suppressed, since azure-core already instruments HTTP calls. This is to prevent duplicate HTTP spans from being produced. I will make sure to document this. This might be why some values specified in OTEL_PYTHON_REQUESTS_EXCLUDED_URLS don't seem to be working.

If you are mainly interested in the HTTP calls, you could rely purely on the RequestsInstrumentor and unset the AZURE_SDK_TRACING_IMPLEMENTATION envvar. However, you will lose the automatic span creation for SDK client methods.

macieyng commented 10 months ago

To provide more info: since it appears that the azure-core-tracing-opentelemetry plugin is being used, the spans that you are getting from it come from any SDK client methods that are decorated with the distributed_trace decorator. And Azure SDK clients generally use the DistributedTracingPolicy for tracing HTTP calls. These seem to be the source of the ContainerProxy.query_items and the *.documents.azure.com* spans. Currently there is no way to suppress these at a granular level.

Seems like the option for us is to swap DistributedTracingPolicy.on_request with our own implementation.

One thing to note is that when inside one of the spans created by azure-core, automatic HTTP instrumentation provided by other libraries (such as opentelemetry-instrumentation-requests) is suppressed, since azure-core already instruments HTTP calls.

I think it would be worth giving users a way to choose the requests instrumentation. Something like:

If you are mainly interested in the HTTP calls, you could purely rely on the RequestsIntrumentor and unset the AZURE_SDK_TRACING_IMPLEMENTATION envvar. However you will lose the automatic span creation for SDK client methods.

This is not an option for us, because we would lose a lot of valuable information.

lzchen commented 10 months ago

@pvaneck

One thing to note is that when inside one of the spans created by azure-core, automatic HTTP instrumentation provided by other libraries (such as opentelemetry-instrumentation-requests) is suppressed, since azure-core already instruments HTTP calls. This is to prevent duplicate HTTP spans from being produced. I will make sure to document this. This might be why some values specified in OTEL_PYTHON_REQUESTS_EXCLUDED_URLS don't seem to be working.

I believe the reason why OTEL_PYTHON_REQUESTS_EXCLUDED_URLS is having no effect is because the spans created are being created by azure-core-tracing-opentelemetry and not opentelemetry-requests-instrumentation. The env var only affects calls made with the latter.

How does azure-core-tracing-opentelemetry trace http calls under the hood? Is it using any instrumentation or are spans being created manually?

Seems like we can either:

  1. Have azure-core-tracing-opentelemetry allow granular control of urls, so users would set exclude urls for both OpenTelemetry requests instrumentation and azure-core
  2. Have azure-core-tracing-opentelemetry support and respect OTEL_PYTHON_REQUESTS_EXCLUDED_URLS
pvaneck commented 10 months ago

How does azure-core-tracing-opentelemetry trace http calls under the hood? Is it using any instrumentation or are spans being created manually?

Most (if not all) Azure SDK clients have spans for HTTP calls explicitly created via DistributedTracingPolicy.

Since azure-core can use other transports that aren't necessarily requests, supporting OTEL_PYTHON_REQUESTS_EXCLUDED_URLS probably wouldn't be the best approach, but we could potentially just use some other envvar.

@lmolkova, it would be good to get your insight on this scenario. Does tracing in the java sdk allow any level of span creation exclusion?

macieyng commented 10 months ago

I just want to point out that what has been discussed so far covers only one aspect of this feature request, the transport/HTTP spans. The other aspect is class/method/function-call spans like Container.query_items. I don't want us to lose track of that, and ideally we find a solution that addresses both.

I could see us doing this in a similar way to how otel instrumentations typically deal with those.

Now, the thing I'm wondering about is whether we can achieve something like: requests are still sent with traceparent in the header, but the span is never passed to the exporter. Would there be any advantage in that, like for debugging on the Azure side?

As for tracing function calls:

That's just an outline of what I was thinking about in terms of implementation.

lmolkova commented 10 months ago

@lmolkova, it would be good to get your insight on this scenario. Does tracing in the java sdk allow any level of span creation exclusion?

In Java we provide an API to disable tracing for specific instances of Azure clients

  TracingOptions tracingDisabled = new OpenTelemetryTracingOptions().setEnabled(false);

  AzureClient sampleClient = new AzureClientBuilder()
      .endpoint("https://my-client.azure.com")
      .clientOptions(new ClientOptions().setTracingOptions(tracingDisabled))
      .build();

  sampleClient.methodCall("get items");

It seems a similar approach can be used here and would allow more fine-grained control (e.g. in future we can allow disabling HTTP spans or logical spans separately).

Update: we have a similar mechanism on .NET where client options have a flag allowing to disable tracing for this specific client instance.

macieyng commented 9 months ago

Hi, can we get an update on whether this request is being worked on, whether there is an estimate of when it might be done, and where it sits in the backlog: a few weeks, months, or not a priority at the moment? Thanks!

pvaneck commented 9 months ago

Hey, @macieyng. Thanks for the reminder. I do think we should still proceed with adding some way to exclude span creation. The approach you outlined would probably offer the most flexibility, but it's something I need to get feedback on from the Python architects since we'd need to introduce new environment variables.

Alternatively, I think a tenable approach that falls more in line with the Java/.NET SDK approaches is to allow some boolean kwarg such as disable_tracing into SDK client methods. Since client method kwargs are propagated to both the decorator and DistributedTracingPolicy, we could check for this value, and stop span creation if True. With this, span creation can be disabled on a per-operation basis. Example:

client = BlobServiceClient.from_connection_string(connection_string)
containers = client.list_containers(name_starts_with="foo", disable_tracing=True)
for container in containers:
    ...
# No spans created!

In any case, I will get architect input this week, and get back to you!

macieyng commented 9 months ago

Thanks for the response.

Regarding the alternative you presented, I would probably prefer enabling it on demand over disabling it. Something like this:

client = BlobServiceClient.from_connection_string(connection_string, disable_tracing=True)
containers = client.list_containers(name_starts_with="foo", force_tracing=True)
for container in containers:

So the behavior would be: nothing in this class is traced, unless the user explicitly tells the client method to be traced. This would allow more granular selection. Also, users would have to make fewer changes if they want to disable everything for e.g. BlobServiceClient.

I'm still interested in variables though, because they give SRE and DevOps teams more control over what is sent out without touching the code.

macieyng commented 9 months ago

@pvaneck I just came up with an idea. Is it possible to create a custom SpanProcessor that would, in on_end, discard spans that match a regex? Not sure how to implement this right now... 🤔

lzchen commented 9 months ago

SpanProcessors receive a ReadableSpan in their on_end method. You could create a custom span processor that overrides BatchSpanProcessor so that, depending on a span-name regex, it does not append the span to the queue for export.

pvaneck commented 9 months ago

So the behavior would be: nothing in this class is traced, unless the user explicitly tells the client method to be traced. This would allow more granular selection. Also, users would have to make fewer changes if they want to disable everything for e.g. BlobServiceClient.

I was initially thinking that, but currently it's a bit tricky at the tracing decorator level to determine whether some client-level tracing flag is enabled or disabled. We would need to set some property that could be consistently checked across all the various clients from within the decorator, but I would need to explore that more. At the operation level, since kwargs are propagated to both the distributed tracing policy and the decorator, checking for some tracing enablement flag is straightforward.

@pvaneck I just came up with an idea. Is it possible to create a custom SpanProcessor that would, in on_end, discard spans that match a regex? Not sure how to implement this right now... 🤔

Hah, not sure why I didn't think of this as a possible solution in the first place, but I think a custom span processor would probably be the way to go for finer-grained control over what actually gets exported. I tested what @lzchen suggested while running operations from the azure-monitor-query library.

import re

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import BatchSpanProcessor

class CustomSpanProcessor(BatchSpanProcessor):

    EXCLUDED_SPAN_NAMES = [r'.*query_resource.*', r'.*query_workspace.*']
    EXCLUDED_URLS = [r'.*api\.loganalytics\.io.*']

    def on_end(self, span: ReadableSpan) -> None:
        for regex in self.EXCLUDED_SPAN_NAMES:
            if re.match(regex, span.name):
                return
        if 'http.url' in span.attributes:
            for regex in self.EXCLUDED_URLS:
                if re.match(regex, span.attributes['http.url']):
                    return
        super().on_end(span)

Adding this as my span processor, I was able to exclude select generated spans from being exported.

macieyng commented 9 months ago

I will push this to our dev environment today. Let's see if our costs can drop thanks to those few lines of code 🤔

macieyng commented 9 months ago

This what we pushed to our dev environment:

import os
import re
from functools import lru_cache

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import BatchSpanProcessor

class RegexExcludeSpanProcessor(BatchSpanProcessor):
    EXCLUDED_SPAN_NAMES = [r".*ContainerProxy.*", r".*http send.*"]
    EXCLUDED_URLS = [
        r".*/dbs/.*/colls/.*/docs/",
        r".*/health",
        r".*documents\.azure\.com.*",
    ]

    @lru_cache()
    def _get_excluded_span_names(self):
        env_excluded_names = os.getenv("TRACING_EXCLUDED_SPAN_NAMES", "")
        if not env_excluded_names:
            return self.EXCLUDED_SPAN_NAMES
        if env_excluded_names.endswith(","):
            env_excluded_names = env_excluded_names[:-1]
        list_of_excluded = self.EXCLUDED_SPAN_NAMES + env_excluded_names.split(",")
        return list_of_excluded

    @lru_cache()
    def _get_excluded_urls(self):
        env_excluded_urls = os.getenv("TRACING_EXCLUDED_URLS", "")
        if not env_excluded_urls:
            return self.EXCLUDED_URLS
        if env_excluded_urls.endswith(","):
            env_excluded_urls = env_excluded_urls[:-1]
        list_of_excluded = self.EXCLUDED_URLS + env_excluded_urls.split(",")
        return list_of_excluded

    def on_end(self, span: ReadableSpan) -> None:
        for regex in self._get_excluded_span_names():
            if re.match(regex, span.name):
                return
        if not span.attributes:
            super().on_end(span)
            return
        if "http.url" in span.attributes:
            for regex in self._get_excluded_urls():
                if re.match(regex, str(span.attributes["http.url"])):
                    return
        super().on_end(span)

I needed to add a few things here and there to comply with mypy.

And here are tests loosely based on BatchSpanProcessor tests

import os
import time

from opentelemetry.sdk import trace
from opentelemetry.sdk.trace import export

from nano_shared.tracing.helpers import RegexExcludeSpanProcessor

class MySpanExporter(export.SpanExporter):
    """Very simple span exporter used for testing."""

    def __init__(
        self,
        destination,
        max_export_batch_size=None,
        export_timeout_millis=0.0,
        export_event=None,
    ):
        self.destination = destination
        self.max_export_batch_size = max_export_batch_size
        self.is_shutdown = False
        self.export_timeout = export_timeout_millis / 1e3
        self.export_event = export_event

    def export(self, spans):
        print("Exporting spans", spans)
        if (
            self.max_export_batch_size is not None
            and len(spans) > self.max_export_batch_size
        ):
            raise ValueError("Batch is too big")
        time.sleep(self.export_timeout)
        self.destination.extend(span.name for span in spans)
        if self.export_event:
            self.export_event.set()
        return export.SpanExportResult.SUCCESS

    def shutdown(self):
        self.is_shutdown = True

class TestRegexExcludeSpanProcessor:
    def test_regex_exclude_span_processor_with_defaults(self):
        tracer_provider = trace.TracerProvider()
        tracer = tracer_provider.get_tracer(__name__)

        spans_names_list = []

        my_exporter = MySpanExporter(destination=spans_names_list)
        span_processor = RegexExcludeSpanProcessor(my_exporter)
        tracer_provider.add_span_processor(span_processor)

        with tracer.start_as_current_span("This should be exported"):
            with tracer.start_as_current_span("ContainerProxy.query_items"):
                with tracer.start_as_current_span("/health http send"):
                    with tracer.start_as_current_span("request to azure") as span:
                        span.set_attribute(
                            "http.url", "documents.azure.com:433/dbs/ACCOUNT/"
                        )

        span_processor.shutdown()
        assert my_exporter.is_shutdown is True

        assert ["This should be exported"] == spans_names_list

    def test_regex_exclude_span_processor_with_envs(self):
        os.environ["TRACING_EXCLUDED_SPAN_NAMES"] = r"ServiceBusSender,MyClass,"
        os.environ["TRACING_EXCLUDED_URLS"] = r".*azure\.com.*,"

        tracer_provider = trace.TracerProvider()
        tracer = tracer_provider.get_tracer(__name__)

        spans_names_list = []

        my_exporter = MySpanExporter(destination=spans_names_list)
        span_processor = RegexExcludeSpanProcessor(my_exporter)
        tracer_provider.add_span_processor(span_processor)

        with tracer.start_as_current_span("MyClass"):
            with tracer.start_as_current_span("ServiceBusSender.send"):
                with tracer.start_as_current_span(
                    "This should not be included"
                ) as span_1:
                    with tracer.start_as_current_span("This should be included"):
                        with tracer.start_as_current_span(
                            "This should not be included"
                        ) as span_2:
                            span_1.set_attribute("http.url", "https://www.azure.com")
                            span_2.set_attribute(
                                "http.url", "https://www.weird.link.azure.com"
                            )

        span_processor.shutdown()
        assert my_exporter.is_shutdown is True

        assert ["This should be included"] == spans_names_list
macieyng commented 9 months ago

This is sooooooo cooool. Dependencies from one service cut in half!

(Screenshot, 2023-09-21, showing the drop in dependency volume.)

At this point I think this is the way to go. Not sure how it would affect performance, though. This kind of processor could be added to the SDK for folks to use, or BatchSpanProcessor could be extended to provide this kind of functionality.

Many thanks guys! We made my manager very very happy 😉

pvaneck commented 9 months ago

Thanks for sharing your idea / code / tests, and I'm glad you got it working! I think it'd be helpful if we have a code sample and/or documentation to show how to handle this type of use case.

Also, one thing to note is that when the OTel semantic conventions go stable, the tracing plugin will adopt the new HTTP attribute naming: https://opentelemetry.io/docs/specs/semconv/url/url/ . It might be prudent to preemptively check for url.full (the rename of http.url) in tandem with http.url inside your SpanProcessor.

karpikpl commented 8 months ago

Hey Maciej @macieyng , thanks for all your comments on OTEL in python functions!

Is there a way to use the SpanProcessor you implemented when using configure_azure_monitor rather than the exporter directly? And can you control log levels with environment variables, or do you have to find each logger by name and set its level?

logging.getLogger("azure.core.pipeline.policies.http_logging_policy").setLevel(logging.WARNING)

I'm trying to remove all unnecessary logs like Received FunctionInvocationRequest etc. In .NET I could just configure log levels in host.json or via logLevel env variables.
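As far as I know there is no built-in equivalent of host.json log levels in the Python worker, but a small stdlib helper can emulate the pattern. The LOGGER_LEVELS variable name and its name=LEVEL,name=LEVEL format below are an invented convention for illustration, not an Azure feature:

```python
import logging
import os

def apply_logger_levels_from_env(var_name: str = "LOGGER_LEVELS") -> None:
    """Read per-logger levels from an env var shaped like
    'azure.core.pipeline.policies.http_logging_policy=WARNING,azure=INFO'.
    (The variable name and format are an invented convention.)"""
    spec = os.getenv(var_name, "")
    for entry in filter(None, (e.strip() for e in spec.split(","))):
        name, _, level = entry.partition("=")
        if level:
            # Logger.setLevel accepts level names like "WARNING" directly.
            logging.getLogger(name).setLevel(level.strip().upper())

# Example: silence azure-core's HTTP logging policy via the environment.
os.environ["LOGGER_LEVELS"] = (
    "azure.core.pipeline.policies.http_logging_policy=WARNING"
)
apply_logger_levels_from_env()
```

This keeps the noisy-logger list out of the code, so an ops team can tune it per deployment slot without a redeploy.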

johschmidt42 commented 2 months ago

Thanks for the solution @pvaneck.

Would you use the same approach if certain exceptions need to be excluded? Or is there a better way offered by Azure or OpenTelemetry that I didn't find? I'd like to exclude CosmosResourceNotFoundError from the azure-cosmos library. At the moment I check whether span.status.description contains "CosmosResourceNotFoundError", but maybe there are better attributes for this?

from opentelemetry.sdk.trace import ReadableSpan
from opentelemetry.sdk.trace.export import BatchSpanProcessor

class CustomSpanProcessor(BatchSpanProcessor):
    def on_end(self, span: ReadableSpan) -> None:
        if isinstance(span.status.description, str) and "CosmosResourceNotFoundError" in span.status.description:
            return

        super().on_end(span=span)
lzchen commented 2 months ago

@johschmidt42

Most users use span.name to identify the spans to ignore, but if your approach works and the status description uniquely identifies the spans you don't want, that is fine as well.