block / goose

Goose is a developer agent that operates from your command line to help you do the boring stuff.
https://block.github.io/goose/
Apache License 2.0

Observability plugin system #203


ajgray-stripe commented 3 weeks ago

šŸ‘‹ Hey, all!

I work at Stripe, and we'd like to experiment with opening up goose for internal usage. As part of that, we'd like a telemetry plugin system, so that we can write a plugin to emit events and metrics into our infrastructure. We noticed that Langfuse support is being worked on right now, and that's the right kind of shape for what we'd want to see, but we don't use Langfuse, and it'd be great if there were a more general system for metrics that both Langfuse and our future telemetry could plug into via a standardized interface.

Happy to implement this myself! But first, I'd appreciate guidance on whether this plan seems like a reasonable interface for that system:

I'd really appreciate it if anyone could let me know whether this makes sense as a reasonable plan for implementing the feature!

One might also consider a more abstracted way to implement this functionality -- something like git hooks, where various kinds of events can trigger arbitrary functionality. I'm not sure whether that higher level of abstraction would be a better overall approach -- please let me know what you think!
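For concreteness, a git-hooks-style event system could look something like the sketch below. Everything here (`TelemetryEvent`, `TelemetryHooks`, the event names) is hypothetical illustration, not an existing goose API:

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical event payload: a name plus arbitrary attributes.
@dataclass
class TelemetryEvent:
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)

# Hypothetical registry: backends (Langfuse, OTel, internal infra)
# subscribe a callback per event name, or "*" for everything.
class TelemetryHooks:
    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[TelemetryEvent], None]]] = {}

    def subscribe(self, event_name: str, callback: Callable[[TelemetryEvent], None]) -> None:
        self._subscribers.setdefault(event_name, []).append(callback)

    def emit(self, event: TelemetryEvent) -> None:
        # Deliver to exact-name subscribers, then wildcard subscribers.
        for name in (event.name, "*"):
            for callback in self._subscribers.get(name, []):
                callback(event)

hooks = TelemetryHooks()
seen: List[TelemetryEvent] = []
hooks.subscribe("completion.finished", seen.append)
hooks.emit(TelemetryEvent("completion.finished", {"model": "gpt-4o", "tokens": 123}))
```

A backend-specific plugin would then just be a set of callbacks registered against these event names, keeping goose itself free of any vendor dependency.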

alicehau commented 3 weeks ago

Hey! Love the idea of a standardized interface for tracing so people can use the tool of their choice. @codefromthecrypt also has a PR out for adding OpenTelemetry (OTel) tracing.

One reservation about the approach would be pulling the decorator definition into Goose vs. exchange. Since we use the decorator on chat completion functions (which live in exchange) in addition to Goose functions, it would be a bit awkward for exchange to import from Goose.

I think it would be fine to scope this change to focus on observability. We'd welcome the contribution!

michaelneale commented 3 weeks ago

@ajgray-stripe yeah that would be great.

Currently there is Langfuse in the mix for LLM tracing, as you've noticed.

There is also OTel: https://github.com/block/goose/pull/75 - a work in progress (more general tracing)

and just as a trial I looked at systrace as one option to wire things in dynamically (others probably have more experience): https://github.com/block/goose/pull/211

so somewhere in there is probably a more general approach that people using goose can choose to wire in with their preferred tool (ideally without too many dependencies) - OTel could be one, or there could be some way to dynamically weave it in via systrace

marcklingen commented 3 weeks ago

I am one of the founders/maintainers of Langfuse (repo). I think the push towards OTel here is great, as it helps to standardize instrumentation, which can then be collected by LLMOps products like Langfuse as well as any given observability stack. Right now the semantic conventions are still developing (the ecosystem is not there yet, and OTel-based instrumentation is not necessarily interoperable), but I am optimistic that this is the best path forward mid-term. Happy to help in case you have any questions, as this is what I focus on all day, and I love goose as a project :)

michaelneale commented 3 weeks ago

@marcklingen thanks - that makes sense

codefromthecrypt commented 3 weeks ago

ps (and apologies for being awol. coming off time off soon)

https://github.com/block/goose/pull/75 was just a toe-hold. A more canonical trace, e.g. with instrumentation that captures chat completion data (full body), would look more like this (not exactly, as goose is different): [screenshot of an example trace]

The main thing is porting instrumentation like this for OpenAI over to goose/exchange, which has its own abstraction, so we can't reuse existing LLM instrumentation such as the litellm proxy lib, etc.

codefromthecrypt commented 3 weeks ago

On explicit instrumentation (a.k.a. manual) vs. building a tracing abstraction: when someone makes a tracing abstraction, usually it only covers manual tracing. It is hard to get a tracing abstraction right, and in my experience they are often bug factories until proven otherwise. Most often these secondary abstractions miss context scoping (leading to broken traces), fail in weird ways under async, and cause more problems than integrating with an existing tool or conditionally loading one of two choices.

Above are my opinions and so take it for what it is worth.
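To illustrate the context-scoping pitfall: parent/child span relationships must follow the logical call context, not a mutable global, or concurrent tasks will corrupt each other's traces. A minimal sketch using the stdlib `contextvars` module (names hypothetical; real tracers like OTel do something equivalent internally):

```python
import contextvars
from contextlib import contextmanager

# contextvars gives each async task its own "current span" value,
# which a plain module-level global would not.
_current_span = contextvars.ContextVar("current_span", default=None)

@contextmanager
def span(name):
    parent = _current_span.get()
    record = {"name": name, "parent": parent["name"] if parent else None}
    token = _current_span.set(record)
    try:
        yield record
    finally:
        # Restoring the previous value keeps sibling spans from seeing
        # each other as parents -- the bug naive abstractions often ship.
        _current_span.reset(token)

with span("request") as outer:
    with span("llm_call") as inner:
        pass
    with span("tool_call") as sibling:
        pass
```

Both `llm_call` and `tool_call` correctly end up as children of `request`; an abstraction that forgets the `reset` step would instead chain them into each other.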


The long story of https://github.com/block/goose/pull/75: it is explicit/manual tracing, but it integrates cleanly with off-the-shelf pieces like httpx instrumentation with no code changes.

In OTel, you have "auto-instrumentation", which is implicit, so not visible to the calling code. Then you can also add tracing where you need it.
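The "implicit" part can be approximated in plain Python by wrapping a function once at setup time, so calling code never changes. This is a toy sketch of the idea, not the real OTel mechanism (which patches libraries like httpx in a similar spirit):

```python
import functools

calls = []  # stand-in for an exporter's span buffer

def instrument(func):
    # Wrap a function so every call is recorded transparently.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        calls.append({"span": func.__name__, "args": args})
        return func(*args, **kwargs)
    return wrapper

def complete(prompt):
    # Stand-in for a chat completion call.
    return f"echo: {prompt}"

# "Auto-instrumentation": rebind the name once at startup; callers
# keep calling complete() with no code changes.
complete = instrument(complete)

result = complete("hello")
```

Manual spans added by application code then layer on top of these implicit ones in the same trace.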

So, for a one-shot CLI, one issue is making sure traces flush on exit, so some fixture is loaded like this (it doesn't matter whether it is a no-op or not; the executing code has no problem when tracing is disabled):

```python
import atexit
import os

from opentelemetry import trace

tracer = trace.get_tracer("chatbot-rag-app")

# Register a function to send buffered spans when the process exits.
def shutdown_tracer_provider():
    trace.get_tracer_provider().shutdown()

print(os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT"))

atexit.register(shutdown_tracer_provider)
```

Then, you can use annotation style, or the tracer itself.

```python
@app.cli.command()
# Currently, flask auto-instrumentation does not trace CLI commands
@tracer.start_as_current_span("create_index")
def create_index():
    ...
```
Anything done manually like this contributes to the same trace as implicit instrumentation, such as HTTP requests, or genai instrumentation if we added it (similar to the other half-dozen Python OTel genai SDKs out there, but for goose/exchange). Most of the time, people add manual spans because they want to group traces around a task or something similar.

codefromthecrypt commented 3 weeks ago

So, the tl;dr on the above is: proceed with extreme caution on a plugin system, especially a transient one for something that's going to move to OTel anyway. Remember, this "genai semantics" thing is only about the data recorded; the tracer abstraction, transports, etc. are very stable. So you can still use OTel even if you have doubts about the captured data changing. If we go with a meta-abstraction (duplicating OTel in some ways), we should have very good tests for each supported option. For example, in my PR I added unit tests for this reason: each thing we believe the hooks should work with should be testable, with HTTP recordings in the worst case, but better with integration tests. That's my summary advice; hope it helps!

michaelneale commented 2 weeks ago

@ajgray-stripe yeah, I guess the summary is: go ahead! Maybe OTel is the go if we want a dependency? (And I am curious how we can then bring in other things as needed without necessarily adding to the main dependency list.)

Really keen to see what you come up with that suits your needs, as this is all moving so fast that no one quite knows what they need yet.

ajgray-stripe commented 1 week ago

Thanks for all the advice, folks! Put up a PR^

The approach I took there ended up being very similar to the existing Langfuse integration, but I believe it'll be lightweight and general enough to let different kinds of integrations (in particular OTel) plug in reasonably easily.

codefromthecrypt commented 1 week ago

Bear in mind I'm putting the OTel work on the back burner until the recently released OpenAI integration settles down. This will avoid any thrash, as the main thing here is prompt/completion capture, if I understand correctly.

unsolicited 2p on any plugin system is "rule of three" and we just won't know if it will work for something else until it is tried. Keeping this experimental while it serves mainly to decouple langfuse has its own value, but usually reserve judgement on if something will work until a POC shows it will. In other words, expect concern areas will be in the tracing side of it (propagation, parameter capture, etc.), but until it is proven multiple times, also expect it to change for each new use of it.