educational-technology-collective / jupyterlab-pioneer

A JupyterLab extension for generating and exporting JupyterLab event telemetry data.
https://jupyterlab-pioneer.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Adding opentelemetry exporter #29

Closed costrouc closed 9 months ago

costrouc commented 9 months ago

I was really fascinated by this project, since it enables telemetry from the JupyterLab client to be recorded on the server side. This PR adds support for an OpenTelemetry exporter and additionally documents how to run JupyterLab with it enabled.

I recommend running Jaeger locally (the easiest way to view traces):

version: "3"

services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "4317:4317"
      - "6831:6831/udp"
      - "16686:16686"   

Next, install this package along with several additional packages:

git clone <this repo>
cd jupyterlab-pioneer
pip install .
pip install opentelemetry-exporter-otlp-proto-grpc opentelemetry-instrumentation-tornado opentelemetry-distro jupyterlab

Now apply the usual notebook configuration detailed in the README to activate activeEvents and exporters.
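For reference, a minimal configuration might look like the following. This is a sketch based on the README's examples; the exact exporter types and event names should be checked against the version of jupyterlab-pioneer you have installed.

```python
# jupyter_jupyterlab_pioneer_config.py -- minimal sketch, keys per the README;
# verify against your installed jupyterlab-pioneer version.
c.JupyterLabPioneerApp.exporters = [
    {
        # print events to the server console; the OTEL exporter from this PR
        # would be registered the same way
        "type": "console_exporter",
    }
]
c.JupyterLabPioneerApp.activeEvents = [
    {"name": "CellExecuteEvent", "logWholeNotebook": False},
    {"name": "NotebookOpenEvent", "logWholeNotebook": False},
]
```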

Then run JupyterLab with instrumentation; note that we call `jupyter-lab` and not `jupyter lab`.

OTEL_SERVICE_NAME=jupyter-otel OTEL_TRACES_EXPORTER=otlp OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:4317  opentelemetry-instrument jupyter-lab

Now you can view traces in Jaeger by visiting localhost:16686. Make sure to interact with the notebooks quite a bit to generate events.

[screenshot: Jaeger trace search UI]

When we dive into one of the traces we see

[screenshot: expanded trace detail]

Take this example, where we filter for all events with event=CellExecuteEvent. This is a very basic PR, but my hope is that this framework could become more aligned with OTEL and give JupyterLab first-class support for monitoring and telemetry, since OTEL supports HTTP, Kafka, InfluxDB, console, etc. as exporters out of the box. Additionally, this allows full monitoring of the Tornado service and all API methods called.

cab938 commented 9 months ago

This is interesting @costrouc , we had originally considered using OTEL for everything but it introduced more dependencies than we were willing to add.

In this PR all of the events are added into a single OTEL span. It feels to me (though I am far from an OTEL expert!) that it would be nice to have a span per notebookSessionID, or perhaps something even more fine-grained than that. For instance, we have the ability to log individual content changes within a code cell, which could then be sub-spans within something larger (like a cell edit event, perhaps identified by subsequent cell change events).

This would be a lot of bookkeeping on the exporter/server side, which is problematic as clients can just disappear (and thus we need to deal with resource cleanup). But in the short term, is it possible to look up a span based on an identifier such as a notebookSessionId and then add events as child spans? That way individual sessions would align with individual spans, which would be nice organizationally, I think.

costrouc commented 9 months ago

This is interesting @costrouc , we had originally considered using OTEL for everything but it introduced more dependencies than we were willing to add.

To me this is the power of `opentelemetry-instrument`, which makes `from opentelemetry import trace` the only effective dependency. There are many exporters, as I mentioned before: https://opentelemetry.io/docs/instrumentation/python/exporters/. This approach also instruments much more than just jupyterlab-pioneer, since it allows monitoring of all API routes in Tornado. Something I didn't add was https://opentelemetry-python-contrib.readthedocs.io/en/latest/instrumentation/tornado/tornado.html#request-response-hooks, which effectively lets us observe all Tornado routes, so we could easily mark the API create/delete events for notebooks/kernels, etc.

In this PR all of the events are added into a single OTEL span

Each event gets its own root span, which is the POST /jupyterlab-pioneer/export API call. All spans have context/attributes and can register events, which also have attributes. It is easy to filter on an attribute, e.g. notebookSessionId=6700d035-5aa9-4c85-9bad-f2f096a7cdb8. I've attached an example below in Jaeger. Also note that I am not recommending Jaeger specifically; it is just the easiest to test with. This would integrate just as easily with Grafana Tempo, Datadog, etc. Additionally, a big part of the power of using OpenTelemetry is the collector, https://opentelemetry.io/docs/collector/, which would allow you to send subsets of spans/events to specific exporters, e.g. send all HTTP traces to Datadog and all notebook events to Kafka: https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/exporter/kafkaexporter/README.md.

[screenshot: Jaeger trace search filtered by notebookSessionId]

For instance, we do have the ability to log individual content changes within a code cell, which then could be sub-spans within something larger (like a cell edit event, perhaps identified by subsequent cell change events).

We would just create another span for that specific /jupyterlab-pioneer/export event when it happens.

This would be a lot of book keeping on the exporter/server side, which is problematic as clients can just disappear (and thus we need to deal with resource cleanup).

Exactly, which is why notebookSessionId is an attribute, so we don't have to do any bookkeeping. We just construct a query for all traces that have the attribute notebookSessionId=asdf....qwer and can then calculate things such as the creation/deletion of a given notebook session. The beauty is that we don't have to rely on spans for the creation/deletion of sessions, since we can fall back to when events actually happen.

Additionally, all telemetry data is queryable. In this specific example, Jaeger has a query API: https://www.jaegertracing.io/docs/1.51/apis/. For example, here I query for all traces that started at 1701267557355000 and ended by 1701271157355000, for service jupyter-otel, on the POST /jupyterlab-pioneer/export operation, with attribute notebookSessionId=390c5065-b463-4878-806d-46ba05f8f589:

curl "http://localhost:16686/api/traces?end=1701271157355000&limit=20&lookback=1h&maxDuration&minDuration&operation=POST%20%2Fjupyterlab-pioneer%2Fexport&service=jupyter-otel&start=1701267557355000&tags=%7B%22notebookSessionId%22%3A%22390c5065-b463-4878-806d-46ba05f8f589%22%7D"

Sample data

{
  "data": [
    {
      "traceID": "07d5f2a6dba1738cd779791f2159a9b0",
      "spans": [
        {
          "traceID": "07d5f2a6dba1738cd779791f2159a9b0",
          "spanID": "3b8ae724bfc34a3e",
          "operationName": "POST /jupyterlab-pioneer/export",
          "references": [],
          "startTime": 1701270979632375,
          "duration": 441,
          "tags": [
            {"key": "http.method", "type": "string", "value": "POST"},
            {"key": "http.scheme", "type": "string", "value": "http"},
            {"key": "http.host", "type": "string", "value": "localhost:8888"},
            {"key": "http.target", "type": "string", "value": "/jupyterlab-pioneer/export"},
            {"key": "http.client_ip", "type": "string", "value": "127.0.0.1"},
            {"key": "net.peer.ip", "type": "string", "value": "127.0.0.1"},
            {"key": "tornado.handler", "type": "string", "value": "jupyterlab_pioneer.handlers.RouteHandler"},
            {"key": "http.status_code", "type": "int64", "value": 200},
            {"key": "span.kind", "type": "string", "value": "server"},
...

There are a ton of tools already built for collecting and analyzing telemetry data. As an example workflow, you could periodically fetch all event data matching some condition, collect the JSON, and compute meaningful statistics. Additionally, these things integrate really well into Grafana etc., where you could unify logs, metrics, and these traces for a particular user.
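Such a periodic job could look roughly like the following. The `/api/traces` endpoint and its parameters follow Jaeger's query API; the service name and tag key are taken from the examples above, while `JAEGER_URL` and the function names are assumptions for the sketch.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

# Hypothetical sketch: pull recent traces from Jaeger's HTTP query API
# and count exported pioneer events per notebook session.
JAEGER_URL = "http://localhost:16686"

def fetch_traces(service="jupyter-otel", lookback="1h", limit=100):
    # Query parameters per Jaeger's /api/traces endpoint.
    params = urllib.parse.urlencode(
        {"service": service, "lookback": lookback, "limit": limit}
    )
    with urllib.request.urlopen(f"{JAEGER_URL}/api/traces?{params}") as resp:
        return json.load(resp)["data"]

def events_per_session(traces):
    # Each span carries its tags as a list of {key, type, value} dicts,
    # as in the sample JSON above.
    counts = Counter()
    for trace in traces:
        for span in trace.get("spans", []):
            tags = {t["key"]: t["value"] for t in span.get("tags", [])}
            if "notebookSessionId" in tags:
                counts[tags["notebookSessionId"]] += 1
    return counts
```

Running `events_per_session(fetch_traces())` against a live Jaeger instance would then give per-session event counts that feed directly into dashboards or reports.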

mengyanw commented 9 months ago

Thank you @costrouc for contributing to this repo! I am excited to add open telemetry as another default exporter to the extension.

mengyanw commented 9 months ago

Thanks again @costrouc! This should be available in the new release v0.1.11 now 🎉