Financial-Times / dotcom-reliability-kit

🪨 A well tested suite of tools designed to help FT.com applications be more reliable and measurable
MIT License
7 stars 1 forks source link

[Opentelemetry plugin] Specify api.ft.com metrics for underneath APIs #1191

Open gyss opened 2 months ago

gyss commented 2 months ago

We'd like to have added a hook in the Opentelemetry plugin that adds specificity to the api.ft.com net_peer_name attribute. So we can select the metrics specific to an API that lives inside the API Gateway

What problem does this feature solve?

Due to the ongoing Graphite migration, we need to replace some of the system's Graphite based alerts with Opentelemetry.

Some of these alerts do need to trigger when a specific service is not working as it should. For example, let's see this alert in next-myft-api

    {
        name: 'Insights API is serving topic recommendations',
        id: 'snr-recs-availability',
        type: 'graphiteThreshold',
        metric: 'asPercent(sumSeries(next.heroku.myft-api.*.fetch.insights-api-topic-recommendations.response.status_{5xx,4xx}.count), sumSeries(next.heroku.myft-api.*.fetch.insights-api-topic-recommendations.response.status_200.count))',
        threshold: 10,
        severity: 2,
        businessImpact: 'Topic recommendations in myft pages or in some daily digest email might be missing.',
        technicalSummary: 'Calls to insight API to get topic recommendations are returning 4xx and 5xx results',
        panicGuide: 'Check if there is any issue with Insights API with Search and Recommendation team. You can get more details on Splunk with the query `index=heroku source="next-myft-api" sourcetype="heroku:app" level=error path="/v2/recommendation/user/*/concept"`. It is possible to modify the trafic to get all recommendations from neo4j: in Doppler, set up the env PERCENT_SNR_TOPIC_RECS to 0 and make sure FORCE_SNR_TOPIC_RECS is false. That will stop the calls to Insight API.',
    }

Graphite health check

The metrics inside next.heroku.myft-api.*.fetch.insights-api-topic-recommendations correspond to the requests done to the service in this URL https://api.ft.com/snr/v1/insights/topic-recommendations/, which lives inside of the API Gateway. But when creating the Opentelemetry Grafana panel for this alert, we find that net_peer_name doesn't allow us to select topic-recommendations, only api.ft.com because this label is domain based.

Screenshot 2024-09-06 at 09 58 26

Link to Grafana panel

This means that if next-myft-api uses many services that are inside API Gateway (which it does), all the metrics are bundled together and we can't tell where the 5xx or 4xx are coming from

Ideal solution

As suggested by Sam Parkinson here, we could add a hook in the Opentelemetry plugin to add the first part of the path to the net_peer_name attribute. In this case it would be api.ft.com/snr, which works for us because next-myft-api doesn't use any other API from snr. But maybe we could consider adding the last part of the path, eg : api.ft.com/topic-recommendations

Anyways, this is a problem shared in many repositories (and probably teams), and it'd be amazing to have this solved in the Opentelemetry plugin

Alternatives

An alternative is to code this OTel hook in every system that needs this granularity.

rowanmanning commented 2 months ago

Hiya, status update on where I've got to so far. This is hard! I'm encountering two issues:

HTTP vs Undici

I set up a test app locally which makes requests via both Undici and node-fetch – these are the two most common ways we make client requests from our apps. I can get the Undici request to log but I don't seem to be able to hook into the HTTP (node-fetch) request at all. How is this working?? @sjparkinson this is making me question everything - are we definitely getting metrics from node-fetch? 😂

Here's what I've been trying (I've also tried every other option under @opentelemetry/instrumentation-http):

getNodeAutoInstrumentations({
    '@opentelemetry/instrumentation-http': {
        requestHook: (span, request) => {
            console.log('HTTP REQUEST HOOK', request.url);
        }
    },
    '@opentelemetry/instrumentation-undici': {
        requestHook: (span, request) => {
            console.log('UNDICI REQUEST HOOK', request.path);
        }
    }
    // ...
})

When I make the following requests:

const nFetch = require('node-fetch');

const bulbasaur = await fetch('https://pokeapi.co/api/v2/pokemon/bulbasaur'); // native fetch
const squirtle = await nFetch('https://pokeapi.co/api/v2/pokemon/squirtle'); // node-fetch

I get only this log:

UNDICI REQUEST HOOK /api/v2/pokemon/bulbasaur

I might be missing something, maybe there's a different instrumentation library that handles HTTP client requests? Seems unlikely though.

Custom attributes on existing metrics

Turns out that the instrumentations for Node.js are quite strict. Ignoring the HTTP issue I decided to try and add custom metrics to native fetches. Attributes would get added to the http.client.request.duration metric.

If I try adding a custom attribute that isn't explicitly supported or even if I add the one experimental attribute (url.template) then I see nothing. So adding the following:

getNodeAutoInstrumentations({
    // ...
    '@opentelemetry/instrumentation-undici': {
        startSpanHook(request) {
            return {
                ['url.template']: '/hello-world'
            };
        }
    }
    // ...
})

The metric that gets exported is like this:

{
    // ...
    attributes: [
        {
            'http.response.status_code': 200,
            'http.request.method': 'GET',
            'server.address': 'pokeapi.co',
            'server.port': 443,
            'url.scheme': 'https'
        }
    ]
}

However if I reuse one of the metrics that's already exported then it works:

getNodeAutoInstrumentations({
    // ...
    '@opentelemetry/instrumentation-undici': {
        startSpanHook(request) {
            return {
                ['server.address']: 'why.wont.you.work'
            };
        }
    }
    // ...
})

This exports the metric:

{
    // ...
    attributes: [
        {
            'http.response.status_code': 200,
            'http.request.method': 'GET',
            'server.address': 'why.wont.you.work',
            'server.port': 443,
            'url.scheme': 'https'
        }
    ]
}

Very frustrating. If this is a hard limitation at the moment then I'm not sure what we can do to get paths into our metrics.

sjparkinson commented 2 months ago

If I try adding a custom attribute that isn't explicitly supported or even if I add the one experimental attribute (url.template) then I see nothing.

I wonder if that is true for attributes in the semantic conventions like peer.service? Or just no new attributes to the map.

@sjparkinson this is making me question everything - are we definitely getting metrics from node-fetch? 😂

No idea! Though I have wondered if some systems are a bit light on client request metrics, worth a browse of https://grafana.ft.com/d/HzGduSwIz/node-js?orgId=1&var-workspace=I7o8Aa1Sz&var-job=next-myft-api&var-environment=production&var-cloud_provider=heroku&var-cloud_region=All&var-http_method=All&var-http_route=All&var-http_errors=All&viewPanel=13 for a system with well known dependencies.

But my understanding for node-fetch is that https://github.com/open-telemetry/opentelemetry-js/blob/main/experimental/packages/opentelemetry-instrumentation-http/src/http.ts is covering it, as it's a wrapper.

jonnangle commented 2 months ago

It's a bit light on detail but this issue suggests that load order might be relevant.

sjparkinson commented 2 months ago

Some thoughts on this (assuming it is possible).

There's a helpful list of API endpoints under Tyk at https://apigateway.in.ft.com/check-apis, 179 of them today.

I think there are two options for adding the detail we're looking for:

  1. Using a map of URLs to system codes, add a peer.service attribute to requests

    e.g. a client request to https://api.ft.com/enrichedcontent/... would have peer.service set to enriched-content-read-api.

  2. Update the net.peer.name (at some point this becomes server.address FYI) attribute to include more of the listen path

    e.g. a client request to https://api.ft.com/content/a21abcbe-ae52-40c0-8e8d-858372065e2a would have net.peer.name updated to api.ft.com/content/{id}.

I don't really have a strong list of pros and cons, but my gut feeling is 2 would be most intuitive, and perhaps a little easier to maintain (let's try not to reinvent the next-metrics list right?).

rowanmanning commented 2 months ago

@sjparkinson I've opened a draft PR here which includes a bunch of manual edits and a test app to illustrate the issue: https://github.com/Financial-Times/dotcom-reliability-kit/pull/1202.

  1. Still not getting anything from node-fetch, very interested if you can find a way!

  2. Setting net.peer.name doesn't seem to work

sjparkinson commented 2 months ago

Still not getting anything from node-fetch, very interested if you can find a way!

Reading through how the SDK wraps and how node-fetch is written, I wonder if it's related to node-fetch using node:http and node:https as imports?

https://github.com/node-fetch/node-fetch/blob/8b3320d2a7c07bce4afc6b2bf6c3bbddda85b01f/src/index.js#L9-L10

(I'm sort of seeing this as maybe a good opportunity on figuring out how to write an auto-instrumentation package.)

sjparkinson commented 2 months ago

Or, reading https://github.com/open-telemetry/opentelemetry-js/blob/main/experimental/packages/opentelemetry-instrumentation/README.md#limitations, is it related to node-fetch using ESM imports?

rowanmanning commented 1 month ago

possibly for the latest node-fetch but I'm explicitly using a v2 version that uses CommonJS 😄

sjparkinson commented 1 month ago

Still not getting anything from node-fetch, very interested if you can find a way!

I am also finding the same with a vanilla express app. Traces are captured for the POST request by the @opentelemetry/instrumentation-net package, but nothing from @opentelemetry/instrumentation-http.

Not sure why though. Something to fix somewhere!

I have asked in the #otel-js CNCF channel, join link if you're interested. Can't say I always get support, but worth asking.

sjparkinson commented 1 month ago

I have however been able to use node-fetch@v2 fine with the http example at https://github.com/open-telemetry/opentelemetry-js/tree/main/examples/http. Bit of a fiddle but I installed node-fetch in the client and was able to get the expected trace.

Screenshot 2024-09-19 at 11 41 43
diff --git a/examples/http/client.js b/examples/http/client.js
index 168d43392..0cf4c675f 100644
--- a/examples/http/client.js
+++ b/examples/http/client.js
@@ -2,7 +2,7 @@

 const api = require('@opentelemetry/api');
 const tracer = require('./tracer')('example-http-client');
-const http = require('http');
+const fetch = require('node-fetch');

 /** A function which makes requests and handles response. */
 function makeRequest() {
@@ -10,18 +10,11 @@ function makeRequest() {
   // the span, which is created to track work that happens outside of the
   // request lifecycle entirely.
   tracer.startActiveSpan('makeRequest', (span) => {
-    http.get({
-      host: 'localhost',
-      port: 8080,
-      path: '/helloworld',
-    }, (response) => {
-      const body = [];
-      response.on('data', (chunk) => body.push(chunk));
-      response.on('end', () => {
-        console.log(body.toString());
+    fetch('http://localhost:8080/helloworld')
+      .then((response) => {
+        response.text();
         span.end();
       });
-    });
   });

   // The process must live for at least the interval past any traces that
diff --git a/examples/http/package.json b/examples/http/package.json
index 64a05ed43..3d8638469 100644
--- a/examples/http/package.json
+++ b/examples/http/package.json
@@ -5,10 +5,10 @@
   "description": "Example of HTTP integration with OpenTelemetry",
   "main": "index.js",
   "scripts": {
-    "zipkin:server": "cross-env EXPORTER=zipkin node ./server.js",
-    "zipkin:client": "cross-env EXPORTER=zipkin node ./client.js",
-    "jaeger:server": "cross-env EXPORTER=jaeger node ./server.js",
-    "jaeger:client": "cross-env EXPORTER=jaeger node ./client.js",
+    "zipkin:server": "EXPORTER=zipkin node ./server.js",
+    "zipkin:client": "EXPORTER=zipkin node ./client.js",
+    "jaeger:server": "EXPORTER=jaeger node ./server.js",
+    "jaeger:client": "EXPORTER=jaeger node ./client.js",
     "align-api-deps": "node ../../scripts/align-api-deps.js"
   },
   "repository": {
@@ -37,7 +37,8 @@
     "@opentelemetry/resources": "1.26.0",
     "@opentelemetry/sdk-trace-base": "1.26.0",
     "@opentelemetry/sdk-trace-node": "1.26.0",
-    "@opentelemetry/semantic-conventions": "1.27.0"
+    "@opentelemetry/semantic-conventions": "1.27.0",
+    "node-fetch": "^2.7.0"
   },
   "homepage": "https://github.com/open-telemetry/opentelemetry-js/tree/main/examples/http",
   "devDependencies": {

So it does seem like an issue with auto-instrimentation as the issue Jon linked to suggests?

rowanmanning commented 1 month ago

"HTTP vs Undici" from my original comment is resolved in v2.0.11 of the package but we still need to work out how we want to augment the metrics with our paths. "Custom attributes on existing metrics" is still blocking and we'll either need to find a workaround or try to get the OpenTelemetry maintainers to support custom attributes.