GoogleCloudPlatform / opentelemetry-operations-js

This repository is home to the Google Cloud exporters (Trace and Monitoring) for the OpenTelemetry Node.js project (https://github.com/open-telemetry/opentelemetry-js)
Apache License 2.0

Document usage in Cloud Run functions #750

Open CapoD opened 2 months ago

CapoD commented 2 months ago

Is your feature request related to a problem? Please describe. We are using Firebase functions (2nd gen) that are deployed as Cloud Run functions in GCP. These functions use Angular Server-Side Rendering and make many HTTP requests that we would like to trace. Unfortunately, we have not been able to find a working example, so we suggest adding one.

Describe the solution you'd like We believe that our code may be a good starting point for a working example (if we get it to work ;-) ), and this issue might also help others.

// compiled to: instrumentation/main.js

import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { core, NodeSDK } from '@opentelemetry/sdk-node';
import { gcpDetector } from '@opentelemetry/resource-detector-gcp';
import { TraceExporter } from '@google-cloud/opentelemetry-cloud-trace-exporter';
import { diag, DiagConsoleLogger, DiagLogLevel, trace } from '@opentelemetry/api';
import { Detector, DetectorSync, IResource, ResourceDetectionConfig } from '@opentelemetry/resources';
import { AlwaysOnSampler } from '@opentelemetry/sdk-trace-base';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-node';

// This was a workaround we found since directly using gcpDetector raised an issue with unresolved promises
function awaitAttributes(detector: DetectorSync): DetectorSync | Detector {
  return {
    async detect(config?: ResourceDetectionConfig): Promise<IResource> {
      const resource = detector.detect(config);
      await resource.waitForAsyncAttributes?.();
      console.log('Detected resource: ', resource);
      return resource;
    },
  };
}

// We tried it with and w/o options, but that did not help
const traceExporterOptions = { projectId: '****', resourceFilter: /^service\./ };
const sdk = new NodeSDK({
  sampler: new AlwaysOnSampler(),
  resourceDetectors: [awaitAttributes(gcpDetector)],
  // That seemed to be the right place 
  traceExporter: new TraceExporter(traceExporterOptions),
  // Adding it here allowed us to wrap it with the BatchSpanProcess 
  spanProcessors: [new BatchSpanProcessor(new TraceExporter(traceExporterOptions))],
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});

diag.setLogger(new DiagConsoleLogger(), core.getEnv().OTEL_LOG_LEVEL || DiagLogLevel.VERBOSE);

try {
  sdk.start();
  diag.info('OpenTelemetry instrumentation started successfully');
} catch (error) {
  diag.error(
    'Error initializing OpenTelemetry SDK. Your application is not instrumented and will not produce telemetry',
    error,
  );
}

/*
  Do something
 */
const tracer = trace.getTracer('test-test', '0.1.0');
const span = tracer.startSpan('Test');
span.end();

process.on('SIGTERM', () => {
  sdk
    .shutdown()
    .then(() => diag.debug('OpenTelemetry SDK terminated'))
    .catch((error) => diag.error('Error terminating OpenTelemetry SDK', error));
});
# .env
NODE_OPTIONS='--require ./instrumentation/main.js'

Function's package.json: We use Rspack to bundle the instrumentation into instrumentation/main.js; however, the Protocol Buffers in @google-cloud/opentelemetry-cloud-trace-exporter cause problems while bundling, so we exclude them from the bundle and provide them as runtime dependencies.

{
  "name": "ssr",
  "engines": {
    "node": "20"
  },
  "main": "server/server.mjs",
  "dependencies": {
    "@google-cloud/opentelemetry-cloud-trace-exporter": "*"
  }
}
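
For reference, excluding the exporter from an Rspack bundle might look like the following (a hypothetical sketch; the actual bundler config is not part of this issue):

```typescript
// rspack.config.ts — hypothetical sketch, not the reporter's real config.
// Marking the exporter as external keeps it out of the bundle so it is
// resolved from node_modules at runtime instead.
export default {
  target: 'node',
  entry: { main: './src/instrumentation.ts' },
  externals: {
    '@google-cloud/opentelemetry-cloud-trace-exporter':
      'commonjs @google-cloud/opentelemetry-cloud-trace-exporter',
  },
};
```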

However, the above solution does not work. The function log contains only the following two entries:

Error: Timeout
    at Timeout.<anonymous> (/workspace/instrumentation/main.js:3:686268)
    at b.<computed> (file:///workspace/server/polyfills.server.mjs:4:12215)
    at s.invokeTask (file:///workspace/server/polyfills.server.mjs:3:7335)
    at ne.runTask (file:///workspace/server/polyfills.server.mjs:3:2678)
    at invokeTask (file:///workspace/server/polyfills.server.mjs:3:8410)
    at E.W.useG.invoke (file:///workspace/server/polyfills.server.mjs:3:8312)
    at E.args.<computed> [as _onTimeout] (file:///workspace/server/polyfills.server.mjs:4:11932)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7)

and then

batchWriteSpans error: 3 INVALID_ARGUMENT: Invalid project resource name projects/[object Promise]

We also do not see any traces in Cloud Trace except, of course, the Load Balancer and Function Call traces. If you have any ideas on what we are doing wrong, we would love to hear them.

Describe alternatives you've considered A working example could also be included in the official GCP docs.

dashpole commented 1 month ago

@aabmass just fixed a bundling problem, could possibly be related.

@psx95 has been looking into OTel export from Cloud Run Functions, so might have some insight

aabmass commented 1 month ago

We use Rspack to bundle the instrumentation in instrumentation/main.js, however the Protocol Buffers in @google-cloud/opentelemetry-cloud-trace-exporter cause problems while bundling, so we exclude them from the bundle and provide them as runtime dependencies.

Bundling issues should be fixed in #404 which I'm hoping to release this week.

// This was a workaround we found since directly using gcpDetector raised an issue with unresolved promises
batchWriteSpans error: 3 INVALID_ARGUMENT: Invalid project resource name projects/[object Promise]

I don't think the awaitAttributes wrapper will help. What was the actual error you were seeing about unresolved promises? The OTel Node BatchSpanProcessor should be lazily awaiting resource attributes.
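
As a self-contained sketch of the lazy-await contract described above (MiniResource, detectSync, and exportSpans are made-up names, not the real OTel interfaces): detect() returns a resource synchronously, and the export path awaits the pending attributes before using them.

```typescript
// Minimal model of a sync detector with async attributes: detect()
// returns immediately, attributes resolve later, and the export path
// awaits them lazily, as the BatchSpanProcessor is expected to do.
interface MiniResource {
  attributes: Record<string, string>;
  waitForAsyncAttributes?: () => Promise<void>;
}

function detectSync(): MiniResource {
  const resource: MiniResource = { attributes: {} };
  // Async attributes resolve after detect() has already returned.
  const pending = Promise.resolve({ 'cloud.provider': 'gcp' }).then((attrs) => {
    Object.assign(resource.attributes, attrs);
  });
  resource.waitForAsyncAttributes = () => pending;
  return resource;
}

async function exportSpans(resource: MiniResource): Promise<Record<string, string>> {
  // Await pending attributes before the first export.
  await resource.waitForAsyncAttributes?.();
  return resource.attributes;
}
```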

The usual issue we see with Cloud Run and Cloud Run Functions is that CPU gets throttled and breaks background exporting. For Cloud Run, we recommend always-allocated CPU, but that still doesn't seem to be available in Cloud Run functions.

Is the service experiencing very low QPS or bursty traffic?
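
One mitigation sometimes used in throttled environments is to flush telemetry inside the request handler rather than relying on a background timer that may never fire once CPU is deallocated. A rough self-contained model of the idea (MiniBatchProcessor and handleRequest are made-up names, not the OTel API):

```typescript
// Sketch of why timer-based batch export can break under CPU throttling:
// the background flush timer may be frozen after the response is sent,
// so the handler flushes explicitly while CPU is still allocated.
class MiniBatchProcessor {
  private queue: string[] = [];
  exported: string[] = [];

  onEnd(spanName: string): void {
    // Normally drained by a periodic background timer; under throttling
    // that timer may never run.
    this.queue.push(spanName);
  }

  async forceFlush(): Promise<void> {
    this.exported.push(...this.queue);
    this.queue = [];
  }
}

async function handleRequest(processor: MiniBatchProcessor): Promise<string> {
  processor.onEnd('GET /');
  // Flush before returning, while the instance still has CPU.
  await processor.forceFlush();
  return 'ok';
}
```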

psx95 commented 1 month ago

Re always-allocated CPU

As @aabmass mentioned, always-allocated CPU is the recommended approach. For some reason this is not available for "Cloud Run Functions" in the Google Cloud Console. However, there is now an option to deploy a function from Cloud Run, and if you create a function from within the "Cloud Run" page, you do have the option to use always-allocated CPU.

It is a little confusing on the UI side, and I imagine there is work going on to reconcile these two options into a single page. Alternatively, you could use this gcloud command to deploy a function with always-allocated CPU:

# This was used to deploy a Java function, so the --source flag might need
# some updates for JS.
# --no-cpu-throttling results in always-allocated CPU.
# --function deploys the service as a Cloud Run function with the given entry point.
gcloud beta run deploy cloud-func-helloworld \
  --no-cpu-throttling \
  --container app-function \
  --function org.example.HelloWorld \
  --source=build/libs \
  --port=8080

CapoD commented 1 month ago

@aabmass:

We use Rspack to bundle the instrumentation in instrumentation/main.js, however the Protocol Buffers in @google-cloud/opentelemetry-cloud-trace-exporter cause problems while bundling, so we exclude them from the bundle and provide them as runtime dependencies.

Bundling issues should be fixed in #404 which I'm hoping to release this week.

I think that bundling is not our issue, it is just a little inconvenience, but I thought I should mention it.

// This was a workaround we found since directly using gcpDetector raised an issue with unresolved promises
batchWriteSpans error: 3 INVALID_ARGUMENT: Invalid project resource name projects/[object Promise]

I don't think the awaitAttributes wrapper will help. What was the actual error you were seeing about unresolved promises? The OTel Node BatchSpanProcessor should be lazily awaiting resource attributes.

At this point we were trying different methods to resolve this error: batchWriteSpans error: 3 INVALID_ARGUMENT: Invalid project resource name projects/[object Promise]

Inspired by: https://github.com/open-telemetry/opentelemetry-js/issues/4638

The usual issue we see with Cloud Run and Cloud Run Functions is that CPU gets throttled and breaks background exporting. For Cloud Run, we recommend always-allocated CPU, but that still doesn't seem to be available in Cloud Run functions.

Is the service experiencing very low QPS or bursty traffic?

We did not deploy the telemetry to production, only to our development environment, so the traffic is one request at a time. The function is configured as shown here: [image]

CapoD commented 1 month ago

@psx95 :

Re always-allocated CPU

As @aabmass mentioned, always-allocated CPU is the recommended approach. For some reason this is not available for "Cloud Run Functions" in the Google Cloud Console. However, there is now an option to deploy a function from Cloud Run, and if you create a function from within the "Cloud Run" page, you do have the option to use always-allocated CPU.

It is a little confusing on the UI side, and I imagine there is work going on to reconcile these two options into a single page. Alternatively, you could use this gcloud command to deploy a function with always-allocated CPU:

# This was used to deploy a Java function, so the --source flag might need
# some updates for JS.
# --no-cpu-throttling results in always-allocated CPU.
# --function deploys the service as a Cloud Run function with the given entry point.
gcloud beta run deploy cloud-func-helloworld \
  --no-cpu-throttling \
  --container app-function \
  --function org.example.HelloWorld \
  --source=build/libs \
  --port=8080

We are not deploying Cloud Run functions directly, we use firebase (cli). Our configuration takes place as follows:

import { onRequest } from 'firebase-functions/v2/https';
import { setGlobalOptions } from 'firebase-functions/v2/options';
import { app } from './app.server';

setGlobalOptions({ region: 'europe-west3', minInstances: 2, concurrency: 8, memory: '4GiB', cpu: 2 });

export const ssrServer = onRequest(app());

There is no option to configure CPU throttling: [image]

This results in the configuration seen in my previous comment.

psx95 commented 1 month ago

@CapoD I see, sorry I missed that.

It turns out there is a feature request for "always-on CPU" for firebase functions gen2 here.

You should signal your interest in the feature request there as well.

aabmass commented 1 month ago

OK, maybe I know what's going on here regarding:

However, the above solution does not work. The function log only writes two lines:

Error: Timeout
    at Timeout.<anonymous> (/workspace/instrumentation/main.js:3:686268)
    at b.<computed> (file:///workspace/server/polyfills.server.mjs:4:12215)
    at s.invokeTask (file:///workspace/server/polyfills.server.mjs:3:7335)
    at ne.runTask (file:///workspace/server/polyfills.server.mjs:3:2678)
    at invokeTask (file:///workspace/server/polyfills.server.mjs:3:8410)
    at E.W.useG.invoke (file:///workspace/server/polyfills.server.mjs:3:8312)
    at E.args.<computed> [as _onTimeout] (file:///workspace/server/polyfills.server.mjs:4:11932)
    at listOnTimeout (node:internal/timers:581:17)
    at process.processTimers (node:internal/timers:519:7)

and then

batchWriteSpans error: 3 INVALID_ARGUMENT: Invalid project resource name projects/[object Promise]

I think this is coming from https://github.com/GoogleCloudPlatform/opentelemetry-operations-js/blob/b64dbe74e3628b8c4d0cce412e62bacfb5473931/packages/opentelemetry-cloud-trace-exporter/src/trace.ts#L94 which isn't related to the OTel resource or resource detection at all.

I think the cryptic error log and stack trace you're seeing may be coming from where that promise is initialized (although the bundler without source maps makes it hard to know for sure): https://github.com/GoogleCloudPlatform/opentelemetry-operations-js/blob/b64dbe74e3628b8c4d0cce412e62bacfb5473931/packages/opentelemetry-cloud-trace-exporter/src/trace.ts#L61-L65
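
As an illustration of how that produces the exact error string (a generic snippet, not the exporter's actual code): a Promise interpolated into a template literal before being awaited stringifies as "[object Promise]".

```typescript
// A not-yet-awaited Promise stringifies as "[object Promise]" when
// interpolated, which would yield the malformed resource name above.
const projectId: Promise<string> = Promise.resolve('my-project');
const resourceName = `projects/${projectId}`; // missing an await
console.log(resourceName); // → "projects/[object Promise]"
```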

However, I know this is working in Cloud Functions gen 2 because we have e2e tests for it. I wonder if it's also caused by CPU throttling with low QPS right after startup.