GoogleCloudPlatform / opentelemetry-operations-python

OpenTelemetry Python exporters for Google Cloud Monitoring and Trace
https://google-cloud-opentelemetry.readthedocs.io/en/stable/
Apache License 2.0

Running on multiple AppEngine instances #219

Open mjvankampen opened 1 year ago

mjvankampen commented 1 year ago

I've been setting up OTel metrics with Google Cloud Monitoring for our Django app running on AppEngine.

I used code that looked like this:

# env() and GOOGLE_CLOUD_PROJECT come from our Django settings.
from opentelemetry import metrics, trace
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Cloud Trace span exporter used by the span processor below.
trace_exporter = CloudTraceSpanExporter(project_id=GOOGLE_CLOUD_PROJECT)
metrics_exporter = CloudMonitoringMetricsExporter(
    project_id=GOOGLE_CLOUD_PROJECT, add_unique_identifier=True
)
metric_reader = PeriodicExportingMetricReader(
    exporter=metrics_exporter, export_interval_millis=5000
)

resource = Resource.create(
    {
        "service.name": env("GAE_SERVICE", default="cx-api"),
        "service.namespace": "Our Platform",
        "service.instance.id": env("GAE_INSTANCE", default="local"),
        "service.version": env("GAE_VERSION", default="local"),
    }
)
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(tracer_provider)

metrics.set_meter_provider(
    MeterProvider(
        metric_readers=[metric_reader],
        # As GCP only allows writing to a time series once per second, we need to
        # make sure every instance gets a different series by setting a unique
        # instance id.
        resource=resource,
    )
)

I thought that by using a unique instance ID I would get around the "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric" error. But it seems I really need to use add_unique_identifier=True. While I understand this for multithreaded applications or multiple exporters for the same resource, I don't understand why it is needed if the resource is already unique.

mjvankampen commented 1 year ago

Even with add_unique_identifier set to true, the error still pops up sometimes.

mjvankampen commented 1 year ago

I think I found out why it still happens sometimes: it coincides with a scale-down of an AppEngine instance. That kind of makes sense if metrics are flushed on shutdown.

aabmass commented 1 year ago

Is your app being run in multiple processes, e.g. with gunicorn's pre-fork model?

If not, then the final flush on shutdown is the likely culprit. The shortest allowed interval for publishing metrics to Cloud Monitoring is 5s. If your app flushes metrics before shutdown and the previous export ran within the last 5s, you can see this error. Do you see any issues in your dashboards/metrics?
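
For illustration, here is a sketch of how the export interval interacts with that 5s floor (the 60s value below is arbitrary, not a documented recommendation): with a 5000ms interval the previous periodic export has always run within the last 5s, so a final flush on shutdown will always collide, while a longer interval leaves only a small window.

from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Export once per minute instead of every 5 s: a shutdown flush then only
# collides if it happens within 5 s of the most recent periodic export.
metric_reader = PeriodicExportingMetricReader(
    exporter=CloudMonitoringMetricsExporter(add_unique_identifier=True),
    export_interval_millis=60000,
)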

nsaef commented 1 year ago

Hey, I hope it's okay to latch onto this question. What would the answer be if the app were run with gunicorn? I'm currently testing exporting metrics to Cloud Monitoring, and during local development (with a single-process Flask development server) everything works fine. But when I deploy to Cloud Run (gunicorn with multiple workers), I frequently get "One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric".

I'm setting up the reader/exporter with add_unique_identifier=True, like this:

from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

PeriodicExportingMetricReader(
    CloudMonitoringMetricsExporter(add_unique_identifier=True),
    export_interval_millis=5000,
)

Any tips on how to avoid this? Thanks a lot!

kendra-human commented 1 year ago

Hi! Thanks for the lib. I see a similar issue:

One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point

This is a uvicorn app on Cloud Run with 2 containers, running 2 workers each.

A minimal repro of the code:

from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.resourcedetector.gcp_resource_detector import GoogleCloudResourceDetector
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# gcp_project_id comes from our app config.
exporter = CloudMonitoringMetricsExporter(project_id=gcp_project_id)
reader = PeriodicExportingMetricReader(exporter)
detector = GoogleCloudResourceDetector(raise_on_error=True)
provider = MeterProvider(metric_readers=[reader], resource=detector.detect())
meter = provider.get_meter(name="api")
# The meter object is injected into multiple classes, each of which creates its own instruments, e.g.:
latency = meter.create_histogram(name="api.latency", unit="ms")

aabmass commented 1 year ago

I'd really like to fix this in an automatic way, but I'm not sure of the best way to go about it. Are you able to defer the initialization of the OpenTelemetry MeterProvider until after the workers start (post-fork)?
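
For example, with gunicorn (including gunicorn running uvicorn workers) that deferral can be done from a post_fork hook in gunicorn.conf.py. A rough sketch, with the service name and instance-id scheme as placeholders:

# gunicorn.conf.py
import os

from opentelemetry import metrics
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import SERVICE_INSTANCE_ID, Resource


def post_fork(server, worker):
    # Runs inside each worker process right after the fork, so every worker
    # gets its own exporter, its own MeterProvider, and its own time series.
    resource = Resource.create(
        {
            "service.name": "my-api",  # placeholder
            SERVICE_INSTANCE_ID: f"worker-{os.getpid()}",
        }
    )
    reader = PeriodicExportingMetricReader(
        CloudMonitoringMetricsExporter(add_unique_identifier=True),
        export_interval_millis=5000,
    )
    metrics.set_meter_provider(
        MeterProvider(metric_readers=[reader], resource=resource)
    )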

nioncode commented 4 weeks ago

We run on Cloud Run and also get the error "Points must be written in order. One or more of the points specified had an older start time than the most recent point".

This only gets fixed by using CloudMonitoringMetricsExporter(add_unique_identifier=True). This seems weird, since our resource should already be set up with unique labels: we have 1) the Cloud Run instance id and 2) the process id:

import os

from opentelemetry import metrics
from opentelemetry.exporter.cloud_monitoring import CloudMonitoringMetricsExporter
from opentelemetry.resourcedetector.gcp_resource_detector import GoogleCloudResourceDetector
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import SERVICE_INSTANCE_ID, Resource, get_aggregated_resources

resource = get_aggregated_resources(
    [GoogleCloudResourceDetector(raise_on_error=True)],
    initial_resource=Resource.create(attributes={
        SERVICE_INSTANCE_ID: f"worker-{os.getpid()}",
    }),
)

gcp_monitoring_exporter = CloudMonitoringMetricsExporter(
    add_unique_identifier=True
)
metric_reader = PeriodicExportingMetricReader(gcp_monitoring_exporter)
meter_provider = MeterProvider(
    metric_readers=[metric_reader],
    resource=resource,
)
metrics.set_meter_provider(meter_provider)

Is it expected that you always need to set add_unique_identifier=True? How do we know when it is really required, and does it do any harm?