GoogleCloudPlatform / opentelemetry-operations-java


[Question] Getting "One or more TimeSeries could not be written" when attempting to emit metrics following the Java MetricsExporter example #339


ZiggyZaggyy commented 4 months ago

Hi, I am using the exporter in a Cloud Function, and I'm seeing the following two errors in the logs, particularly after any new deployment of the function:

One or more TimeSeries could not be written: Points must be written in order. One or more of the points specified had an older start time than the most recent point

or

One or more TimeSeries could not be written: One or more points were written more frequently than the maximum sampling period configured for the metric.: generic_node{namespace:,node_id:,location:global} timeSeries[0,1]

The rough idea of my metrics and exporter configuration is as follows:

import com.google.cloud.opentelemetry.metric.GoogleCloudMetricExporter;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.export.MetricExporter;
import io.opentelemetry.sdk.metrics.export.PeriodicMetricReader;
import io.opentelemetry.sdk.resources.Resource;

import java.util.HashMap;
import java.util.Map;

public class MetricService {
    private static MetricService instance;

    private final Meter meter;
    private final SdkMeterProvider meterProvider;
    private final Map<String, LongCounter> counters;

    private MetricService() {
        MetricExporter metricExporter = GoogleCloudMetricExporter.createWithDefaultConfiguration();
        meterProvider = SdkMeterProvider.builder()
            // The resource attributes below were added manually later to see
            // if they help with the exception issue. MetadataConfig is a
            // helper in my project that reads the instance metadata.
            .setResource(Resource.create(Attributes.of(
                AttributeKey.stringKey("cloud.availability_zone"), "us-east1",
                AttributeKey.stringKey("service.namespace"), "XYZ",
                AttributeKey.stringKey("service.name"), "abcd",
                AttributeKey.stringKey("service.instance.id"), MetadataConfig.getInstanceId()
            )))
            // Export every 20 seconds on a background thread.
            .registerMetricReader(
                PeriodicMetricReader.builder(metricExporter)
                    .setInterval(java.time.Duration.ofSeconds(20))
                    .build())
            .build();

        meter = meterProvider
            .meterBuilder("XYZ-metric-service")
            .setInstrumentationVersion("1.0.0")
            .build();
        counters = new HashMap<>();
    }

    public static synchronized MetricService getInstance() {
        if (instance == null) {
            instance = new MetricService();
        }
        return instance;
    }

    public void increment(final String metricName, final long count) {
        // Lazily create one counter per metric name and reuse it.
        counters.computeIfAbsent(metricName, name -> meter.counterBuilder(name).build())
            .add(count);
    }

    public void increment(final String metricName) {
        increment(metricName, 1);
    }

    public void flush() {
        meterProvider.forceFlush();
    }
}

The Cloud Function is scheduled to run every minute and is guaranteed to finish within that minute; in fact, in most cases it finishes within the first several seconds. But the scheduled metric reader apparently keeps firing on its own thread every 20s even after the main program has finished its work.
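For context, the entry point looks roughly like this (a simplified sketch; the class name and metric name here are placeholders, not my real ones):

import com.google.cloud.functions.HttpFunction;
import com.google.cloud.functions.HttpRequest;
import com.google.cloud.functions.HttpResponse;

public class ScheduledJob implements HttpFunction {
    @Override
    public void service(HttpRequest request, HttpResponse response) throws Exception {
        MetricService metrics = MetricService.getInstance();
        // ... the actual work, usually done within a few seconds ...
        metrics.increment("jobs.completed");
        // Flush before returning; the 20s PeriodicMetricReader keeps running
        // on its own thread regardless.
        metrics.flush();
        response.getWriter().write("done");
    }
}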

The errors happen most frequently within the first 15 minutes after a new deployment. Typically about 3 or 4 such exceptions are thrown per execution: 3 from the scheduled MetricReader firing every 20s within the 60s window before the next execution takes over, and 1 more from the forceFlush() I call from shutdown() inside the main execution body. Sometimes, after running for a while, the frequency drops to one exception per execution, that one coming from forceFlush() before shutdown.

I'm not sure it actually fixed anything, but adding the manual resource attributes shown in the constructor above didn't drive the exceptions away either; it did consistently cap them at one exception per execution (the one on forceFlush()), whereas without those attributes the frequency sometimes dropped after a while and sometimes didn't.

Also, the error type seems to alternate between "One or more points were written more frequently than the maximum sampling period" and "One or more of the points specified had an older start time than the most recent point" across deployments (possibly related to slight changes in metric setup or dependency configuration, but probably not; I can try to isolate further if that seems like a useful troubleshooting route).

If I change the periodic metric reading interval from 20s to 120s, I no longer see the exceptions. Because my Cloud Function is guaranteed to finish within one minute, setting the interval to 120s means metrics are only emitted during forceFlush(). So it seems the exceptions come from conflicts between the scheduled metric reporting runs and forceFlush()? I'd have expected the exporter to handle multiple consecutive exports for the same metric name and instance id, but it looks like every export invokes createTimeSeries, and according to Google's docs you can't call that multiple times for the same metric name on the same instance within one sampling period.
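In other words, the only configuration that avoids the errors for me is effectively flush-only export, roughly like this (resource and metricExporter stand for the same values as in the constructor above):

// Interval longer than the 60s execution window, so in practice only the
// forceFlush() before shutdown triggers an export.
meterProvider = SdkMeterProvider.builder()
    .setResource(resource)
    .registerMetricReader(
        PeriodicMetricReader.builder(metricExporter)
            .setInterval(java.time.Duration.ofSeconds(120))
            .build())
    .build();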

As for the metric data itself, it does look like it's being published just fine: comparing it side by side with my logging-based metrics, the two largely align, even during executions with 3 or 4 exceptions. So I'm not sure there is any actual metric loss. Of course the two won't align exactly due to accounting differences and some general background loss, so I can't tell whether the exceptions account for the very minor discrepancies; suffice to say the difference, if any, is minimal.

Other than the example usage in the MetricExporter docs, I can't find more examples to help me understand the root cause of these exceptions. Internet searches typically point to using the Cloud Monitoring API directly, or to multiple instances being confused with one another, but in our case it's one Cloud Function and one instance, with no overlap, and I even tried specifying an execution-specific instance.id on the resource, as shown above. Hoping you can help with some insights, thank you!

P.S. I found this issue about adding a unique identifier in the Java SDK, but it was abandoned because they considered setting service.instance.id on the SdkMeterProvider to be equivalent. As shown above, I tried that, but it didn't make the exceptions go away.
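If it matters, the variant I'd try next (just a sketch of my own, not something from the linked issue) is generating a fresh service.instance.id per cold start, so consecutive instances can never collide on the same time series:

// Hypothetical: a unique id per process/cold start instead of
// MetadataConfig.getInstanceId().
.setResource(Resource.create(Attributes.of(
    AttributeKey.stringKey("service.instance.id"),
    java.util.UUID.randomUUID().toString()
)))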

aabmass commented 1 month ago

@psx95 can you take a look?