GoogleCloudPlatform / opentelemetry-operations-go

Apache License 2.0

Offline Queuing of Logs and Metrics #812

Closed: skrawn closed this issue 7 months ago

skrawn commented 7 months ago

I am trying to deploy the OpenTelemetry Collector on some embedded Linux devices. Because these devices are on cellular connections, they cannot always reach the internet, but I'd still like to queue metrics and logs for upload once a device re-establishes its connection. Am I correct that this will work with the sending_queue, retry_on_failure, and file_storage components, like this:

extensions:
  file_storage:
    directory: /etc/otelcol/offline
    compaction:
      on_start: true
      directory: /etc/otelcol/offline
      max_transaction_size: 65_536
    fsync: true

exporters:
  googlemanagedprometheus:
    metric:
      compression: gzip
    retry_on_failure:
      enabled: true
      max_elapsed_time: 86400s
    sending_queue:
      enabled: true
      storage: file_storage
      num_consumers: 1
      queue_size: 1000

The reason I ask is that retry_on_failure is not implemented on the Google Managed Prometheus (GMP) exporter, so requests that time out because of network failures result in metrics being discarded. If the retry_on_failure component works as expected, I'll probably try to implement it for the GMP exporter. I also see that there are some problems with retry_on_failure depending on the exporter, like this one for the Google Cloud exporter, so maybe there is some limitation within the collector that I am not aware of?
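
For reference, a persistent sending_queue only takes effect if the storage extension it names is also enabled in the service section. The fragment below is a minimal sketch of that wiring, not a complete config: it assumes an otlp receiver is defined elsewhere in the file, and the pipeline layout is illustrative only.

```yaml
service:
  # The extension referenced by `sending_queue.storage` must be listed here,
  # otherwise the collector cannot resolve it at startup.
  extensions: [file_storage]
  pipelines:
    metrics:
      receivers: [otlp]                      # assumed receiver, defined elsewhere
      exporters: [googlemanagedprometheus]
```

Whether the queue actually survives an outage still depends on the exporter's own retry behavior, which is what the replies below address.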

damemi commented 7 months ago

The issues with using retry_on_failure in the Google Cloud exporter would be the same with using it in the GMP exporter. For handling network outages, we would probably want to enable the write-ahead-log option that the GCP metrics exporter has (but this requires local storage for the WAL file).
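
A rough sketch of what enabling that option might look like follows. The key names (experimental_wal_config, directory, max_backoff) are written from memory of the googlecloud exporter README and should be treated as assumptions to verify against the version in use; the directory path is hypothetical.

```yaml
exporters:
  googlecloud:
    metric:
      # Assumed key names; confirm against the exporter README for your version.
      experimental_wal_config:
        directory: /var/lib/otelcol/wal   # hypothetical path; must be on writable local storage
        max_backoff: 1h                   # assumed: upper bound on retrying after network errors
```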

skrawn commented 7 months ago

Oh I see, I didn't notice the GC exporter had this. I may just be able to use that...

dashpole commented 7 months ago

The googlemanagedprometheus and googlecloud exporters have their own intelligent retry mechanisms built in. Enabling retry_on_failure would add a second layer of retries, and it would also retry requests that are guaranteed to fail (it isn't as smart as the built-in retry). This can cause additional problems, which is why we've removed the retry_on_failure helper from the exporter.

skrawn commented 7 months ago

I see, I appreciate the context. I'll close this issue and work with the WAL options.