argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0

Can't query datadog logs using analysis template #2628

Open joey100 opened 1 year ago

joey100 commented 1 year ago

Describe the bug

When we use an AnalysisTemplate to query Datadog logs, the AnalysisRun runs, but every measurement value is '[]', even though the same query returns results in Datadog. Querying Datadog metrics with an AnalysisTemplate works fine: the AnalysisRun runs and returns the expected values.

To Reproduce

  1. Define an analysistemplate like below:
    apiVersion: argoproj.io/v1alpha1
    kind: AnalysisTemplate
    metadata:
      name: experimental-test-error-count
    spec:
      args:
      - name: service-name
      - name: env
      - name: version
      metrics:
      - name: latency
        interval: 30s
        successCondition: default(result, 0) <= 100
        failureLimit: 1
        provider:
          datadog:
            interval: 5m
            query: |
              logs("service:experimental-bff status:error").index("*").rollup("count")
  2. Run the rollout with the template.
  3. The analysisrun result is below:
spec:
  metrics:
  - failureLimit: 1
    interval: 30s
    name: latency
    provider:
      datadog:
        interval: 5m
        query: |
          logs("service:experimental-bff status:error").index("*").rollup("count")
    successCondition: default(result, 0) <= 100
  terminate: true
status:
  dryRunSummary: {}
  message: Run Terminated
  metricResults:
  - count: 6
    measurements:
    - finishedAt: "2023-02-27T10:33:04Z"
      phase: Successful
      startedAt: "2023-02-27T10:33:03Z"
      value: '[]'
    - finishedAt: "2023-02-27T10:33:34Z"
      phase: Successful
      startedAt: "2023-02-27T10:33:34Z"
      value: '[]'
    - finishedAt: "2023-02-27T10:34:04Z"
      phase: Successful
      startedAt: "2023-02-27T10:34:04Z"
      value: '[]'
    - finishedAt: "2023-02-27T10:34:34Z"
      phase: Successful
      startedAt: "2023-02-27T10:34:34Z"
      value: '[]'
    - finishedAt: "2023-02-27T10:35:04Z"
      phase: Successful
      startedAt: "2023-02-27T10:35:04Z"
      value: '[]'
    - finishedAt: "2023-02-27T10:35:34Z"
      phase: Successful
      startedAt: "2023-02-27T10:35:34Z"
      value: '[]'
    name: latency
    phase: Successful
    successful: 6
  phase: Successful
  runSummary:
    count: 1
    successful: 1
  startedAt: "2023-02-27T10:33:04Z"

Expected behavior

The AnalysisRun should query Datadog logs successfully and return the actual log results rather than an empty result.

Screenshots

[Screenshots: analysisrun-result, datadog-result]

Version

1.4

Logs

# Paste the logs from the rollout controller

# Logs for the entire controller:
kubectl logs -n argo-rollouts deployment/argo-rollouts

# Logs for a specific rollout:
kubectl logs -n argo-rollouts deployment/argo-rollouts | grep rollout=<ROLLOUTNAME>

Message from the maintainers:

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

alonbehaim commented 1 year ago

@joey100 it seems that currently only metrics are supported, as you can see here. In any case, I recommend moving to Datadog API v2, which is based on api/v2/query/timeseries.

You can also create a logs pipeline in Datadog to generate metrics and use those. Supporting logs directly should probably be a feature request.
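
As a rough sketch of the "generate metrics from logs" workaround, the template from the original report could point at a log-based metric instead of a logs query. This assumes a log-based metric has already been created in Datadog from the `service:experimental-bff status:error` logs, with `service` kept as a tag. The metric name `experimental.error_logs.count` and the metric entry name `error-log-count` are hypothetical and not taken from this thread.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: experimental-test-error-count
spec:
  args:
  - name: service-name
  metrics:
  - name: error-log-count          # hypothetical name for this metric entry
    interval: 30s
    successCondition: default(result, 0) <= 100
    failureLimit: 1
    provider:
      datadog:
        interval: 5m
        # Assumption: "experimental.error_logs.count" is a log-based metric
        # generated in Datadog from the error logs; substitute the real name.
        query: |
          sum:experimental.error_logs.count{service:{{args.service-name}}}.as_count()
```

Because this is an ordinary metrics query, it goes through the same provider path that already works for metrics in the original report.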

joey100 commented 1 year ago

Got it, thanks.

deadlysyn commented 1 year ago

We found the same issue, where metrics is hard-coded. We're still thinking of creative workarounds (thanks for the pipeline idea!), but unless there is a technical reason to avoid it, supporting logs, APM/spans, and other query types would be a useful feature.

Trying to use apiVersion: v2 in a ClusterAnalysisTemplate with argo-rollouts:latest gives an error, even though it seems to match the docs and the code clearly supports it. 🤔

apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: error-rate
spec:
  args:
  - name: dd-service-name
  metrics:
  - name: error-rate
    interval: 1m
    successCondition: default(result, 0) < 1
    failureLimit: 1
    provider:
      datadog:
        apiVersion: v2
        interval: 1m
        query: |
          avg(last_1h):anomalies(sum:trace.graphql.execute.errors{cluster_name:foobah-eks-2022-09,env:production,service:{{args.dd-service-name}}}, 'robust', 4, direction='above', alert_window='last_5m', interval=1, count_default_zero='true')

just removing apiVersion works fine. ~what am i missing?~

apiVersion: v2 is only supported on 1.5.0-rc1.

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 60 days with no activity.