artilleryio / artillery

The complete load testing platform. Everything you need for production-grade load tests. Serverless & distributed. Load test with Playwright. Load test HTTP APIs, GraphQL, WebSocket, and more. Use any Node.js module.
https://www.artillery.io
Mozilla Public License 2.0

Discrepancies with dynatrace publish metrics plugin #2308

Open Thilaknath opened 10 months ago

Thilaknath commented 10 months ago

Version info:

2.0.0-38

Running this command:

  sh "/home/node/artillery/bin/run version"
  sh "/home/node/artillery/bin/run run --output reports/${testName}.json tests/performance/${testName}.yml"
  sh "/home/node/artillery/bin/run report --output reports/${testName}.html reports/${testName}.json"

I expected to see this happen:

The metrics reported in the report.json to be in sync with the metrics reported to dynatrace so we could plot graphs

Instead, this happened:

The metrics are out of sync. I am also noticing the following warnings and errors in the console.

Warning

  plugin:publish-metrics:dynatrace Dynatrace reporter: WARNING Metric key 'plugins.metrics-by-endpoint./armadillo (armadillo).codes.200' does not meet Dynatrace Ingest API's requirements and will be dropped. More info in the docs (https://docs.art/reference/extensions/publish-metrics#dynatrace). +1ms
  plugin:publish-metrics:dynatrace Dynatrace reporter: WARNING Metric key 'plugins.metrics-by-endpoint./dino (dino).codes.200' does not meet Dynatrace Ingest API's requirements and will be dropped. More info in the docs (https://docs.art/reference/extensions/publish-metrics#dynatrace). +0ms
  plugin:publish-metrics:dynatrace Dynatrace reporter: WARNING Metric key 'plugins.metrics-by-endpoint./pony (pony).codes.200' does not meet Dynatrace Ingest API's requirements and will be dropped. More info in the docs (https://docs.art/reference/extensions/publish-metrics#dynatrace). +0ms
  plugin:publish-metrics:dynatrace Dynatrace reporter: WARNING Metric key 'plugins.metrics-by-endpoint.response_time./armadillo (armadillo)' does not meet Dynatrace Ingest API's requirements and will be dropped. More info in the docs (https://docs.art/reference/extensions/publish-metrics#dynatrace). +0ms
  plugin:publish-metrics:dynatrace Dynatrace reporter: WARNING Metric key 'plugins.metrics-by-endpoint.response_time./dino (dino)' does not meet Dynatrace Ingest API's requirements and will be dropped. More info in the docs (https://docs.art/reference/extensions/publish-metrics#dynatrace). +0ms
  plugin:publish-metrics:dynatrace Dynatrace reporter: WARNING Metric key 'plugins.metrics-by-endpoint.response_time./pony (pony)' does not meet Dynatrace Ingest API's requirements and will be dropped. More info in the docs (https://docs.art/reference/extensions/publish-metrics#dynatrace). +0ms
  plugin:publish-metrics:dynatrace Sending metrics to Dynatrace +0ms

Error

Apdex score: 0.6377952755905512 (poor)
⠧   plugin:publish-metrics:dynatrace Sending event to Dynatrace +165ms
  plugin:publish-metrics:dynatrace Cleaning up +1ms
  plugin:publish-metrics:dynatrace Waiting for pending request ... +0ms
⠸   plugin:publish-metrics:dynatrace There has been an error in sending metrics to Dynatrace:  HTTPError: Response code 400 (Bad Request)
    at Request.<anonymous> (/usr/local/lib/node_modules/artillery/node_modules/got/dist/source/as-promise/index.js:118:42)
    at processTicksAndRejections (node:internal/process/task_queues:96:5) {
  code: 'ERR_NON_2XX_3XX_RESPONSE',
  timings: {
    start: 1700595079319,
    socket: 1700595079319,
    lookup: 1700595079354,
    connect: 1700595079454,
    secureConnect: 1700595079560,
    upload: 1700595079560,
    response: 1700595079911,
    end: 1700595079911,
    error: undefined,
    abort: undefined,
    phases: {
      wait: 0,
      dns: 35,
      tcp: 100,
      tls: 106,
      request: 0,
      firstByte: 351,
      download: 0,
      total: 592
    }
  }
} +431

Files being used:

config:
  target: http://asciiart.artillery.io:8080
  phases:
    - duration: 15
      arrivalRate: 50
      rampTo: 100
      name: Warm up phase
    - duration: 10
      arrivalRate: 20
      rampTo: 50
      name: Ramp up load
#    - duration: 15
#      arrivalRate: 10
#      rampTo: 30
#      name: Spike phase
  plugins:
    ensure: {}
    apdex: {}
    metrics-by-endpoint: {}
    publish-metrics:
      - type: dynatrace
        # DY_API_TOKEN is an environment variable containing the API key
        apiToken: REPLACED-BY-PIPELINE
        envUrl: "https://apm.cf.company.dyna.ondemand.com/e/MASKED"
        prefix: "artillery."
        dimensions:
          - "service:ordService"
          - "test:crteOrd"
          - "host_id:1.2.3.4"
        event:
          title: "Loadtest"
          entitySelector: "type(SERVICE),entityName.equals(MyService)"
          properties:
            - "Tool:Artillery"
            - "Load per minute:100"
            - "Load pattern:development"
  apdex:
    threshold: 100
  ensure:
    thresholds:
      - http.response_time.p99: 100
      - http.response_time.p95: 75
  metrics-by-endpoint:
    useOnlyRequestNames: true

scenarios:
  - flow:
      - get:
          url: "/dino"
          name: dino
      - get:
          url: "/pony"
          name: pony
      - get:
          url: "/armadillo"
          name: armadillo

Also kindly note: I am publishing these metrics from a Jenkins job. When I execute the tests locally, the correct metrics are reported to Dynatrace, but when I run the job from Jenkins, something seems amiss. I have set up the job following the Artillery documentation.

Something to note: when executing locally, I run one test at a time and metrics are published properly. When the Jenkins job runs one test at a time, metrics are also published correctly. The issue only appears when more than one test runs, with one test reporting wrong metrics.

Kindly refer to the screenshots from the Jenkins console and the screenshot showing the metric reported in Dynatrace: Screenshot 2023-11-21 at 2 56 00 PM, Screenshot 2023-11-21 at 2 58 55 PM

Thilaknath commented 10 months ago

@InesNi Kindly help with this.

InesNi commented 10 months ago

Hi @Thilaknath !

Thanks for the thorough report 🙏🏻

Warnings

Regarding the warnings, this is due to the characters in the metric names generated by metrics-by-endpoint. As you can see in the report, even though you've set useOnlyRequestNames, it isn't actually being used in the metric names. The reason is that you've set it outside of the plugins config. Unlike apdex and ensure, the metrics-by-endpoint plugin must have its configuration set under config.plugins.

So if you do it like this, it should work 🤞🏻 :

  plugins:
    metrics-by-endpoint:
      useOnlyRequestNames: true
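
For completeness, here's a sketch of how the config from your script above would look with that change applied (only the placement of useOnlyRequestNames changes; everything else stays as you posted it):

  config:
    plugins:
      ensure: {}
      apdex: {}
      metrics-by-endpoint:
        useOnlyRequestNames: true   # moved here from the top-level config block
      publish-metrics:
        - type: dynatrace
          # ...rest of the dynatrace reporter settings unchanged
    apdex:
      threshold: 100
    ensure:
      thresholds:
        - http.response_time.p99: 100
        - http.response_time.p95: 75

apdex and ensure keep their top-level blocks; only the metrics-by-endpoint options move under config.plugins.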

Metrics discrepancies in parallel runs

Regarding this, we'll investigate it on our side as soon as possible. In the meantime, I have some questions:

Thanks again!

Thilaknath commented 10 months ago

@InesNi Thank you for your response. I will modify my scripts so that metrics-by-endpoint is properly configured. Regarding the metrics sent to Dynatrace:

1) I use the dimensions I set in my test script to filter the metrics for scenarios in the Dynatrace UI. As you can see from my script above, I set the following, and in Dynatrace I split the graph based on the test dimension:

        dimensions:
          - "service:ordService"
          - "test:crteOrd"
          - "host_id:1.2.3.4"

In the image below, the purple line shows the correct metrics for one of my tests, while the yellow line is reporting wrong metrics.

Screenshot 2023-11-23 at 9 24 46 AM

2) Both tests were sending around 4500 requests in total.

3) I am seeing fewer metrics for one of my tests when two tests are executed. Regarding your comment "perhaps the 59 was from a single intermediate report?": I don't think that's the case either, as I searched the JSON manually for the value and did not find that metric reported in any of the intermediate reports.

4) Zooming in doesn't make any difference.

5) There is definitely no delay in getting the metrics; even if I wait after the test completes, I don't see the values reported in report.json. Regarding whether there is an API rate limit in Dynatrace, I will have to check with the central team.

6) Kindly let me know if there is any more information I can provide.

Thilaknath commented 10 months ago

Hello team, is it possible to get an update on this ticket? @hassy Thank you

InesNi commented 10 months ago

Hi @Thilaknath,

Thank you for all the details.

We are still looking into this and will let you know as soon as we have more info. I do have a couple of questions:

Thilaknath commented 10 months ago

Hello @InesNi, yes, there are rate limits, following the specifications here: https://docs.dynatrace.com/docs/extend-dynatrace/extend-metrics#limits

Some more findings from running the tests locally. Below you will see screenshots from my terminal showing the summary, as well as the metrics I see in Dynatrace. (Note: this was while two tests were running in parallel.)

updateTestParallel

parallelRunDynatrace

From these screenshots (we are interested in the yellow-coloured plots): the summary says there were around 6750 http.202.count for the test that was executed, but the graphs show different values.

Note: I also tried running just one test and noticed that, even for a single test, there are 3 different data points in Dynatrace. The sum of all the counts comes close to the summary shown in my terminal, but this messes up the visualization in Dynatrace, as you cannot see how a test degrades or improves over time with new features.

Ideally, for a single run it would be nice to have a single set of metrics published. Another screenshot showing the 3 plots in Dynatrace:

Screenshot 2023-11-30 at 9 44 00 AM

InesNi commented 10 months ago

Hi @Thilaknath 👋🏻

Apologies that it took me a bit to look into this; I've been trying to replicate it using an actual Dynatrace setup 👍🏻 Here are my findings:

Regarding getting a single set of metrics published

The publish-metrics plugin (and its reporters) sends metrics to the platform continuously while the test is running (these are called intermediate reports). This is by design, as it allows you to visualise performance over the course of a test while it is running. That's particularly important for response_time metrics, as it lets you see whether they've gone up over the duration of the test and correlate them with other metrics (e.g. from your own system). For example:

Screenshot 2023-12-04 at 20 55 10

It might be good to use different visualisations for different types of data. For count type metrics, you might want to use the Single value or Top list visualisations, so you can obtain a single value. For example:

Screenshot 2023-12-04 at 20 48 52

Additionally, make sure you are using attributes to slice the data in ways that make it easier to visualise. Things like unique ids (for example to group test runs together), names, commit SHAs, etc., can all help you visualise it better. We leave this sort of decision about how best to visualise the data up to users, as we can't have expertise in all the observability platforms we support. But hopefully the above made sense and helped!
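
To make that concrete, here's a hypothetical sketch based on the dimensions block from the config earlier in the thread, extended with extra attributes for grouping runs; the run_id and commit values are assumptions and would need to be filled in by your pipeline (e.g. from the Jenkins build number and git SHA) before the test runs:

  publish-metrics:
    - type: dynatrace
      apiToken: REPLACED-BY-PIPELINE
      envUrl: "https://apm.cf.company.dyna.ondemand.com/e/MASKED"
      prefix: "artillery."
      dimensions:
        - "service:ordService"
        - "test:crteOrd"
        - "host_id:1.2.3.4"
        # hypothetical extra dimensions for slicing the data in Dynatrace:
        - "run_id:jenkins-build-1234"   # e.g. substituted from the Jenkins build number by the pipeline
        - "commit:abc1234"              # e.g. the git SHA of the build under test

Splitting the charts by a run_id dimension like this would let you compare runs over time without data points from different runs overlapping.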


Regarding missing metrics

I have not been able to replicate this when running 2 tests in parallel. I think there are two things that could be at play here:


Implementing aggregated metrics only

If you still feel that you don't want intermediate reports, and only want the aggregate values to make it to the platform, we could look into eventually implementing that. There has been another feature request for this for a different reporter.

But again, you'll lose out on the information of what happened over the duration of the test.


Hope that helps!

Thilaknath commented 10 months ago

Thank you @InesNi for the detailed response. I will take a look to see how I can create better reporting in this case.