elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

Investigate nightly benchmarks 0 events/s issue #13738

Open carsonip opened 3 months ago

carsonip commented 3 months ago

Nightly benchmarks occasionally report 0 events/s. Investigate the root cause.

lahsivjar commented 3 months ago

Status update

The first thing I looked at was what the failing benchmark runs actually reported. Here are two benchmark runs for comparison:

  1. Run with events/sec metric populated - Link to APM-Server logs - Link to deployment
  2. Run without events/sec metric populated - Link to APM-Server logs - Link to deployment

Both runs show a 500 internal error; however, the logs for the 0 events/sec run additionally show data validation errors caused by an unexpected EOF. These errors appear to be logged from here. This could be an issue with our sender. The most intriguing part, though, is why only a subset of the delta metrics is reported as 0. For example, in the run linked above, txn/sec and metrics/sec are reported correctly, whereas the other delta metrics are reported as zero.

I have tried reproducing the errors locally but haven't succeeded. (Note that the expvar metrics collection is designed for benchtimes on the order of minutes, so if you test locally, make sure the benchtime is long enough for the expvar metrics to work correctly.) I did see some special handling in the expvar metric collection, but nothing that explains this bug.

I have also created a PR to log errors from the expvar endpoint, which were not logged before. I am not sure how helpful it will be, though.

simitt commented 2 weeks ago

Is this still happening?

rubvs commented 1 week ago

@simitt I had this happen to me in a run on GH Actions last week; see this Slack thread: https://elastic.slack.com/archives/C95SB62AG/p1729263104854879