elastic / apm-server

https://www.elastic.co/guide/en/apm/guide/current/index.html

"Output Events Rate" in stack monitoring is always zero #8383

Open axw opened 2 years ago

axw commented 2 years ago

APM Server version (apm-server version): 8.3.0-BC4

Description of the problem including expected versus actual behavior:

"Output Events Rate" in stack monitoring is always zero.

Steps to reproduce:

  1. Start 8.3.0-BC4 with stack monitoring enabled.
  2. Send some events, check that they show up in the APM UI.
  3. Navigate to stack monitoring, observe the "Output Events Rate" chart is always reporting zero.


axw commented 2 years ago

Hmm, I just reconfigured the integration with expvar enabled, and now it's working. Maybe there's a race condition?

axw commented 2 years ago

Happened again after upgrading from 8.2.3 to 8.3.0-BC4. Initially the output was zero; after reconfiguring the integration (this time changing the event rate limit), the output went non-zero.

axw commented 2 years ago

This is apparently still an issue, at least in system tests, as seen here:

https://apm-ci.elastic.co/blue/organizations/jenkins/apm-server%2Fapm-server-mbp%2FPR-9014/detail/PR-9014/1/pipeline/

lahsivjar commented 1 year ago

I haven't been able to reproduce this exact error. However, due to the way our instrumentation works, it is possible that after a reload event the old modelindexer is still receiving data while the instrumentation has already moved to the new modelindexer. This is because we wait for the old modelindexer to shut down gracefully, but we switch the monitoring to the new modelindexer before the old one exits.

As a result, the instrumentation data will report 0 until the old indexer shuts down.

simitt commented 1 year ago

Moving this to the backlog since we haven't spent more time recently tracking this down.

tegenterter commented 1 year ago

It appears that this bug led to an incident (https://github.com/elastic/cloud/issues/110723) and should be prioritized.

simitt commented 1 year ago

Moved it into the 8.7 milestone again so that it can be picked up and we can verify whether this is still a bug in current versions.

axw commented 1 year ago

I don't recall if this has already been ruled out, but I realise now that I never wrote down on this issue a possible contributing factor: every time we reconfigure the server, we create a new libbeat monitoring registry: https://github.com/elastic/apm-server/blob/32a167b81356e19e9e173bb58a0503eea5e80e3d/internal/beater/beater.go#L628

lahsivjar commented 1 year ago

Hmm, nice catch. I don't remember any conversation around this so I think this hasn't been ruled out.

endorama commented 1 year ago

I was looking at this today and I have 2 questions:

  1. how can I send some test data?
  2. my first hint at this would be to try reusing the libbeatMonitoringRegistry instead of creating it anew like it is done for the output registry https://github.com/elastic/apm-server/blob/32a167b81356e19e9e173bb58a0503eea5e80e3d/internal/beater/beater.go#L634-L639 What do you think?

axw commented 1 year ago

how can I send some test data?

You could use https://github.com/elastic/apm-server/tree/main/systemtest/cmd/sendotlp to send test data to APM Server

my first hint at this would be to try reusing the libbeatMonitoringRegistry instead of creating it anew like it is done for the output registry

You could try, but I don't think that will work. There are assumptions about there being a 1:1 relationship between metrics and outputs, e.g. here: https://github.com/elastic/apm-server/blob/98806224092aa9646d2cf8466517b0955e8476b6/internal/beater/beater.go#L688-L696