elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Profiling] Support probabilistic profiling #111597

Open rockdaboot opened 1 month ago

rockdaboot commented 1 month ago

Description

The current implementations of the CO2Calculator and the CostCalculator assume a hard-coded sampling frequency of 20Hz.

The host agent has allowed setting values for probabilistic profiling via CLI flags, and when deploying via Elastic Agent, since release 8.8. The ES profiling plugin still assumes that probabilistic profiling is disabled, even when the customer already uses it. In these cases, the values shown in Kibana for CPU usage, CO2 generation and $ costs will be wrong.

To fix this, the fields profiling.agent.config.probabilistic_interval and profiling.agent.config.probabilistic_threshold from the data stream profiling-hosts need to be taken into account.

From the agent help output for -probabilistic-threshold:

If set to a value between 1 and 99 will enable probabilistic profiling:
every probabilistic-interval a random number between 0 and 99 is
chosen. If the given probabilistic-threshold is greater than this
random number, the agent will collect profiles from this system for
the duration of the interval.

So a value of 100 means "always on", 50 means "on 50% of the time" and 0 means "always off".
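The decision rule quoted from the help output can be sketched in a few lines. This is only an illustration of the described behaviour, not the agent's actual code; the function names `profiling_enabled_for_interval` and `on_fraction` are hypothetical.

```python
import random


def profiling_enabled_for_interval(threshold: int, rng: random.Random) -> bool:
    """Decide whether profiling is on for one probabilistic interval."""
    # One random number between 0 and 99 is drawn per interval.
    draw = rng.randrange(100)
    # Profiling is on only if the threshold is greater than the draw.
    return threshold > draw


def on_fraction(threshold: int, intervals: int, seed: int = 0) -> float:
    """Simulate many intervals and return the fraction spent profiling."""
    rng = random.Random(seed)
    on = sum(profiling_enabled_for_interval(threshold, rng) for _ in range(intervals))
    return on / intervals
```

Simulating many intervals with a threshold of 50 should give an on-fraction close to 0.5, while 100 and 0 give exactly 1.0 and 0.0.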

When querying larger time ranges, the on/off effect evens out (e.g. 50% means we have to adjust the CO2 and cost values by a factor of 1/0.5, i.e. 2).
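A minimal sketch of that adjustment, assuming a single host with one constant threshold over the queried range. The names `probabilistic_scale_factor` and `scale_estimate` are hypothetical and not part of the ES profiling plugin.

```python
def probabilistic_scale_factor(threshold: int) -> float:
    """Factor to multiply observed values by to compensate for off periods."""
    # A threshold of 0 means "always off": there is nothing to scale up.
    if not 1 <= threshold <= 100:
        raise ValueError("threshold must be between 1 and 100")
    # The host profiles threshold/100 of the time on average.
    return 100.0 / threshold


def scale_estimate(observed_value: float, threshold: int) -> float:
    """Scale an observed CO2 or cost value to an always-on estimate."""
    return observed_value * probabilistic_scale_factor(threshold)
```

With a threshold of 50 the factor is 2.0, and with 100 (always on) it is 1.0, so values stay unchanged when probabilistic profiling is effectively disabled.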

But the smaller the query time range is, the larger the statistical error that we introduce. And there is possibly not much we can do about it: if the user asks for profiling data for 10 seconds, how can we know whether this was during an off or on period (or even both)?

So for small time ranges (in relation to the probabilistic interval), this needs more thought.
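To get a rough feeling for how the error grows with shrinking time ranges, one can treat each probabilistic interval as an independent on/off trial. The sketch below rests on that assumption and on a simplified model in which a query range shorter than one interval counts as a single trial; the function name `relative_std_error` is hypothetical.

```python
import math


def relative_std_error(query_seconds: float, interval_seconds: float, threshold: int) -> float:
    """Relative standard error of the scaled estimate for one host."""
    if query_seconds <= 0 or interval_seconds <= 0:
        raise ValueError("durations must be positive")
    if not 1 <= threshold <= 100:
        raise ValueError("threshold must be between 1 and 100")
    p = threshold / 100.0  # probability that an interval is "on"
    # Number of intervals covered by the query; at least one trial.
    n = max(1.0, query_seconds / interval_seconds)
    # Std of the observed on-fraction is sqrt(p * (1 - p) / n); divide by p.
    return math.sqrt((1.0 - p) / (p * n))
```

Under this model, a one-hour query with a 60-second interval and a threshold of 50 has a relative error of about 13%, while a 10-second query has a relative error of 100%.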

### High-level tasks
- [ ] Find a way to either prevent the user from selecting a time range that significantly affects the accuracy, or warn about it (with and without probabilistic profiling).
- [ ] Use the probabilistic settings for the CO2 and cost calculations in the ES profiling plugin (ignoring any accuracy considerations, as these are handled by the previous task).

elasticsearchmachine commented 1 month ago

Pinging @elastic/obs-knowledge-team (Team:obs-knowledge)

florianl commented 1 month ago

When deploying Universal Profiling via Elastic Agent (so far the most common deployment model), the settings for probabilistic profiling have been exposed to customers, configurable, and documented since the release of 8.8.

> If the user asks for profiling data for 10 seconds, how can we know whether this was during an off or on period (or even both)?

From a backend point of view, it is not possible to know when a certain host did not[^1] send data because the random value it drew for the current interval was not below the configured probabilistic threshold.

The problem of a too-small selected time range sounds similar to me to the visualization problem with the sampling frequency of a single host: the UI allows zooming in between the sampled events. One way to prevent such issues could be to make users visually aware and to stop zooming in (in time) at some point. E.g. the UI could enforce that at least two sampling events can theoretically be displayed, so the limit here depends on the configured sampling frequency and the probabilistic interval.
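One possible reading of this rule, as a sketch only: the smallest allowed time range is twice the coarsest granularity in play, that is, two sampling periods, or two probabilistic intervals when probabilistic profiling is enabled. The function name `min_zoom_window_seconds` and the factor of two applied to the interval are assumptions, not something the UI implements today.

```python
from typing import Optional


def min_zoom_window_seconds(
    sampling_frequency_hz: float,
    probabilistic_interval_seconds: Optional[float] = None,
) -> float:
    """Smallest time range the UI would allow under the assumed rule."""
    if sampling_frequency_hz <= 0:
        raise ValueError("sampling frequency must be positive")
    # Two sampling events need at least two sampling periods.
    window = 2.0 / sampling_frequency_hz
    if probabilistic_interval_seconds is not None:
        if probabilistic_interval_seconds <= 0:
            raise ValueError("probabilistic interval must be positive")
        # Assumption: with probabilistic profiling, span two intervals.
        window = max(window, 2.0 * probabilistic_interval_seconds)
    return window
```

At 20Hz without probabilistic profiling this gives 0.1 seconds; with a 60-second probabilistic interval it gives 120 seconds.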

[^1]: This also leaves aside all cases where profiling data is lost due to network connection issues.

rockdaboot commented 1 month ago

Thank you, @florianl! I updated parts of the PR description based on your information.

rockdaboot commented 1 month ago

> so the limit is here on the configured sampling frequency and the probabilistic interval.

And on the number of present cores.