Open rockdaboot opened 1 month ago
Pinging @elastic/obs-knowledge-team (Team:obs-knowledge)
When deploying Universal Profiling via Elastic Agent (so far the most common deployment model), the settings for probabilistic profiling are exposed to and configurable by customers, and have been documented since the release of 8.8.
> If the user asks for profiling data for 10 seconds, how can we know whether this was during an off or on period (or even both)?
From a backend point of view, it is not possible to know when a certain host did not[^1] send data, because its currently selected probabilistic value is below the configured probabilistic threshold.
The described problem of a too-small selected time range sounds similar to me to the visualization problem with the sampling frequency of a single host: the UI allows zooming in between the sampled events. One way to prevent such issues could be to make users visually aware and to stop zooming in (in time) at some point. E.g. the UI could enforce that at least two sampling events could theoretically be displayed, so the limit here depends on the configured sampling frequency and the probabilistic interval.
[^1]: Leaving out here also all the cases where due to network connection issues profiling data is lost.
Thank you, @florianl! I updated parts of the PR description from your information.
> so the limit here depends on the configured sampling frequency and the probabilistic interval.
And on the number of present cores.
Description
The current implementations of the CO2Calculator and the CostCalculator assume a hard-coded sampling frequency of 20 Hz (here and here).
The host agent allows setting values for probabilistic profiling via CLI flags and, since release 8.8, when deploying via Elastic Agent. The ES profiling plugin still assumes that probabilistic profiling is disabled, even when the customer already uses it. In these cases, the values shown in Kibana for CPU usage, CO2 emissions and $ costs will simply be wrong.
To correct this, the fields `profiling.agent.config.probabilistic_interval` and `profiling.agent.config.probabilistic_threshold` from the data stream `profiling-hosts` need to be taken into account.
According to the agent help output for `-probabilistic-threshold`, a value of 100 means "always on", 50 means "on 50% of the time" and 0 means "always off".
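A minimal sketch of how the threshold could be turned into a correction factor, assuming the on/off periods average out over the queried range. The flat dict shape of `host_doc`, the helper names, and the "missing field means always on" default are assumptions for illustration, not the plugin's actual code.

```python
from typing import Optional

THRESHOLD_FIELD = "profiling.agent.config.probabilistic_threshold"

# Assumption: a host document without the field is treated as "always on".
DEFAULT_THRESHOLD = 100


def on_fraction(threshold: int) -> float:
    """Expected fraction of time a host is profiling: 100 -> 1.0, 50 -> 0.5, 0 -> 0.0."""
    if not 0 <= threshold <= 100:
        raise ValueError(f"threshold must be within 0..100, got {threshold}")
    return threshold / 100.0


def correction_factor(host_doc: dict) -> Optional[float]:
    """Factor by which CO2 and cost values would be scaled up for this host.

    Returns None for threshold 0 ("always off"): no data exists to scale.
    """
    threshold = host_doc.get(THRESHOLD_FIELD, DEFAULT_THRESHOLD)
    fraction = on_fraction(threshold)
    if fraction == 0.0:
        return None
    return 1.0 / fraction
```

For example, `correction_factor({THRESHOLD_FIELD: 50})` gives `2.0`, and a document without the field gives `1.0`.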
When querying larger time ranges, the on/off effect evens out (e.g. 50% means we have to adjust the CO2 and cost values by a factor of 1/0.5, or 2).
But the smaller the query time range is, the higher the statistical bias that we introduce. And there is possibly not much we can do about it: if the user asks for profiling data for 10 seconds, how can we know whether this was during an off or on period (or even both)?
So for small time ranges (in relation to the probabilistic interval), this needs some more thought.
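To illustrate the bias, here is a small sketch that computes the fraction of a query window during which a host was actually profiling. It assumes a known on/off schedule and a 60 second probabilistic interval purely for demonstration. The backend does not have this schedule, which is exactly the problem. The function name and the example schedule are hypothetical.

```python
from typing import Sequence


def observed_on_fraction(
    schedule: Sequence[bool], interval_s: float, start_s: float, end_s: float
) -> float:
    """Fraction of the window [start_s, end_s) covered by "on" intervals.

    schedule[i] tells whether the host profiled during
    [i * interval_s, (i + 1) * interval_s).
    """
    if end_s <= start_s:
        raise ValueError("end_s must be greater than start_s")
    on_time = 0.0
    for i, is_on in enumerate(schedule):
        if not is_on:
            continue
        # Overlap between the query window and the i-th "on" interval.
        lo = max(start_s, i * interval_s)
        hi = min(end_s, (i + 1) * interval_s)
        if hi > lo:
            on_time += hi - lo
    return on_time / (end_s - start_s)


# Hypothetical schedule: on in 3 of 6 intervals, i.e. a threshold of 50.
SCHEDULE = [True, False, False, True, True, False]
INTERVAL_S = 60.0
```

Over the full 360 seconds, `observed_on_fraction(SCHEDULE, INTERVAL_S, 0, 360)` is `0.5`, so scaling by 2 is right. For 10 second windows the result is `1.0` for 10..20, `0.0` for 70..80 and `0.5` for 55..65, so a fixed factor of 2 would overestimate, have nothing to scale, or happen to be correct, depending on where the window falls.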