elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.72k stars 8.14k forks source link

[APM][Services] Change log charts from Log Error % to Log Error Rate on overview #190987

Closed iblancof closed 2 weeks ago

iblancof commented 1 month ago

Describe the feature:

We are currently showing Log Rate and Log Error % charts in the overview.

We would like to change Log Error % to Log Error Rate by using count(kql='log.level: "error" OR log.level: "ERROR"') / [PERIOD_IN_MINUTES] (as stated in this comment)

elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

iblancof commented 3 weeks ago

Hi @roshan-elastic!

Could you please provide the text we would like to have in the new chart popover? Thank you!

Screenshot 2024-08-26 at 13 46 00

iblancof commented 3 weeks ago

Test https://github.com/elastic/kibana/issues/191253, it will probably be closed at the same time.

iblancof commented 3 weeks ago

Hey again @roshan-elastic,

While reviewing the code that impacts both the overview charts and the columns in the list, I noticed what seems to be an inconsistency.

Based on my understanding from the meetings and your request in this comment, I had assumed that the information in the list columns and the information in the overview charts was meant to be the same.

In the comment formulas you requested, the PERIOD_IN_MINUTES is always considered. However, from what I see in the current code, it seems is not being used in any case.

Could you please clarify if we should have "per minute" applied in all cases (table columns and charts)? Does this mean that the current state of the code is incorrect and that the Log Rate chart and column should also be considering the PERIOD_IN_MINUTES?

If anything’s unclear, just let me know and we can have a call.

Thanks in advance!

cc: @kpatticha (Git Blame suggests that you might have some context on this)

List of services

Column "Log rate (per min)" popover info

Rate of logs per minute observed for given service.name.

Formula Calculation:
count(kql='log.level: *') / [PERIOD_IN_MINUTES]

Column "Log error %" popover info

% of logs where error detected for given service.name.

Formula Calculation:
count(kql='log.level: "error" OR log.level: "ERROR"') / count(kql='log.level: *')

Code calculations (?) These don’t seem to be used, but we are using the ones referenced in the charts, which do not include anything related to minutes.

Screenshot 2024-08-26 at 17 51 39

Service overview

Chart "Log rate" popover info

Rate of logs per minute observed for given service.name.

Formula Calculation:
count(kql='log.level: *') / [PERIOD_IN_MINUTES]

Code calculations

Screenshot 2024-08-26 at 17 52 11

Chart "Log error %" popover info

% of logs where error detected for given service.name.

Formula Calculation:
count(kql='log.level: "error" OR log.level: "ERROR"') / count(kql='log.level: *')

Code calculations

Screenshot 2024-08-26 at 17 51 51
kpatticha commented 3 weeks ago

In the https://github.com/elastic/kibana/issues/189923#issuecomment-2306841788 formulas you requested, the PERIOD_IN_MINUTES is always considered. However, from what I see in the current code, it seems is not being used in any case.

@iblancof PERIOD_IN_MINUTES is referred to to the timerange the user selects.

Code calculations (?) These don’t seem to be used, but we are using the ones referenced in the charts, which do not include anything related to minutes

Initially for the the detailed statistics, ex sparklines, we fetched APM and Logs separately. We updated to use EEM https://github.com/elastic/kibana/pull/189873 https://github.com/elastic/kibana/issues/189733#issuecomment-2265501014

IMHO, as for the per min we need to be consistent for the in the columns if we're planning to update Log Error % to calculate the Log error rate.

For the chart though, we don't display the chart per minute. The bucket size is determined dynamically based on the timerange. For example, if you select 7 days, it doesn't make sense to display the chart per minute (7 24 60 ) but instead per day.

https://github.com/elastic/kibana/blob/9fa7fd5a70fab3bd61a34cfeb3434eda8757b395/x-pack/plugins/observability_solution/apm/server/routes/entities/get_service_entities_history_timeseries.ts#L38-L42

roshan-elastic commented 3 weeks ago

Hey @iblancof @kpatticha - looks like we might be on the same page here?

I'm not too worried about the exact formula - as long as the rate is 'per minute' and we do it consistently between the log rate and log error rate...all good!

Let me know if we need to discuss/confirm things here...(can be a topic for tomorrow's weekly ECO call as well if it's easier)

iblancof commented 3 weeks ago

Thank you both for your responses!

I'll be meeting with @kpatticha this morning to clear up a few things I may have misunderstood due to my limited knowledge in this area. If anything remains unclear afterward, we can address it in the weekly meeting as @roshan-elastic suggested.

kpatticha commented 3 weeks ago

Just confirmed with Kevin that the LogRate is per minute

iblancof commented 3 weeks ago

As discussed in the weekly meeting, we will keep the intervals as they are now. We just need to change the calculations from Log Error % to Log Error Rate.

@roshan-elastic, could you please provide the text for the popovers and titles? Thanks!

roshan-elastic commented 3 weeks ago

Sure thing @iblancof

Log Rate (per minute)

Rate of logs per minute observed for given service.name.

Formula Calculation: count() / [PERIOD_IN_MINUTES]

Log Error Rate (per minute)

Rate of error logs per minute observed for given service.name.

Formula Calculation: count(kql='log.level: "error" OR log.level: "ERROR" OR error.log.level : 'error') / [PERIOD_IN_MINUTES]

roshan-elastic commented 3 weeks ago

@iblancof - sorry, I've just realised that the old error rate copy was wrong. I think the actual formulae being used might have been wrong too.

We want to count APM error logs too (which would be error.log.level : error):

(I thought this was the logic we previously used but I have feeling it got missed somehow)

iblancof commented 3 weeks ago

All good, thank you @roshan-elastic!