elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana
Other
19.71k stars 8.13k forks source link

Service entity overview only shows logs that have `log.level` present. #189923

Closed smith closed 2 days ago

smith commented 1 month ago

189389 specifies that we should hide logs that don't have a log.level attribute:

We require log.level vs any logs in the logs-* index to filter out a lot of noise from APM likely to relating failed transactions (to avoid duplication) and general unhelpful noise.

log.level is not marked as a required field in the Elastic Common Schema Log Fields specification. log.level is also not present in OpenTelemetry logs unless the field is added in the collector.

If we have a reason to filter out logs from APM errors (I'm not sure we do, other than "noise") we should explicitly filter out APM errors instead of requiring an arbitrary that happens to match.

✔️ Acceptance criteria

elasticmachine commented 1 month ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

roshan-elastic commented 1 month ago

This would be a good discussion. The goal with the 'log rate' and the log 'error rate' was to give the user an indication of throughput/traffic vs errors so figuring out how to try and present these metrics to the user in a consistent way between APM and logs-only services would be valuable.

Keen to hear people's thoughts.

I'd also like to see what we have running in practice with some customers if possible (or at the least, present to them and ask for their thoughts on the experience).

roshan-elastic commented 3 weeks ago

@smith @iblancof @cauemarcondes

After a few calls this week with customers, the consensus is that:

Proposed Changes

Log Rate : count() / [PERIOD_IN_MINUTES] Log Error Rate : count(kql='log.level: "error" OR log.level: "ERROR" OR error.log.level : "error"') / [PERIOD_IN_MINUTES]

Any thoughts/opinions?

I'd love to make these changes in the service inventory and the service views as part of the current milestone for services if possible.

iblancof commented 2 weeks ago

Hi @roshan-elastic,

I already added the spec in [APM][Services]Change log charts from Log Error % to Log Error Rate on overview #190987.

Regarding the service inventory column change I just created https://github.com/elastic/kibana/issues/191253, feel free to move it under a public epic or milestone if needed!

ℹ️ I mistakenly created the issue in the private repo, but I've already closed it.

roshan-elastic commented 2 weeks ago

Wkd @iblancof

Do you think we could create an issue to change the log rate to remove the need for log.level to be included?

roshan-elastic commented 2 weeks ago

Also, do we think we should change these on the service inventory in the same issue or should we break these out into separate issues?

iblancof commented 2 weeks ago

Hi @roshan-elastic,

In summary, we want the Log Rate for both the inventory and the overview charts to not require having the log.level field populated as mandatory.

As for creating a separate issue for the service inventory and the overview charts, I think we can handle both things within this same issue we're discussing in, right?

My question is about the priority. Since we're focused on the adaptive view, when should we tackle this? Should it be part of the epic? If so, we should include it to clarify priorities and ensure everything is delivered together, even though the scope is different.

roshan-elastic commented 2 weeks ago

Hey @iblancof

In summary, we want the Log Rate for both the inventory and the overview charts to not require having the log.level field populated as mandatory.

Correct!

My question is about the priority. Since we're focused on the adaptive view, when should we tackle this?

Could we do this after we've finished the service view? Goal is that when customers see these new adaptive views - we have consistent log metrics in both the service inventory and the service view.

iblancof commented 2 weeks ago

Going to add this issue as related to the Epic so we make sure we don' t forget about it.

cauemarcondes commented 1 week ago

The action here is not removing the log.level, but rather what other field should we use, am I right? We currently use the log.level to define the log error rate. And we do it because AFAICT there's no standard log-level field.

Log Error Rate : count(kql='log.level: "error" OR log.level: "ERROR"') / [PERIOD_IN_MINUTES]

Like the suggested changes is still relying on the log.level field.

roshan-elastic commented 1 week ago

Hey @cauemarcondes,

Just to play back my understanding of the issue:

Log Rate We would basically count all logs:

count() / [PERIOD_IN_MINUTES]

This one wouldn't need log.level at all.

Log Error Rate Instead of showing a %, we'd show a 'rate' and also update the error to look at:

count(kql='log.level: "error" OR log.level: "ERROR" OR error.log.level : "error"') / [PERIOD_IN_MINUTES]

This one would need either log.level or error.log.level (done via APM I believe) to be set.

Make sense?

cauemarcondes commented 1 week ago

Yes, that makes sense. Now I'm reading the comments again and I got what you meant. Sorry for the noise.

roshan-elastic commented 1 week ago

All good @cauemarcondes. Sorry, this whole thing has been a bit noisy because we're updating these based on user feedback we've only just received in the last week or so.

cauemarcondes commented 1 week ago

@roshan-elastic We should also change the EEM data transforms as they also expect the log.level to exist to calculate the logRate. cc: @simianhacker / @klacabane

klacabane commented 1 week ago

One issue is that an unfiltered count() would include the apm metrics documents so we still need to either identify what is a log or a metrics in the source indices documents. I think excluding metrics would be the easiest (eg metricset.name != "service_transaction" or metricset.name != "service_summary")

roshan-elastic commented 1 week ago

Hey @klacabane - would it include all of them if we specified to only match those from logs-*?

It's OK if they're APM logs writing to logs-apm-* etc.

klacabane commented 1 week ago

would it include all of them if we specified to only match those from logs-*?

@roshan-elastic Yes, I think we have these options:

I think option 2 would give the most accurate results

roshan-elastic commented 1 week ago

Cheers @klacabane.

I think we should make this as simple as possible for users to comprehend so I'm not sure about adding in exclusions vs specifying a simple inclusion.

As a user, I'm thinking that this metric would be calculated in the same way as this lens visualisation:

Filtering for logs Image

Are we able to replicate this logic?

cc @iblancof as we'll want the same logic for showing the counts that we use in the service views

iblancof commented 1 week ago

Thanks for sharing @roshan-elastic .

The Log Error Rate is being handled like this in this PR. I expect the changes on the Log Rate to be handled in this issue we are commenting in.

kpatticha commented 1 week ago

@klacabane should we create a ticket to update the logRate in the service entity definition ?

Happy to make the changes if you can provide the best approach to filter out the logs.

klacabane commented 1 week ago

@kpatticha please feel free to update the definition. As for the best way to get the logRate I listed different approaches and I think both 1 and 2 should provide the results you're looking for but I didn't test that

smith commented 5 days ago

Fixed by https://github.com/elastic/kibana/pull/191962.

iblancof commented 5 days ago

Hey @smith, as I mentioned in this comment, I don't believe this issue is resolved by PR #191962.

PR #191962 focuses on modifying the Log Error Rate chart and formula, while this issue is about ensuring that Logs are not filtered by log.level in the Log Rate.

kpatticha commented 5 days ago

@iblancof is right.

I'm assigning this to myself to update the entity definition and the query accordingly

kpatticha commented 5 days ago

I'm also removing the N/A badge when the log.level is not defined since it won't be valid any more

Image

FYI @roshan-elastic

smith commented 4 days ago

Hey @smith, as I mentioned in https://github.com/elastic/kibana/pull/191962#issuecomment-2328136920, I don't believe this issue is resolved by PR https://github.com/elastic/kibana/pull/191962.

Sorry I missed that @iblancof. Thanks for looking!