elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

[Infra UI] Network TX/RX show inaccurate spikes in activity #164159

Closed roshan-elastic closed 7 months ago

roshan-elastic commented 1 year ago

Note : The root cause has been explained by one of our SREs but will likely need some investigation to confirm

🔗 Key Links

Issues

Note : This will be completed once the epic has been refined and issues created

## Issues/Tasks
- [ ] Add issues here

📖 Description

Network RX/TX byte sizes are showing peculiar activity for a customer when they change time range:

Last 15 mins - Kb Image

Last 15 mins - Pb Image

Apparently this is due to a well-known issue with how the host.network.ingress.bytes and host.network.egress.bytes fields are calculated in Observability (in general, not just in Elastic).

Discover - records showing a huge spike due to incorrect field calculation Image

Lens - showing these trended over time...look at the spikes... Image

Note : We use the following formulae to calculate RX/TX in our Infra UI charts:

RX TX

✔️ Acceptance criteria

What must this feature have?

Draft

1. Must Have

Must be delivered in this issue in order for the release to be valuable

| Name | Description | Notes |
| --- | --- | --- |
| Network RX/TX should not show these spikes if a host restarts | The user should not need to see this behaviour caused by machine reboots | - |

2. Should Have

| Name | Description | Notes |
| --- | --- | --- |
| - | - | - |

3. Could Have

Would be nice to have but not critical

| Name | Description | Notes |
| --- | --- | --- |
| - | - | - |

4. Will Not Have (for now)

Explicitly will not be looked at within this issue

| Name | Description | Notes |
| --- | --- | --- |
| - | - | - |

📈 Telemetry Process

elasticmachine commented 1 year ago

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

roshan-elastic commented 1 year ago

Note - I'm speaking with George about why this might be. It looks like a known issue with how we calculate the host.network.ingress.bytes field in Elastic that we may need to work around.

We can tackle during refinement.

roshan-elastic commented 1 year ago

Hey @nimarezainia,

Would love to get your take/thoughts on this...(context in the issue description).

It seems that the agent hooks into lifetime counters that hosts maintain in order to calculate fields like host.network.ingress.bytes.

When someone reboots the host, it appears those counters are reset and the fields we use (host.network.ingress.bytes) then show a massive spike in ingress/egress. This then results in large inaccuracies in the data in our UI (see issue description).
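To illustrate the mechanism described above, here is a minimal sketch of how a per-interval delta derived from a lifetime counter can spike after a reboot. This is not the agent's actual code: the function name `naive_delta`, the sample values, and the assumption that the subtraction happens in unsigned 64-bit arithmetic are all hypothetical.

```python
U64 = 2**64  # assumption: counters are handled as unsigned 64-bit integers


def naive_delta(previous: int, current: int) -> int:
    """Bytes for one interval as current - previous, wrapping like a uint64."""
    return (current - previous) % U64


# Hypothetical lifetime counter samples in bytes.
# The host reboots before the last sample, so the counter restarts near zero.
samples = [1_000_000, 1_004_000, 1_009_000, 2_000]

deltas = [naive_delta(prev, cur) for prev, cur in zip(samples, samples[1:])]

for delta in deltas:
    print(delta)
```

Under these assumptions the first two intervals are a few kilobytes, while the interval spanning the reboot wraps around to a value close to 2^64 bytes, which would dominate any sum or average in a chart.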

I was wondering if you have any thoughts/advice on how this could be handled?

crespocarlos commented 1 year ago

@roshan-elastic wouldn't it be better if this ticket had been opened with the Beats team? Our team can't do much about it if the problem happens on the metrics collection side.

nimarezainia commented 1 year ago

@cmacknz when the host reboots, should we be reading zero as well or do we ignore zero values?

cmacknz commented 1 year ago

> should we be reading zero as well or do we ignore zero values?

We may not read zero, but we will read a value that is lower than it was previously. This should be something we can detect and account for.
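A minimal sketch of what "detect and account for" could look like, assuming a reset is inferred whenever the counter goes backwards. The names `reset_aware_delta` and `interval_deltas` are hypothetical and this is not the Beats implementation; in particular, treating the post-reset reading as the bytes transferred since the reset is one possible policy (reporting zero for that interval would be another).

```python
from typing import List, Optional


def reset_aware_delta(previous: Optional[int], current: int) -> int:
    """Bytes for one interval, tolerating a counter reset."""
    if previous is None:
        # First sample: there is no baseline to subtract from yet.
        return 0
    if current < previous:
        # Counter went backwards: assume it was reset (e.g. host reboot)
        # and count only the bytes accumulated since the reset.
        return current
    return current - previous


def interval_deltas(samples: List[int]) -> List[int]:
    """Apply reset_aware_delta across a series of lifetime counter samples."""
    result = []
    previous: Optional[int] = None
    for current in samples:
        result.append(reset_aware_delta(previous, current))
        previous = current
    return result


# Hypothetical samples in bytes; the counter resets before the fourth sample.
print(interval_deltas([1_000_000, 1_004_000, 1_009_000, 2_000, 6_000]))
```

With this policy the interval spanning the reset reports a small value instead of a spike. It can undercount bytes sent between the last pre-reset sample and the reset itself.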

This actually sounds familiar. What version is this? There was a PR to fix this, https://github.com/elastic/beats/pull/35977; the original issue was https://github.com/elastic/beats/issues/35944.

Looks like that commit would be in 8.10 so if this isn't an 8.10.0-SNAPSHOT version then it won't have it.

nimarezainia commented 1 year ago

@roshan-elastic @crespocarlos is this something that you have reproduced? I don't see a version mentioned in this issue, but I suspect it is not 8.10. Per @cmacknz's comment (thanks Craig), could we try the newer image?

crespocarlos commented 1 year ago

Hi @nimarezainia , I haven't reproduced this issue in particular.

I reported this https://github.com/elastic/beats/issues/35944, but it has been fixed and I haven't seen the problem again since then. The problem here looks the same.

We'll have to see which version was being used when this error was reported. By looking at the UI, I would say that at least Kibana is running on 8.8. @roshan-elastic is on PTO, I'll confirm that when he's back.

roshan-elastic commented 1 year ago

Thanks everyone!

@crespocarlos - when we pick up this ticket, is there a way we can test the behaviour to see if we can get the bug to show again?

Based on my understanding, if we were to run a load of network activity on a host and then reboot it, we should see weird numbers with the Rx/Tx if this bug were still present...

Not sure of the best way to approach this, but I'm thinking that if we try to break it and it still works OK, we can probably close this unless a user reports it on one of the newer versions (where it should be fixed)...

crespocarlos commented 1 year ago

hey @roshan-elastic , we can try to reproduce the error to ensure it's not happening on the newer versions.

It's just that if the bug still exists, it's probably something that needs to be fixed by the beats team.

roshan-elastic commented 1 year ago

Cheers @crespocarlos.

So if we test and it still shows, you think we should maybe assign this to them (or maybe create a separate issue that is linked to this - and I can help push with them)?

elasticmachine commented 1 year ago

Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)

crespocarlos commented 7 months ago

@roshan-elastic could we forward this issue to the agents team?

roshan-elastic commented 7 months ago

Hey @crespocarlos, sorry for the delay on this. I tried replicating this on Linux and Windows and everything worked fine:

Link to testing (internal only)

I'm going to close this one off as we can't replicate this (it's possible a release of agent at some point has resolved the issue the customer has seen).

Cheers for looking at it.