Closed by roshan-elastic 7 months ago
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
Note - I'm speaking with George about why this might be. It looks like a known issue with how we calculate the host.network.ingress.bytes field in Elastic that we may need to work around.
We can tackle during refinement.
Hey @nimarezainia,
Would love to get your take/thoughts on this...(context in the issue description).
It seems that the agent reads a lifetime counter that hosts maintain to calculate fields like host.network.ingress.bytes.
When someone reboots the host, those counters are reset and the fields we use (host.network.ingress.bytes) then show a massive spike in ingress/egress. This results in large inaccuracies in the data in our UI (see issue description).
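To make the failure mode concrete, here's a minimal sketch (in Go, illustrative only - not the actual Beats code) of how a per-interval delta derived from a lifetime counter turns a reboot into a huge spike:

```go
package main

import "fmt"

// naiveDelta computes bytes transferred since the last read from a lifetime
// (monotonically increasing) counter. Unsigned subtraction wraps around when
// current < previous, so a counter reset after a reboot becomes an enormous value.
func naiveDelta(previous, current uint64) uint64 {
	return current - previous
}

func main() {
	var previous uint64 = 9_500_000_000 // counter just before the reboot
	var current uint64 = 2_000_000      // counter shortly after the reboot

	// Prints roughly 1.8e19 bytes - the kind of Pb-scale spike seen in the UI.
	fmt.Println(naiveDelta(previous, current))
}
```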
I was wondering if you have any thoughts/advice on how this could be handled?
@roshan-elastic wouldn't it be better if this ticket had been opened with the Beats team? Our team can't do much about it if the problem happens on the metrics collection side.
@cmacknz when the host reboots, should we be reading zero as well or do we ignore zero values?
should we be reading zero as well or do we ignore zero values?
We may not read zero, but we will read a value that is lower than it was previously. This should be something we can detect and account for.
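A hedged sketch of that detection (hypothetical, not the code from the linked Beats PR): if the counter reads lower than the previous sample, assume the host rebooted and treat the current value as the bytes accumulated since the reset:

```go
package main

import "fmt"

// deltaWithResetDetection is a hypothetical illustration of accounting for a
// counter reset: when the lifetime counter goes backwards, assume it restarted
// from zero and count only what has accumulated since the reboot.
func deltaWithResetDetection(previous, current uint64) uint64 {
	if current < previous {
		// Counter reset detected (e.g. host reboot).
		return current
	}
	return current - previous
}

func main() {
	fmt.Println(deltaWithResetDetection(9_500_000_000, 2_000_000)) // 2000000, not a Pb-scale spike
}
```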
This actually sounds familiar, what version is this? There was a PR to fix this in https://github.com/elastic/beats/pull/35977; the original issue was https://github.com/elastic/beats/issues/35944.
Looks like that commit would be in 8.10, so if this isn't an 8.10.0-SNAPSHOT version then it won't have it.
@roshan-elastic @crespocarlos is this something that you have reproduced? I don't see a version mentioned in this issue, but I suspect it's not 8.10. Per @cmacknz's comment (thanks Craig), could we try the newer image?
Hi @nimarezainia, I haven't reproduced this issue in particular.
I reported this https://github.com/elastic/beats/issues/35944, but it has been fixed and I haven't seen the problem again since then. The problem here looks the same.
We'll have to see which version was being used when this error was reported. Looking at the UI, I would say that at least Kibana is running on 8.8. @roshan-elastic is on PTO; I'll confirm that when he's back.
Thanks everyone!
@crespocarlos - when we pick up this ticket, is there a way we can test the behaviour to see if we can get the bug to show again?
Based on my understanding, if we were to run a load of network activity on a host and then reboot it, we should see weird numbers with the Rx/Tx if this bug were still present...
Not sure of the best way to approach this, but I'm thinking that if we try to break it and it still works OK, we can probably close this unless a user reports it on one of the newer versions (where it should be fixed)...
Hey @roshan-elastic, we can try to reproduce the error to ensure it's not happening on the newer versions.
It's just that if the bug still exists, it's probably something that needs to be fixed by the Beats team.
Cheers @crespocarlos.
So if we test and it still shows, do you think we should assign this to them (or maybe create a separate issue linked to this one - and I can help push it with them)?
Pinging @elastic/obs-ux-infra_services-team (Team:obs-ux-infra_services)
@roshan-elastic could we forward this issue to the agents team?
Hey @crespocarlos, sorry for the delay on this. I tried replicating this on Linux and Windows and everything worked fine:
Link to testing (internal only)
I'm going to close this one off as we can't replicate it (it's possible a release of the agent at some point has resolved the issue the customer was seeing).
Cheers for looking at it.
🔗 Key Links
Issues
📖 Description
Network RX/TX byte sizes are showing peculiar activity for a customer when they change time range:
Last 15 mins - Kb
Last 15 mins - Pb
Apparently this is due to a well-known issue with how the host.network.ingress.bytes and host.network.egress.bytes fields are calculated in Observability (in general - not just Elastic).
Discover - records showing a huge spike due to incorrect field calculation
Lens - showing these trended over time...look at the spikes...
✔️ Acceptance criteria
What must this feature have?
1. Must Have
Must be delivered in this issue in order for the release to be valuable
2. Should Have
3. Could Have
Would be nice to have but not critical
4. Will Not Have (for now)
Explicitly will not be looked at within this issue
📈 Telemetry Process