Once it gets into the state on the left of the graph, I can shut down the app and leave the exporter running, and it does not recover. If I shut down the exporter, it recovers. That's why I suspect the exporter goes into some form of pathological retry loop.
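For contrast, here's a minimal sketch (purely illustrative, not the exporter's actual code) of the kind of bounded, backed-off retry that would let a struggling node recover; the endpoint, timeout, and backoff parameters are all assumptions:

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// fetchWithBackoff makes at most maxAttempts requests, doubling the wait
// after each failure (plus jitter) so a slow beacon node sees less load
// rather than more.
func fetchWithBackoff(ctx context.Context, client *http.Client, url string, maxAttempts int) error {
	backoff := time.Second
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
		if err != nil {
			return err
		}
		resp, err := client.Do(req)
		if err == nil {
			resp.Body.Close()
			return nil // success: back to the normal polling schedule
		}
		if attempt == maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		// Wait with up to 50% jitter, then double the delay, capped at 30s.
		time.Sleep(backoff + time.Duration(rand.Int63n(int64(backoff/2))))
		if backoff *= 2; backoff > 30*time.Second {
			backoff = 30 * time.Second
		}
	}
	return nil
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	err := fetchWithBackoff(context.Background(), client, "http://localhost:5052/eth/v1/node/syncing", 5)
	fmt.Println("result:", err)
}
```

With an attempt cap and growing delays, a failed scrape surfaces an error and waits for the next cycle instead of compounding the load.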
It develops over a longer period of time. It may be some kind of "unhappy coincidence" that gets it into that state and then won't let it come out again. You can see it can take weeks before the issue occurs.
I am grasping at straws as to why ethereum-metrics-exporter may cause this, tbh. Here are the logs for the day prior to shutting it down. All I see are some "invalid voluntary exit event" messages.
Hang loose, please. It is not this clear-cut. I re-enabled the exporter, saw the issue again, disabled the exporter, and the issue did not go away. Swung the Stader guardian over, and the issue was gone.
Previously I had disabled the guardian, and the issue only cleared when the exporter was disabled. So ... two issues for the price of one? The guardian causes the slowdown, and then the exporter goes into a tailspin?
I don't know yet. It'll take a little time to get a better sense of how the code querying Lighthouse interacts, and whether one piece or several are causing issues.
Guardian is definitely to blame, and we are tackling that part. It's possible that metrics-exporter still has an infinite-retry problem that makes things worse, but once Guardian is fixed, we won't be able to trigger it any more. Closing; can re-open if it ever rears its head again.
Observed with Lighthouse tree-states, though this may also happen without it.
This is a Lighthouse node that receives fairly heavy queries every 5 minutes, each taking >4s to return. If ethereum-metrics-exporter is running, this eventually becomes pathological, with queries taking >18s.
I am wondering whether there's something in the exporter that starts hammering Lighthouse with retries when queries take a long time.
Here's a screenshot of Lighthouse's P1 REST API return time; a reading of 10s really means >=10s. You can see where the metrics exporter was turned off. The spikes every 5 minutes are the heavy REST queries from an app.
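If that hypothesis is right, the failure mode would look something like this minimal Go sketch: a client timeout shorter than the server's response time, combined with immediate, uncapped retries. The timeout value, endpoint, and intervals here are assumptions for illustration, not the exporter's actual code:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Hypothetical 4s client timeout; once the beacon API takes longer than
	// this to answer, every request fails and gets retried at full speed.
	client := &http.Client{Timeout: 4 * time.Second}

	for {
		resp, err := client.Get("http://localhost:5052/eth/v1/node/syncing")
		if err != nil {
			// Pathological: retry with zero delay and no attempt cap, so the
			// extra load slows the API further and the node never recovers
			// until the client process is stopped.
			fmt.Println("request failed, retrying immediately:", err)
			continue
		}
		resp.Body.Close()
		time.Sleep(30 * time.Second) // normal scrape interval (illustrative)
	}
}
```

Once the API's response time crosses the client timeout, a loop like this replaces one request per scrape interval with a continuous stream of them, which could explain the climb from >4s to >18s response times.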