Open getvictor opened 6 months ago
Victor: Expose to the user what query is triggering the osquery watchdog.
Hey @getvictor, I updated the issue description to use the user story format. I copied your original issue description here for safekeeping:
Problem
When the osqueryd worker is stopped, fleetd should not restart.
W0325 09:09:18.191121 1797648384 watcher.cpp:424] osqueryd worker (43241) stopping: Memory limits exceeded: 791101440 bytes (limit is 200MB)
This issue came out of debugging https://github.com/fleetdm/fleet/issues/17827 and is related to https://github.com/fleetdm/fleet/issues/18004
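For anyone who wants to pull the numbers out of that watchdog line, here is a rough Go sketch. It assumes the message always looks like the single example quoted above; other watchdog messages (for example, CPU limits) would not match. `ParseWatchdogLine` and `WatchdogEvent` are hypothetical names, not existing fleetd or osquery code.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// watchdogRe matches only the memory-limit message shape quoted in this issue.
// The pattern is an assumption based on that one log line.
var watchdogRe = regexp.MustCompile(
	`osqueryd worker \((\d+)\) stopping: Memory limits exceeded: (\d+) bytes \(limit is (\d+)MB\)`)

// WatchdogEvent is a hypothetical struct holding the parsed fields.
type WatchdogEvent struct {
	PID       int
	UsedBytes int64
	LimitMB   int
}

// ParseWatchdogLine extracts the worker PID, memory used, and limit from a
// log line. It returns false when the line does not match the pattern.
func ParseWatchdogLine(line string) (WatchdogEvent, bool) {
	m := watchdogRe.FindStringSubmatch(line)
	if m == nil {
		return WatchdogEvent{}, false
	}
	pid, err1 := strconv.Atoi(m[1])
	used, err2 := strconv.ParseInt(m[2], 10, 64)
	limit, err3 := strconv.Atoi(m[3])
	if err1 != nil || err2 != nil || err3 != nil {
		return WatchdogEvent{}, false
	}
	return WatchdogEvent{PID: pid, UsedBytes: used, LimitMB: limit}, true
}

func main() {
	line := "W0325 09:09:18.191121 1797648384 watcher.cpp:424] osqueryd worker (43241) stopping: Memory limits exceeded: 791101440 bytes (limit is 200MB)"
	if ev, ok := ParseWatchdogLine(line); ok {
		fmt.Printf("pid=%d used=%d limitMB=%d\n", ev.PID, ev.UsedBytes, ev.LimitMB)
	}
}
```

Note that this line does not name the query that tripped the watchdog, so parsing it only tells us that a worker was stopped and how far over the limit it was.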
Potential solutions
fleetd should restart osqueryd without restarting itself (and desktop). If osqueryd restarts are frequent, perhaps fleetd can flag the host in Fleet (so that admin can look into this further) and/or enable some debug capabilities for osqueryd.
I see 2 problems here:
This story covers problem 1 while the following story covers 2:
I think we can ship this story before we ship #18004
Why? While we can make it clearer which policy/query is denylisted and when, the IT admin can still determine that this might be happening when a host isn't updating host vitals. Then, they can check out the osquery logs to find the problem policy/query.
I added #18004 to feature fest.
Hey @getvictor this didn't make the 3-week drafting => estimation timeline. Bringing it back to feature fest.
Hey @getvictor I updated the user story and the changes we want to make in the issue description.
Let me know if you have any thoughts/feedback.
Goal is to bring this one through the next estimation session.
We need to estimate the risk here. We think some code areas assume that everything is reset on restart, so we need to check that this change won't introduce other bugs.
Goal
Context
Changes
Product
Engineering
QA
Risk assessment
Manual testing steps
Testing notes
Confirmation