fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Don't restart fleetd when osqueryd restarts #18005

Open getvictor opened 6 months ago

getvictor commented 6 months ago

Goal

User story
As an end user,
I don't want to see Fleet Desktop disappear/reappear when osquery restarts due to an expensive query triggering the watchdog
so that Desktop doesn't seem broken.

Context

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.

noahtalerman commented 5 months ago

Victor: Expose to the user what query is triggering the osquery watchdog.

noahtalerman commented 5 months ago

Hey @getvictor I updated the issue description to use the user story format. I pulled your original issue description here for safe keeping:

Problem

When the osqueryd worker is stopped, fleetd should not restart.

W0325 09:09:18.191121 1797648384 watcher.cpp:424] osqueryd worker (43241) stopping: Memory limits exceeded: 791101440 bytes (limit is 200MB)
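
For illustration only, here is a minimal Go sketch of how a log line like the one above could be recognized as a watchdog memory stop. The regular expression is derived solely from this single sample line, so the format is an assumption, and `watchdogStop` and `parseWatchdogStop` are hypothetical names, not part of fleetd or osquery.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// watchdogStop holds the fields pulled from a watchdog "Memory limits exceeded" line.
// Hypothetical type for illustration.
type watchdogStop struct {
	PID       int
	UsedBytes int64
	LimitMB   int
}

// Pattern assumed from the single sample line in this issue; other osquery
// versions may word the message differently.
var memLimitRe = regexp.MustCompile(
	`osqueryd worker \((\d+)\) stopping: Memory limits exceeded: (\d+) bytes \(limit is (\d+)MB\)`)

// parseWatchdogStop reports whether line is a memory-limit watchdog stop and,
// if so, returns the parsed fields.
func parseWatchdogStop(line string) (watchdogStop, bool) {
	m := memLimitRe.FindStringSubmatch(line)
	if m == nil {
		return watchdogStop{}, false
	}
	pid, err := strconv.Atoi(m[1])
	if err != nil {
		return watchdogStop{}, false
	}
	used, err := strconv.ParseInt(m[2], 10, 64)
	if err != nil {
		return watchdogStop{}, false
	}
	limit, err := strconv.Atoi(m[3])
	if err != nil {
		return watchdogStop{}, false
	}
	return watchdogStop{PID: pid, UsedBytes: used, LimitMB: limit}, true
}

func main() {
	line := "W0325 09:09:18.191121 1797648384 watcher.cpp:424] osqueryd worker (43241) stopping: Memory limits exceeded: 791101440 bytes (limit is 200MB)"
	if ev, ok := parseWatchdogStop(line); ok {
		fmt.Printf("worker %d stopped: used %d bytes, limit %d MB\n", ev.PID, ev.UsedBytes, ev.LimitMB)
	}
}
```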

This issue came out of debugging https://github.com/fleetdm/fleet/issues/17827 and is related to https://github.com/fleetdm/fleet/issues/18004

Potential solutions

fleetd should restart osqueryd without restarting itself (and Fleet Desktop). If osqueryd restarts are frequent, perhaps fleetd can flag the host in Fleet (so that an admin can look into this further) and/or enable some debug capabilities for osqueryd.

I see 2 problems here:

  1. fleetd (Orbit and Fleet Desktop) restarting when osquery restarts causes a poor end user experience.
  2. When osquery is restarting due to an expensive query triggering the watchdog, we don't surface this to the IT admin.

This story covers problem 1 while the following story covers 2:

I think we can ship this story before we ship #18004

Why? Although we could make it clearer which policy/query is denylisted and when, the IT admin can already tell that this might be happening when a host stops updating host vitals. They can then check the osquery logs to find the problem policy/query.

I added #18004 to feature fest.

noahtalerman commented 4 months ago

Hey @getvictor this didn't make the 3-week drafting => estimation timeline. Bringing it back to feature fest.

noahtalerman commented 4 months ago

Hey @getvictor I updated the user story and the changes we want to make in the issue description.

Let me know if you have any thoughts/feedback.

Goal is to bring this one through the next estimation session.

sharon-fdm commented 4 months ago

We need to estimate the risk here. We think some code areas assume that everything is reset, so we need to check that this change won't introduce other bugs.