fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Make every query in Fleet a useful report #6716

Closed: mikermcneil closed this issue 7 months ago

mikermcneil commented 2 years ago

UPDATE: Closed this issue because all of the stories included in this issue have shipped.

(noahtalerman 2024-04-12)


Problem

When I run a live query in Fleet, I only see data for the hosts that are online right now. On an average Monday morning, only 20% of my hosts are online.

This makes it hard to see query results for all of my hosts.

Please watch: https://www.loom.com/share/9772acb4a37a4556a69c27bb990c5501

Goal

Add the ability to see the latest query results for all my hosts so that I can explore, ask questions, obtain insights, and plan automations without having to go into another tool (my log destination).

By "latest" we mean: when a host responds to the query with new results, Fleet drops the host's old results and shows only the new ones.
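A minimal sketch of this "latest only" semantic (the names and shapes here are illustrative, not Fleet's actual schema): each (query, host) pair stores exactly one result set, and a new response from a host simply overwrites whatever was stored before.

```python
# Illustrative sketch of "latest results" storage: one result set
# per (query_id, host_id); a new response replaces the old one.
cache = {}

def store_results(query_id, host_id, rows):
    """Overwrite any previously stored rows for this query/host pair."""
    cache[(query_id, host_id)] = rows

store_results(1, 42, [{"uptime": "100"}])
store_results(1, 42, [{"uptime": "200"}])  # replaces the earlier result
print(cache[(1, 42)])  # -> [{'uptime': '200'}]
```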

Parent epic

Related

UI Children

CLI children

erikng commented 2 years ago

I like this idea a lot. Some questions/ideas:

Doesn't this also get rid of the need for additional_queries that you currently offer?

Now that you'll have these query results stored and immediately accessible for all queries by default, it doesn't make as much sense to keep that functionality. Perhaps you can port these additional queries to the new methodology upon upgrade, and keep them in the current settings so people can migrate over.

What if we took your idea even further and also merged Policies into it?

If you're already going to store this data for all queries by default, perhaps the mutation of the data could be done after the fact.

Example: SELECT 1 from screenlock WHERE grace_period LIKE '5' LIMIT 1;

Could be turned into

select grace_period from screenlock; and then, within the query UI, you add additional rules: the value type returned and the value you expect. By doing it this way, you can now do the following:

  1. Admin runs a test query and gets the data they want
  2. Admin decides they like the query and deploys it company wide
  3. Manager asks admin to turn this query into a policy
  4. Admin adds the new mutation logic within the FleetUI and Fleet immediately processes all saved device data for pass/fail criteria

Doing it this way would greatly speed up converting queries into policies.
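The workflow above could be sketched roughly like this (a hypothetical server-side evaluation, not Fleet's implementation): the data-collection query (e.g. select grace_period from screenlock;) is cached per host, and a pass/fail rule is applied afterward to the already-collected rows.

```python
# Hypothetical sketch: evaluate a policy rule against cached query
# results, so converting a query into a policy needs no new collection.
def evaluate_policy(cached_rows_by_host, column, expected):
    """Return pass/fail per host from already-collected rows."""
    results = {}
    for host_id, rows in cached_rows_by_host.items():
        # A host passes if any cached row has the expected value.
        results[host_id] = any(r.get(column) == expected for r in rows)
    return results

cached = {
    "host-a": [{"grace_period": "5"}],
    "host-b": [{"grace_period": "60"}],
}
print(evaluate_policy(cached, "grace_period", "5"))
# -> {'host-a': True, 'host-b': False}
```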

Would these queries show up in the device API and be streamed to Kafka?

mikermcneil commented 2 years ago
  1. Def
  2. Good point. Maybe a wave 2 thing to consider, just due to the “additional rules” (right now, SQL is interpreted only on the monitored host)
  3. It would show up in the Fleet API, and if data collection automations are enabled for the query, then it would flow to the log destination (i.e., Kafka) each time results are captured (where differential vs. snapshot is a property of the automation)
mikermcneil commented 2 years ago

More feedback / related conversations in the wild:

mikermcneil commented 2 years ago

6:48 AM Contributor from F1000 organization

ok yeah, I understand a bit better, but then what differentiates a scheduled query vs. an ad hoc one? Sounds like they're both scheduled. As an aside, larger fleets running this would probably need larger cache clusters.

2:02 PM mikermcneil

but then what differentiates a scheduled query vs. an ad hoc one?

Basically every query would be scheduled by default, in terms of collecting data automatically (only the most recent result for each host). If you want to turn that off, you could still do so, and use it only for traditional live querying (with the target picker where you select hosts). And then, like how policies and vulnerability automations work, your control over the flow of data into your log destination would be governed by your "query automations". So you can still choose whether to have results flow into the log destination or not on a query-by-query basis.

As an aside, larger fleets running this would probably need larger cache clusters.

Totally. That's zwass's thinking too. We'd need to do some smart things to help make it clear what the impact of running a query is, and only maintainers would be able to author new queries. The data would likely be in MySQL or Redis. We want to avoid adding another infra dependency for folks to contend with, if possible.

2:05 PM Contributor from F1000 organization

I think that makes sense to me in some aspects. Ideally there should just be "queries" that you can schedule or run in real time (ad hoc). If I go back to "what problem are you trying to solve?", it's essentially scheduling queries to collect data as hosts come online to Do Things With™

noahtalerman commented 2 years ago

Hey @mike-j-thomas when you get the chance, can I please get your help on the following UI changes?

The "How?" section in this issue's description gives a longer walkthrough of what we're trying to accomplish with these changes.

  1. As a user writing/testing a query, I don't need to see "Frequency," "Platforms," and "Minimum osquery version" options. This is because these options are used to adjust/tune how often and on what hosts the query runs. I want to adjust/tune these settings after I've tested and saved the query.

  2. As a user viewing my query results, I don't need to see the SQL editor. This is because the SQL editor is used when I'm writing/testing a query. I can edit my query if I want to update the SQL to add or remove a column.

Current drafted UI changes for the Query page are here: https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?node-id=9114%3A293112 [Screenshot of the drafted Query page changes, 2022-08-22]

noahtalerman commented 2 years ago

Hey @mike-j-thomas heads up, please ignore the first set of UI changes (number 1 in the above comment). These UI changes are no longer relevant.

The second UI change (number 2) is still relevant. It would be great to get your help with this.

Number 1 is no longer relevant because we decided to remove the "Frequency," "Platforms," and "Minimum osquery version" options from the UI.

noahtalerman commented 2 years ago

Feedback from Mike McNeil on current Figma wireframes (2022-08-30).

mike-j-thomas commented 2 years ago

Hey @noahtalerman, is this feedback for me? If it is, I need to schedule a time to discuss it with you.

noahtalerman commented 2 years ago

@mike-j-thomas this comment is feedback for me: https://github.com/fleetdm/fleet/issues/6716#issuecomment-1234302935

mikermcneil commented 2 years ago

More feedback from a senior detection and response engineer:

this is VERY cool.

I love the idea of caching the results, and never having to leave the platform.

latest result is something to go on, that gives you the idea if something was different the last time you checked.

noahtalerman commented 2 years ago

@mikermcneil heads up, I'm moving your "How?" and "Example scenario" sections below for safekeeping (removed from the issue description). This is because I'd like the issue description to reflect the latest plan.

How?

Here is a short video showing how this could fit into the Fleet user interface: https://www.loom.com/share/9772acb4a37a4556a69c27bb990c5501

Rather than adding custom widgets on the logged-in homepage, or adding more surface area / sprawl to the product through additional custom reporting, there's another way to approach this that we've discussed before. Here's a fresh take on how we might execute that:

Example scenario

I wondered: Can @GuillaumeRoss see login attempts to my computer? Could a Fleet user create a world where, if their 1-year-old pounding on the keyboard causes 20 failed login attempts, a policy automation gets triggered?

I decided to run a query against Fleet's macOS laptops, just to see what came back.

Some observations:

  • Since only two computers (mine and Reed's) were online when I ran the query, and Fleet's live queries only return data from devices that are currently online, my experience was a little disappointing.
  • Like, I could set up a policy in Fleet to do some simple "more than 5 login attempts or no?" reporting on this, and I could use policy automations to set up an alert when there are more.
  • And that's all great... but to get the motivation to do that, I want to be able to explore what the data looks like (like, what's normal? What's Tim's computer like? Is this number higher for people with young kids?)
  • So then it made me think "ok, I can just make a scheduled query, then I'll explore it in my log destination" (but then I'm leaving Fleet to go into some other SIEM software to set up a custom report, and then waiting for the scheduled queries to run before I see anything useful)

There's gotta be a better way.


noahtalerman commented 2 years ago

#7766 and "See query results on the Host details page" are deprioritized.

I removed this issue from the roadmap board because it will just sit on the board until the above is prioritized.

noahtalerman commented 2 years ago

I pulled the following feedback on "See query results on the Host details page" (phase 3) out of the product design review doc (noahtalerman 2022-10-13):

[Screenshot: feedback pulled from the product design review doc, 2022-10-13]

mikermcneil commented 1 year ago

@dherder

lukeheath commented 1 year ago

@zhumo I'm assigning this to us as the DRIs for moving this back through the design/spec process.

zwass commented 1 year ago

This is going to be a performance-intensive feature on the backend. We should endeavor to design the UX and engineer the backend to minimize that cost.

Scheduled queries vs. distributed queries

Which of the two ways to run queries should this be backed with? My instinct is distributed queries. The primary reasons being:

  1. Scheduled queries generate logs even when the host is offline. Fleet will then have to ingest entire batches of logs when we are only interested in the most recent state. This is doubly problematic because when ingesting any log request from a host we cannot be certain that the host does not have further logs (those generated from later runs of the query) buffered. That means that we might have to store values that we then immediately throw out.
  2. The way that schedules work does not provide ideal ergonomics for this feature, IMO. Scheduled query intervals run on "ticks" that only pass when the host is online, so if a host returns from a long period offline, it will not update immediately. As a user, I would want it to update immediately.
  3. Fleet currently allows almost all features to be used even if scheduled query logs are sent to any of osquery's built-in logging plugins besides the typical TLS (which logs through Fleet). If we use distributed queries, this feature will also work regardless of how osquery's logging is configured.

Any of the above concerns could be addressed by changes in osquery if we found another important reason to use scheduled queries.

The possible advantage of scheduled queries is that it seems useful to be able to process differential results to update the datastore. I suggest we do not try to do this, however. See next section:

Result storage in MySQL

My experience with storing high volumes of data in MySQL is that the primary problem we have run into is lock contention. Because of this, I think we should try to make host check-ins append-only, or at least try to avoid transactions (e.g., inserts are expected; definitely don't make any updates). It might make sense to have the API then return only the most recent result(s) from each host, and have some sort of background cleanup job for the old results to keep those queries (and the potential locks they induce) out of the critical path of host check-in.
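The append-only pattern described above can be sketched as follows. This is an assumed schema (SQLite in memory, for illustration only, not Fleet's MySQL tables): check-ins only INSERT, reads select the newest row per host, and a separate cleanup step deletes superseded rows outside the check-in path.

```python
# Sketch of the append-only result store: writes never UPDATE,
# so host check-ins avoid row-lock contention on the hot path.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE query_results (
    id INTEGER PRIMARY KEY, query_id INT, host_id INT,
    data TEXT, created_at INT)""")

def record(query_id, host_id, data, ts):
    # Check-in path: pure INSERT, no transactions or updates.
    db.execute(
        "INSERT INTO query_results (query_id, host_id, data, created_at)"
        " VALUES (?, ?, ?, ?)", (query_id, host_id, data, ts))

def latest(query_id):
    # Read path: only the most recent row per host.
    return db.execute("""SELECT host_id, data FROM query_results q
        WHERE query_id = ? AND created_at = (
            SELECT MAX(created_at) FROM query_results
            WHERE query_id = q.query_id AND host_id = q.host_id)""",
        (query_id,)).fetchall()

def cleanup(query_id):
    # Background job: drop superseded rows, off the critical path.
    db.execute("""DELETE FROM query_results
        WHERE query_id = ? AND created_at < (
            SELECT MAX(created_at) FROM query_results r
            WHERE r.query_id = query_results.query_id
              AND r.host_id = query_results.host_id)""", (query_id,))

record(1, 7, "old", 100)
record(1, 7, "new", 200)
print(latest(1))  # -> [(7, 'new')]
```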

Additionally, we may want to run some experiments on how MySQL write throughput compares when inserting each row returned from osquery as a separate row in MySQL vs. storing all of the osquery rows together in a JSON column in MySQL.

Generally I think @sharon-fdm's approach for result storage is a good design, but I don't think it yet specifies some of the things discussed above, and this may be best informed by running some experimentation.

Result size limits

If we don't have size limits on the results, inevitably we will see performance problems (occasionally severe) when people try to store huge amounts of data (intentionally or not). If we do have size limits, inevitably someone will come up with a use-case that cannot be achieved. Pick our poison, I think. My preference would be to set a limit (either number of rows or number of bytes of data) because I'd rather see feature disappointment than outages. If there are limits we will want to be careful about the UX so that users understand when and how data is truncated (potential questions: If a host returns too much data do we include some rows but truncate the rest? Do we throw out all of the rows for that host? Is the limiting even on a per-host basis? How can I tell when I've exceeded limits?)
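One possible shape for such a limit (the constant and function names here are hypothetical, and this picks one answer to the truncation questions above: keep the first N rows per host and flag the cut-off so the UI can surface it):

```python
# Hypothetical per-host row cap: keep the first N rows and report
# whether truncation happened, so the UI can tell the user.
MAX_ROWS_PER_HOST = 1000

def apply_limit(rows, limit=MAX_ROWS_PER_HOST):
    """Return (kept_rows, truncated_flag) for one host's results."""
    if len(rows) <= limit:
        return rows, False
    return rows[:limit], True

rows = [{"n": i} for i in range(1500)]
kept, truncated = apply_limit(rows)
print(len(kept), truncated)  # -> 1000 True
```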

lukeheath commented 1 year ago

@zwass Thanks for the info! Are distributed queries the same as live queries? (i.e., a query I run on demand and get responses from online hosts only)

zwass commented 1 year ago

Essentially, yes. The distinction is that the distributed query APIs are what Fleet uses to implement "live queries". Distributed queries are the mechanism by which osquery asks on a recurring basis for any queries to run immediately. Within Fleet this mechanism is used for live queries, host vitals, and policies.

zhumo commented 1 year ago

This issue is a parent epic which organizes the child stories, but should not go on the product board. Each phase is the child story which will be designed and shipped.

zhumo commented 1 year ago

Removing the product label from this issue. This issue is a parent epic which captures all phases in a child story. Each child story should go through the board.

fleet-release commented 7 months ago

All queries now yield,
Insights across all hosts gleaned,
Cloud city's data field.