fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Make every query in Fleet a useful report #6716

Closed: mikermcneil closed this issue 7 months ago

mikermcneil commented 2 years ago

UPDATE: Closed this issue because all of the stories included in this issue have shipped.

(noahtalerman 2024-04-12)


Problem

When I run a live query in Fleet, I only see data for the hosts that are online right now. On an average Monday morning, only 20% of my hosts are online.

This makes it hard to see query results for all of my hosts.

Please watch: https://www.loom.com/share/9772acb4a37a4556a69c27bb990c5501

Goal

Add the ability to see the latest query results for all my hosts so that I can explore, ask questions, obtain insights, and plan automations without having to go into another tool (my log destination).

By "latest" we mean: when a host responds to the query with new results, Fleet drops the host's old results and shows only the new ones.
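A minimal sketch of this "latest only" semantic (the names and shapes here are illustrative, not Fleet's actual schema): each (query, host) pair stores exactly one result set, and a new response from a host simply overwrites whatever was stored before.

```python
# Illustrative sketch of "latest results" storage: one result set
# per (query_id, host_id); a new response replaces the old one.
cache = {}

def store_results(query_id, host_id, rows):
    """Overwrite any previously stored rows for this query/host pair."""
    cache[(query_id, host_id)] = rows

store_results(1, 42, [{"uptime": "100"}])
store_results(1, 42, [{"uptime": "200"}])  # replaces the earlier result
print(cache[(1, 42)])  # -> [{'uptime': '200'}]
```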

Parent epic

Related

UI Children

CLI children

erikng commented 2 years ago

I like this idea a lot. Some questions/ideas:

Doesn't this also get rid of the need for additional_queries that you currently offer?

Now that you'll have these query results stored and immediately accessible for all queries by default, it doesn't make as much sense to keep that functionality. Perhaps you can port these additional queries to the new methodology upon upgrade, and keep them in the current settings so people can migrate over.

What if we took your idea even further and also merged Policies into it?

If you're already going to store this data for all queries by default, perhaps the mutation of the data could be done after the fact.

Example: SELECT 1 from screenlock WHERE grace_period LIKE '5' LIMIT 1;

Could be turned into

select grace_period from screenlock; and then, within the query UI, you add additional rules: the value type returned and the value you expect. By doing it this way, you can now do the following:

  1. Admin runs a test query and gets the data they want
  2. Admin decides they like the query and deploys it company wide
  3. Manager asks admin to turn this query into a policy
  4. Admin adds the new mutation logic within the FleetUI and Fleet immediately processes all saved device data for pass/fail criteria

Doing it this way would greatly speed up converting queries into policies.
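The workflow above could be sketched roughly like this (a hypothetical server-side evaluation, not Fleet's implementation): the data-collection query (e.g. select grace_period from screenlock;) is cached per host, and a pass/fail rule is applied afterward to the already-collected rows.

```python
# Hypothetical sketch: evaluate a policy rule against cached query
# results, so converting a query into a policy needs no new collection.
def evaluate_policy(cached_rows_by_host, column, expected):
    """Return pass/fail per host from already-collected rows."""
    results = {}
    for host_id, rows in cached_rows_by_host.items():
        # A host passes if any cached row has the expected value.
        results[host_id] = any(r.get(column) == expected for r in rows)
    return results

cached = {
    "host-a": [{"grace_period": "5"}],
    "host-b": [{"grace_period": "60"}],
}
print(evaluate_policy(cached, "grace_period", "5"))
# -> {'host-a': True, 'host-b': False}
```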

Would these queries show up in the device API and be streamed to Kafka?

mikermcneil commented 2 years ago
  1. Def
  2. Good point. Maybe a wave 2 thing to consider, just due to the “additional rules” (right now, SQL is interpreted only on the monitored host)
  3. It would show up in the Fleet API, and if data collection automations are enabled for the query, then it would flow to the log destination (i.e., Kafka) each time results are captured (where differential vs. snapshot is a property of the automation)
mikermcneil commented 2 years ago

More feedback / related conversations in the wild:

mikermcneil commented 2 years ago

6:48 AM Contributor from F1000 organization

ok yeah, I understand a bit better, but then what differentiates a scheduled query vs. an ad hoc one? Sounds like they're both scheduled. As an aside, larger fleets running this would probably need larger cache clusters.

2:02 PM mikermcneil

but then what differentiates a scheduled query vs. an ad hoc one?

Basically every query would be scheduled by default, in terms of collecting data automatically (only the most recent result for each host). If you want to turn that off, you could still do so, and use it only for traditional live querying (with the target picker where you select hosts). And then, like how policies and vulnerability automations work, your control over the flow of data into your log destination would be governed by your "query automations". So you can still choose whether to have results flow into the log destination or not on a query-by-query basis.

As an aside, larger fleets running this would probably need larger cache clusters.

Totally. That's zwass's thinking too. We'd need to do some smart things to help make it clear what the impact of running a query is, and only maintainers would be able to author new queries. The data would likely be in MySQL or Redis. We want to avoid adding another infra dependency for folks to contend with, if possible.

2:05 PM Contributor from F1000 organization

I think that makes sense to me in some aspects. Ideally there should just be "queries" that you can schedule or run in real time (ad hoc). If I go back to "what problem are you trying to solve?", it's essentially scheduling queries to collect data as hosts come online to Do Things With™

noahtalerman commented 2 years ago

Hey @mike-j-thomas when you get the chance, can I please get your help on the following UI changes?

The "How?" section in this issue's description gives a longer walkthrough of what we're trying to accomplish with these changes.

  1. As a user writing/testing a query, I don't need to see "Frequency," "Platforms," and "Minimum osquery version" options. This is because these options are used to adjust/tune how often and on what hosts the query runs. I want to adjust/tune these settings after I've tested and saved the query.

  2. As a user viewing my query results, I don't need to see the SQL editor. This is because the SQL editor is used when I'm writing/testing a query. I can edit my query if I want to update the SQL to add or remove a column.

Current drafted UI changes for the Query page are here: https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?node-id=9114%3A293112 [Screenshot of the drafted Query page changes, 2022-08-22]

noahtalerman commented 2 years ago

Hey @mike-j-thomas heads up, please ignore the first set of UI changes (number 1 in the above comment). These UI changes are no longer relevant.

The second UI change (number 2) is still relevant. It would be great to get your help with this.

Number 1 is no longer relevant because we decided to remove the "Frequency," "Platforms," and "Minimum osquery version" options from the UI.

noahtalerman commented 2 years ago

Feedback from Mike McNeil on current Figma wireframes (2022-08-30).

mike-j-thomas commented 2 years ago

Hey @noahtalerman, is this feedback for me? If it is, I need to schedule a time to discuss it with you.

noahtalerman commented 2 years ago

@mike-j-thomas this comment is feedback for me: https://github.com/fleetdm/fleet/issues/6716#issuecomment-1234302935

mikermcneil commented 2 years ago

More feedback from a senior detection and response engineer:

this is VERY cool.

I love the idea of caching the results, and never having to leave the platform.

latest result is something to go on, that gives you the idea if something was different the last time you checked.

noahtalerman commented 2 years ago

@mikermcneil heads up, I'm moving your "How?" and "Example scenario" sections below for safekeeping (removed from the issue description). This is because I'd like the issue description to reflect the latest plan.

How?

Here is a short video showing how this could fit into the Fleet user interface: https://www.loom.com/share/9772acb4a37a4556a69c27bb990c5501

Rather than adding custom widgets on the logged-in homepage, or adding more surface area / sprawl to the product through additional custom reporting, there's another way to approach this that we've discussed before. Here's a fresh take on how we might execute that:

Example scenario

I wondered: Can @GuillaumeRoss see login attempts to my computer? Could a Fleet user create a world where, if their 1-year-old pounding on the keyboard causes 20 failed login attempts, a policy automation gets triggered?

I decided to run a query against Fleet's macOS laptops, just to see what came back.

Some observations:

  • Since only two computers (mine and Reed's) were online when I ran the query, and Fleet's live queries only return data from devices that are currently online, my experience was a little disappointing.
  • Like, I could set up a policy in Fleet to do some simple "more than 5 login attempts or no?" reporting on this, and I could use policy automations to set up an alert when there are more.
  • And that's all great... but to get the motivation to do that, I want to be able to explore what the data looks like (like, what's normal? What's Tim's computer like? Is this number higher for people with young kids?)
  • So then it made me think "ok, I can just make a scheduled query, then I'll explore it in my log destination" (but then I'm leaving Fleet to go into some other SIEM software to set up a custom report, and then waiting for the scheduled queries to run before I see anything useful)

There's gotta be a better way.


noahtalerman commented 2 years ago

#7766 and "See query results on the Host details page" are deprioritized.

I removed this issue from the roadmap board because it will just sit on the board until the above is prioritized.

noahtalerman commented 2 years ago

I pulled the following feedback on "See query results on the Host details page" (phase 3) out of the product design review doc (noahtalerman 2022-10-13):

[Screenshot: feedback pulled from the product design review doc, 2022-10-13]

mikermcneil commented 1 year ago

@dherder

lukeheath commented 1 year ago

@zhumo I'm assigning this to us as the DRIs for moving this back through the design/spec process.

zwass commented 1 year ago

This is going to be a performance-intensive feature on the backend. We should endeavor to design the UX and engineer the backend to minimize that cost.

Scheduled queries vs. distributed queries

Which of the two ways to run queries should this be backed with? My instinct is distributed queries. The primary reasons being:

  1. Scheduled queries generate logs even when the host is offline. Fleet will then have to ingest entire batches of logs when we are only interested in the most recent state. This is doubly problematic because when ingesting any log request from a host we cannot be certain that the host does not have further logs (those generated from later runs of the query) buffered. That means that we might have to store values that we then immediately throw out.
  2. The way that schedules work does not provide ideal ergonomics for this feature, IMO. Scheduled query intervals run on "ticks" that only pass when the host is online, so if a host returns from a long period offline, it will not update immediately. As a user, I would want it to update immediately.
  3. Fleet currently allows almost all features to be used even if scheduled query logs are sent to any of osquery's built-in logging plugins besides the typical TLS (which logs through Fleet). If we use distributed queries, this feature will also work regardless of how osquery's logging is configured.

Any of the above concerns could be addressed by changes in osquery if we found another important reason to use scheduled queries.

The possible advantage of scheduled queries is that it seems useful to be able to process differential results to update the datastore. I suggest we do not try to do this, however. See next section:

Result storage in MySQL

My experience with storing high volumes of data in MySQL is that the primary problem we have run into is lock contention. Because of this, I think we should try to make host check-ins append-only, or at least try to avoid transactions (e.g., inserts are expected; definitely don't make any updates). It might make sense to have the API then return only the most recent result(s) from each host, and have some sort of background cleanup job for the old results to keep those queries (and the potential locks they induce) out of the critical path of host check-in.
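The append-only pattern described above can be sketched as follows. This is an assumed schema (SQLite in memory, for illustration only, not Fleet's MySQL tables): check-ins only INSERT, reads select the newest row per host, and a separate cleanup step deletes superseded rows outside the check-in path.

```python
# Sketch of the append-only result store: writes never UPDATE,
# so host check-ins avoid row-lock contention on the hot path.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE query_results (
    id INTEGER PRIMARY KEY, query_id INT, host_id INT,
    data TEXT, created_at INT)""")

def record(query_id, host_id, data, ts):
    # Check-in path: pure INSERT, no transactions or updates.
    db.execute(
        "INSERT INTO query_results (query_id, host_id, data, created_at)"
        " VALUES (?, ?, ?, ?)", (query_id, host_id, data, ts))

def latest(query_id):
    # Read path: only the most recent row per host.
    return db.execute("""SELECT host_id, data FROM query_results q
        WHERE query_id = ? AND created_at = (
            SELECT MAX(created_at) FROM query_results
            WHERE query_id = q.query_id AND host_id = q.host_id)""",
        (query_id,)).fetchall()

def cleanup(query_id):
    # Background job: drop superseded rows, off the critical path.
    db.execute("""DELETE FROM query_results
        WHERE query_id = ? AND created_at < (
            SELECT MAX(created_at) FROM query_results r
            WHERE r.query_id = query_results.query_id
              AND r.host_id = query_results.host_id)""", (query_id,))

record(1, 7, "old", 100)
record(1, 7, "new", 200)
print(latest(1))  # -> [(7, 'new')]
```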

Additionally, we may want to run some experiments on how MySQL write throughput compares when inserting each row returned from osquery as a separate row in MySQL vs. storing all of the osquery rows together in a JSON column in MySQL.

Generally I think @sharon-fdm's approach for result storage is a good design, but I don't think it yet specifies some of the things discussed above, and this may be best informed by running some experimentation.

Result size limits

If we don't have size limits on the results, inevitably we will see performance problems (occasionally severe) when people try to store huge amounts of data (intentionally or not). If we do have size limits, inevitably someone will come up with a use-case that cannot be achieved. Pick our poison, I think. My preference would be to set a limit (either number of rows or number of bytes of data) because I'd rather see feature disappointment than outages. If there are limits we will want to be careful about the UX so that users understand when and how data is truncated (potential questions: If a host returns too much data do we include some rows but truncate the rest? Do we throw out all of the rows for that host? Is the limiting even on a per-host basis? How can I tell when I've exceeded limits?)
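One possible shape for such a limit (the constant and function names here are hypothetical, and this picks one answer to the truncation questions above: keep the first N rows per host and flag the cut-off so the UI can surface it):

```python
# Hypothetical per-host row cap: keep the first N rows and report
# whether truncation happened, so the UI can tell the user.
MAX_ROWS_PER_HOST = 1000

def apply_limit(rows, limit=MAX_ROWS_PER_HOST):
    """Return (kept_rows, truncated_flag) for one host's results."""
    if len(rows) <= limit:
        return rows, False
    return rows[:limit], True

rows = [{"n": i} for i in range(1500)]
kept, truncated = apply_limit(rows)
print(len(kept), truncated)  # -> 1000 True
```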

lukeheath commented 1 year ago

@zwass Thanks for the info! Are distributed queries the same as live queries? (i.e., a query I run on demand and get responses from online hosts only)

zwass commented 1 year ago

Essentially, yes. The distinction is that the distributed query APIs are what Fleet uses to implement "live queries". Distributed queries are the mechanism by which osquery asks on a recurring basis for any queries to run immediately. Within Fleet this mechanism is used for live queries, host vitals, and policies.

zhumo commented 1 year ago

This issue is a parent epic which organizes the child stories, but should not go on the product board. Each phase is the child story which will be designed and shipped.

zhumo commented 1 year ago

Removing the product label from this issue. This issue is a parent epic which captures all phases in a child story. Each child story should go through the board.

fleet-release commented 7 months ago

All queries now yield,
Insights across all hosts gleaned,
Cloud city's data field.