Closed: mikermcneil closed this issue 7 months ago.
I like this idea a lot. Some questions/ideas:

What happens to the additional_queries functionality that you currently offer? Now that results will be stored immediately and accessible for all queries by default, it doesn't make as much sense to keep that functionality. Perhaps you could port those additional queries over to this new approach on upgrade, and keep them in the current settings so people can migrate over.

How do Policies tie into this? If you're already going to store this data for all queries by default, perhaps the mutation of the data could be done after the fact.
Example: SELECT 1 FROM screenlock WHERE grace_period LIKE '5' LIMIT 1;
could be turned into
SELECT grace_period FROM screenlock;
and then you add additional rules within the query UI, specifying the value type returned and the value you expect (see the sketch below). Doing it this way would greatly speed up converting queries into policies.
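To make that concrete, here is a minimal sketch of how a rule could be evaluated against cached results after the fact. The table and column names (query_results_cache, query_name, column_name, value) are assumptions for illustration only, not Fleet's actual schema.

```sql
-- Data-collection query, scheduled for every host and cached:
SELECT grace_period FROM screenlock;

-- A rule defined in the query UI ("integer, expected <= 5") could then be
-- evaluated against the cached values, e.g. which hosts pass the check.
-- Hypothetical cache table; not Fleet's real schema.
SELECT
  host_id,
  CAST(value AS UNSIGNED) <= 5 AS passes
FROM query_results_cache
WHERE query_name = 'screenlock_grace_period'
  AND column_name = 'grace_period';
```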
More feedback / related conversations in the wild:
Ok yeah, I understand a bit better, but then what differentiates a scheduled query vs. ad hoc? Sounds like they're both scheduled. As an aside, larger fleets running this would probably need larger cache clusters.
but then what differentiates a scheduled query vs. ad hoc?
Basically every query would be scheduled by default, in terms of collecting data automatically (only the most recent result for each host). If you want to turn that off, you could still do so and use it only for traditional live querying (with the target picker where you select hosts). And then, like how policies and vulnerability automations work, the flow of data into your log destination would be controlled by your "query automations". So you can still choose, on a query-by-query basis, whether to have results flow into the log destination.
As an aside, larger fleets running this would probably need larger cache clusters.
Totally. That's zwass's thinking too. We'd need to do some smart things to help make it clear what the impact of running a query is, and only maintainers would be able to author new queries. The data would likely be in MySQL or Redis. We want to avoid adding another infra dependency for folks to contend with, if possible.
I think that makes sense to me in some aspects. Ideally there should just be "queries" that you can schedule or run in real time (ad hoc). If I go back to "what problem are you trying to solve?", it's essentially scheduling queries to collect data as hosts come online to Do Things With™.
Hey @mike-j-thomas when you get the chance, can I please get your help on the following UI changes?
The "How?" section in this issue's description gives a longer walk through on what we're trying to accomplish with these changes.
As a user writing/testing a query, I don't need to see "Frequency," "Platforms," and "Minimum osquery version" options. This is because these options are used to adjust/tune how often and on what hosts the query runs. I want to adjust/tune these settings after I've tested and saved the query.
As a user viewing my query's results, I don't need to see the SQL editor. This is because the SQL editor is used when I'm writing/testing a query. I want to edit my query only when I want to update the SQL to remove or add a column.
Current drafted UI changes for the Query page are here: https://www.figma.com/file/hdALBDsrti77QuDNSzLdkx/%F0%9F%9A%A7-Fleet-EE-(dev-ready%2C-scratchpad)?node-id=9114%3A293112
Hey @mike-j-thomas heads up, please ignore the first set of UI changes (number 1 in the above comment). These UI changes are no longer relevant.
The second UI change (number 2) is still relevant. It would be great to get your help with this.
Number 1 is no longer relevant because we decided to remove the "Frequency," "Platforms," and "Minimum osquery version" options from the UI.
Feedback from Mike McNeil on current Figma wireframes (2022-08-30).
Hey @noahtalerman, is this feedback for me? If it is, I need to schedule a time to discuss it with you.
@mike-j-thomas this comment is feedback for me: https://github.com/fleetdm/fleet/issues/6716#issuecomment-1234302935
More feedback from a senior detection and response engineer:
this is VERY cool.
I love the idea of caching the results, and never having to leave the platform.
The latest result is something to go on; it gives you an idea of whether something was different the last time you checked.
@mikermcneil heads up, I'm moving your "How?" and "Example scenario" sections below for safekeeping (removed from the issue description). This is because I'd like the issue description to reflect the latest plan.
Here is a short video showing how this could fit into the Fleet user interface: https://www.loom.com/share/9772acb4a37a4556a69c27bb990c5501
Rather than adding custom widgets on the logged-in homepage, or adding more surface area / sprawl to the product through additional custom reporting, there's another way to approach this that we've discussed before. Here's a fresh take on how we might execute that:
Packs applied via fleetctl are imported as global queries going forward, instead of as query packs, unless a special flag is used.

I wondered: Can @GuillaumeRoss see login attempts to my computer? Could a Fleet user create a world where, if their 1-year-old pounding on the keyboard causes 20 failed login attempts, a policy automation gets triggered?
I decided to run a query against Fleet's macOS laptops, just to see what came back.
Some observations:
- Since only two computers (mine and Reed's) were online when I ran the query, and Fleet's live queries only return data from devices that are currently online, my experience was a little disappointing.
- Like, I could set up a policy in Fleet to do some simple "more than 5 login attempts or no?" reporting on this, and I could use policy automations to set up an alert when there are more than that (a rough policy shape is sketched after this list).
- And that's all great... but to get the motivation to do that, I want to be able to explore what the data looks like (like, what's normal? What's Tim's computer like? Is this number higher for people with young kids?).
- So then it made me think "ok, I can just make a scheduled query, then I'll explore it in my log destination" (but then I'm leaving Fleet to go into some other SIEM software to set up a custom report, and then waiting for the scheduled queries to run before I see anything useful).
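For reference, a policy along those lines might be shaped roughly like the sketch below. The failed_login_attempts table and attempt_time column are placeholders for whatever data source ends up holding this, not a specific osquery table; only the pass/fail query shape is the point.

```sql
-- Hypothetical policy: pass when the host saw 5 or fewer failed login
-- attempts in the last hour. `failed_login_attempts` is a placeholder,
-- not a real osquery table; `time` is osquery's built-in table.
SELECT 1 WHERE (
  SELECT COUNT(*)
  FROM failed_login_attempts
  WHERE attempt_time > (SELECT unix_time - 3600 FROM time)
) <= 5;
```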
There's gotta be a better way.
I removed this issue from the roadmap board because it will just sit on the board until the above is prioritized.
I pulled the following feedback on "See query results on the Host details page" (phase 3) out of the product design review doc (noahtalerman 2022-10-13):
@dherder @zhumo I'm assigning this to us as the DRIs for moving this back through the design/spec process.
This is going to be a backend performance-intensive feature. We should endeavor to design the UX and engineer the backend to minimize the performance impact.
Which of the two ways to run queries should this be backed with? My instinct is distributed queries. The primary reasons being:
Any of the above concerns could be addressed by changes in osquery if we found another important reason to use scheduled queries.
The possible advantage of scheduled queries is that it seems useful to be able to process differential results to update the datastore. I suggest we do not try to do this, however. See next section:
My experience with storing high volumes of data in MySQL is that the primary problem we have run into is lock contention. Because of this, I think we should try to make host check-ins append-only, or at least avoid transactions (e.g., inserts are expected; definitely don't make any updates). It might make sense to have the API then return only the most recent result(s) from each host, and have some sort of background cleanup job for the old results to keep those queries (and the potential locks they induce) out of the critical path of host check-in.
Additionally, we may want to run some experiments on how MySQL write throughput compares when inserting each row returned from osquery as a separate row in MySQL vs. storing all of the osquery rows together in a JSON column in MySQL.
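As a rough illustration of both points (append-only writes plus a most-recent-per-host read path, and the payload layout question), here is a minimal sketch. The schema is invented for illustration; table and column names are not Fleet's actual design.

```sql
-- Append-only: host check-ins only INSERT, never UPDATE, to keep lock
-- contention out of the critical path. A background job prunes
-- superseded rows later.
CREATE TABLE query_results (
  id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
  query_id   INT UNSIGNED NOT NULL,
  host_id    INT UNSIGNED NOT NULL,
  -- Payload layout is what the write-throughput experiment compares:
  -- either one MySQL row per osquery result row, or all rows from one
  -- check-in stored together in this JSON column.
  data       JSON NOT NULL,
  created_at TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
  KEY idx_query_host_time (query_id, host_id, created_at)
);

-- Read path: return only the most recent result per host for a query.
SELECT qr.host_id, qr.data, qr.created_at
FROM query_results qr
JOIN (
  SELECT host_id, MAX(created_at) AS latest
  FROM query_results
  WHERE query_id = 42
  GROUP BY host_id
) m ON m.host_id = qr.host_id AND m.latest = qr.created_at
WHERE qr.query_id = 42;
```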
Generally I think @sharon-fdm's approach for result storage is a good design, but I don't think it yet specifies some of the things discussed above, and this may be best informed by running some experimentation.
If we don't have size limits on the results, we will inevitably see performance problems (occasionally severe) when people try to store huge amounts of data (intentionally or not). If we do have size limits, someone will inevitably come up with a use case that cannot be achieved. Pick our poison, I think. My preference would be to set a limit (either number of rows or number of bytes of data) because I'd rather see feature disappointment than outages. If there are limits, we will want to be careful about the UX so that users understand when and how data is truncated. Potential questions: If a host returns too much data, do we include some rows but truncate the rest? Do we throw out all of the rows for that host? Is the limiting even on a per-host basis? How can I tell when I've exceeded limits?
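One way to keep truncation visible to users, sketched on top of the hypothetical table above (the truncated flag and the idea of a per-host cap are illustrations, not a proposed design):

```sql
-- Record that a host's result set was cut off at the cap, so the UI
-- can surface it. Column and semantics are hypothetical.
ALTER TABLE query_results
  ADD COLUMN truncated TINYINT(1) NOT NULL DEFAULT 0;

-- Which hosts exceeded the limit for a given query?
SELECT DISTINCT host_id
FROM query_results
WHERE query_id = 42 AND truncated = 1;
```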
@zwass Thanks for the info! Are distributed queries the same as live queries? (i.e., a query I run on demand and get responses from online hosts only)
Essentially, yes. The distinction is that the distributed query APIs are what Fleet uses to implement "live queries". Distributed queries are the mechanism by which osquery checks in on a recurring basis and asks the server for any queries it should run immediately. Within Fleet, this mechanism is used for live queries, host vitals, and policies.
This issue is a parent epic which organizes the child stories, but should not go on the product board. Each phase is the child story which will be designed and shipped.
Removing the product label from this issue. This issue is a parent epic which captures all phases as child stories. Each child story should go through the board.
All queries now yield,
Insights across all hosts gleaned,
Cloud city's data field.
UPDATE: Closed this issue because all of the stories included in this issue are shipped.
(noahtalerman 2024-04-12)
Problem
When I run a live query in Fleet, I only see data for the hosts that are online right now. On an average Monday morning, only 20% of my hosts are online.
This makes it hard to see query results for all of my hosts.
Please watch: https://www.loom.com/share/9772acb4a37a4556a69c27bb990c5501
Goal
Add the ability to see the latest query results for all of my hosts so that I can explore, ask questions, obtain insights, and plan automations without having to go into another tool (my log destination).
Parent epic
#7079
Related
UI children
#7765 (phase 1)
#7766 (phase 2)
CLI children
#6024