Ability to transform data returned from cached reports

dherder commented 1 week ago

Problem

Today Fleet is great at returning data via live queries and scheduled queries. Scheduled queries have improved such that we are able to collect and store results in Fleet rather than just export to a logging pipeline. These are called cached query reports.

As a Security or IT admin, I would like to be able to query across these reports. For example, I want to build a query report that gets information about the kernel version of all of my devices. With the resulting data in the cached report, I want to be able to derive further insights from that data by transforming with SQL. I don't want to have to build separate queries that run on devices to obtain these insights. Using the above example, SELECT COUNT(*) FROM "kernel_version_query_data_table" WHERE version > "23.6.0" would allow the admin to easily count the distribution of all the hosts at a particular kernel version.

What have you tried?

Today, customers leverage products like wuzah that have dashboards that sit on top of the osquery data. Additionally, customers export this data out to a logging pipeline and transform it with automation and data management (BI) layers.

nonpunctual commented 1 week ago

I think it would be ideal if these results were simply cached as a table & that a single query could parse the results as in @dherder 's example. The product "implies" this is possible by returning filtered results in a UI table.

Still good, but less than ideal, would be a live query UI or even a search bar on the returned results of any query with exportable results for the 2nd query.

nonpunctual commented 1 week ago

noahtalerman commented 1 week ago

I want to be able to derive further insights from that data by transforming with SQL. I don't want to have to build separate queries that run on devices to obtain these insights.

This sounds like a really powerful feature.

Using the above example, SELECT COUNT(*) FROM "kernel_version_query_data_table" WHERE version > "23.6.0"

@dherder in this example, I think you can achieve this in a workaround way by clicking on the filter above the version column and typing "23.6", "23.7", "23.8", ... until there are no results. This is definitely a workaround but I just want to call out that the current filtering could be used for some use cases in a pinch.

Today, is there a third-party SQL tool in which we could import the results? Can you write SQL queries in Google Sheets?

noahtalerman commented 1 week ago

cc @nonpunctual ^^

nonpunctual commented 1 week ago

@noahtalerman There 0 scenarios in which your workaround would be viable for prospect-salix at scale.

"Clicking" in the UI was the 1st thing mentioned by them as an objection to the current capability.
For 100s or 1000s of devices it's not practical.
There are no data export options even if clicking was viable.

dherder commented 1 week ago

@noahtalerman In addition to @nonpunctual's comments, I'd like to add that the example quoted above is really just a simple example that illustrates the problem from a data perspective. A more realistic example would be a JOIN on a specific cached query result table to another cached query result "temp table". This cross-result ETL would be a huge advantage for all of our customers.

noahtalerman commented 1 week ago

A more realistic example would be a JOIN on a specific cached query result table to another cached query result "temp table". This cross-result ETL would be a huge advantage for all of our customers.

Thanks @dherder! What is prospect-salix trying to do exactly?

I'm trying to understand the problem they're running into today before we get into the "how" (solution).

noahtalerman commented 6 days ago

Thanks for the use case @nonpunctual!

count how many computers are on version 14.6.1

What if I type "14.6.1" into the "version" column? That would filter results and I would see how many results are there up here:

Screenshot 2024-09-11 at 9 50 43 AM

every row in the visible "table" is counted once on a separate row (because each row represents a host)

I might be misunderstanding your comment but to clarify, each row represents a result (not a host). A host can return many results.

Although in this example, I think it will always be 1 results to 1 host. I'm guessing this is a query to get the macOS version.

I realize this example might not get at all the use cases the prospect is trying to achieve.

Assuming I'm understanding correctly that the macOS version use case is possible, are there other use cases that aren't possible? I think we want to understand the exact problem they're running into today before we get into the "how" (solution).

nonpunctual commented 6 days ago

@mikermcneil @noahtalerman @alexmitchelliii @dherder @zwass @getvictor @lukeheath

Example:

If the data returned to Fleet could be queried, SELECT count(*)& GROUP BY version would sum the total number of computers on macOS version 14.6.1.

Instead, I get a count of 1 for each unique host. This is expected behavior in Fleet today. The count of 1 for each host is correct. Each 14.6.1 match is being counted on each host.

Results are filtered for the Fleet UI view, but, "DOA" for further parsing.

This is counterintuitive given how subselects & CTEs work if this query were run against an actual SQL db. Thanks.

This is what prospect-salix is asking about on this call: https://us-65885.app.gong.io/call?id=2382501646065789366&highlights=%5B%7B%22type%22%3A%22SHARE%22%2C%22from%22%3A745%2C%22to%22%3A780%7D%5D

SQL queries that run directly on a SQL database can use common table expressions or subselects
- The returned result of a CTE is effectively a table
- The returned result of a subselect is effectively a table
When a query in Fleet is run a result appears as a “row” in a “table” in Fleet UI
- "row" & "table" are intentionally put in quotes for this example
- The "Results" count = the number of hosts that match the query & the selected Fleet parameters
- A single host can't return many "Results"
- A single host CAN return as many columns as the query specifies
The results of a query as returned to Fleet are not a actually a table
- The columns can be sorted
- But, the returned data can't be parsed any further by a SQL query

mikermcneil commented 5 days ago

@nonpunctual call?

fleetdm / fleet

Ability to transform data returned from cached reports #21846

Problem

What have you tried?