Closed crahan closed 4 years ago
When you search using the UI or its associated API, the API responses will always report how many matches the search found in total (all found results), but can usually only return a subset of those results in a performant manner.
Over large data sets, the API response will generally skew towards a subset of found results. To obtain more of that data, you have a few options:
Thanks for the response, Mike.
For item 1, while I understand the performance argument, if the platform can return 18.5k results when the total result set is 12 million, why can't it return a full 3k result set? Knowing what to narrow down on requires having the data to make that determination. `parent_name` and `process_cmdline` aren't available as facets as far as I know, so the only way to narrow search results using exclusions based on those fields is to first review what's available to filter on.
In my testing (as per the example above) the UI reported 3,413 results but only showed a subset of 516 results. The same query via the API returned 527 results. REST API responses for the Process Search v3 API do indeed include `num_found` and `num_available` fields indicating how many results were found vs how many can be retrieved, but how do I access this information using the Python API? Running a process query via the Cb ThreatHunter process query functionality gives me an iterator on which I can call `len()`, but that value seems to align with the smaller 516 results from the UI test, not the larger 3,413 results.
https://github.com/carbonblack/cbapi-python/issues/229 also makes things worse, as Process search results are duplicated for each watchlist hit instead of having a single Process object with a `watchlist_hit` property containing one list of all the watchlists and reports that matched. Because those processes are returned as duplicates, each with a single watchlist match, the 516 or 527 results (depending on UI vs API) contain several entries with exactly the same data except for the watchlist hit information. This effectively reduces the number of unique process results by a factor of 3 or 4, depending on the number of watchlist hits for each process. In Cb Response we had a `group_by()` option in the Python API (and the UI). This is not available in Cb ThreatHunter.
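In the absence of a server-side `group_by()`, the duplicated rows can be collapsed client-side. A minimal sketch, assuming each result is a dict carrying the `process_guid` and `watchlist_hit` fields discussed in this thread (adjust field names to the actual schema your instance returns):

```python
def merge_watchlist_hits(processes):
    # Collapse duplicated rows (one per watchlist hit) into a single
    # record per process_guid with a combined "watchlist_hit" list.
    merged = {}
    for proc in processes:
        guid = proc["process_guid"]
        entry = merged.setdefault(guid, {**proc, "watchlist_hit": []})
        hit = proc.get("watchlist_hit")
        if hit is not None and hit not in entry["watchlist_hit"]:
            entry["watchlist_hit"].append(hit)
    return list(merged.values())
```

This gives one record per unique process, with all watchlist hits gathered into a single list, which is roughly what a `watchlist_hit` list property on the Process object would look like.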
To give some context for why `num_found` differs from `num_available`: each searcher holds a unique slice of the data and has a maximum number of results it can fetch. If, say, 1.3k matching results are heavily concentrated on a single searcher, that searcher will max out its capacity and the remainder will cause `num_found` to be greater than `num_available`.
As for the `watchlist_hit` duplication and the lack of a `group_by`, we have passed this to our backend team for further investigation.
Hi @crahan. Sorry for the late response here, but a couple of things:

…`rows` when you create the search job.

Thanks for clarifying @jrhackett. What I'm noticing when using the v2 search updates that were just merged in the development branch is that the default maximum number of rows returned is 500 (probably the default value when `rows` is absent). When testing directly with the REST API it appears that, while v1 used `rows` and `start` to allow getting the next batch of results until `num_available` was reached, v2 always sets `num_available` equal to the value of `rows` that was requested.
So when I make a request that specifies `rows` as 100, I get 100 results back (as expected), but `num_available` is also 100 (even if there are more available). The way it should work is that `num_found` reports how many are in the system, `num_available` is the maximum number that can be retrieved (currently capped at 10000), and the number of processes returned equals the value requested via `rows`.
Here's what I get back when I request `process_name:cmd.exe` without a value for `rows` and there are more than 10000 results on the server:
"num_found":44539, "num_available":500, "contacted":14, "completed":14
The expected outcome (in my mind) should be:
"num_found":44539, "num_available":10000, "contacted":14, "completed":14
And then 500 processes in the results. In the second case I know to keep making API calls for the next batch of 500 results until I've retrieved all 10000 indicated by `num_available`. If the server sets `num_available` to match the value I select for `rows`, there's no way to retrieve 10000 results in batches of whatever size the user prefers, and we're forced to set `rows` in our requests to the maximum value of 10000 by default (to get the most value).
I think the issue may be where you're providing `rows`. You need to provide the 10k when creating the search job, and then you use `start`/`rows` when getting results to page. For example:
Provide 10k for this route: https://developer.carbonblack.com/reference/carbon-black-cloud/cb-threathunter/latest/process-search-v2/#start-a-process-search-job
Then page through the results with whatever start/rows combo you want via this route: https://developer.carbonblack.com/reference/carbon-black-cloud/cb-threathunter/latest/process-search-v2/#get-process-search-results
If you don't supply rows to the first route, it'll default to 500.
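The paging half of the flow described above can be sketched as a small loop that keeps requesting batches until `num_available` is exhausted. To keep the sketch self-contained and testable, the HTTP call to the get-results route is abstracted behind a `fetch(start, rows)` callable (an assumption, not part of the actual API client); in practice that callable would issue the GET against the process search results endpoint:

```python
def page_results(fetch, batch=500):
    # Page through a completed search job until everything the server
    # will hand back (num_available) has been collected. `fetch` is any
    # callable taking (start, rows) and returning the decoded JSON of
    # the get-results route, e.g. {"num_available": N, "results": [...]}.
    start, collected = 0, []
    while True:
        resp = fetch(start, batch)
        rows = resp.get("results", [])
        collected.extend(rows)
        start += len(rows)
        if not rows or start >= resp["num_available"]:
            break
    return collected
```

Because the job was created with `rows` set to 10k, `num_available` reflects the true retrievable count, and the batch size used here is purely the caller's preference.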
I'll make a PR to set the default value for `rows` in the `_submit()` function to 10000 (`args['rows'] = 10000` after line 417):
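The cbapi internals aren't reproduced here, but as a standalone illustration of the proposed default (function and constant names are hypothetical, not cbapi's), the idea is simply to fill in the server's maximum when the caller omits `rows` from the job-creation body:

```python
MAX_ROWS = 10000  # server-side cap on retrievable results per search job

def build_search_body(query, rows=None):
    # Default "rows" to the cap when the caller omits it, so the server
    # doesn't silently limit num_available to its own 500 default.
    return {"query": query, "rows": rows if rows is not None else MAX_ROWS}
```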
Initial testing shows that this caps the maximum available results at 10000, as expected. Looking at the API calls made by the UI, this seems to be how search queries are submitted there as well. The retrieval of the search results can remain in batches of 10, I think (because those are parameters on the search result retrieval endpoint, as you already mentioned):
Cool, I'll leave @avanbrunt-cb and/or the rest of the developer relations team to review that change. In the meantime, I think we can close this issue. Feel free to reopen if necessary.
During testing in both the Cb ThreatHunter UI and Python API, we're running into cases where the UI notes 'X results. Showing Y. Refine search to view additional results'.
In one example the UI says '3,413 results. Showing 516. Refine search to view additional results' for the query `(process_name:cmd.exe) AND device_timestamp:[2020-05-09T18:42:44 TO 2020-06-08T18:42:44] -enriched:True`. Running the same query through the Python API returns 527 Process objects when looping over the iterator. This appears to indicate that the API is also not returning the full result set.
The limitation of 527 and 516 results seems strange, as a previous query (on a different Cb ThreatHunter instance) yielded the message '12,213,072 results. Showing 18,500. Refine search to view additional results' and a lot more results via the API as well. If the platform is able to return 18,500 results in that particular case, why is it limiting far smaller result sets, like 3,413 results, to only 516?
My question is: how can we be sure, when running a process search via the Cb ThreatHunter Python API, that we're receiving all available results and not just a small subset? Our use case is that we want to retrieve things like all PowerShell download activity and then group those results based on properties which are not available as facets in the platform (e.g. `parent_name`, `process_cmdline`, etc.). Unless we can be sure we're receiving all results within the requested timeframe, the data isn't telling us the full story.
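Once the full result set has been retrieved, the grouping step can be done client-side. A minimal sketch, assuming each result is a dict-like record (the field names `parent_name` and `process_cmdline` come from this thread; everything else is illustrative):

```python
from collections import Counter

def count_by_field(processes, field):
    # Client-side stand-in for the missing group_by(): tally the full
    # result set by any field, e.g. "parent_name" or "process_cmdline".
    return Counter(p.get(field, "<unknown>") for p in processes)
```

This only gives trustworthy numbers if the search genuinely returned every match in the timeframe, which is exactly why the `num_found` vs `num_available` behavior above matters.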