Closed crahan closed 4 years ago
When you search using the UI or its associated API, the API responses will always report how many matches the search found in total (all found results), but can usually only return a subset of those results in a performant manner.
Over large data sets, the API response will generally skew towards a subset of found results. To obtain more of that data, you have a few options:
Thanks for the response, Mike.
For item 1, while I understand the performance argument, if the platform can return 18.5k results when the total result set is 12 million, why can't it return a full 3k result set? Knowing what to narrow down on requires having the data to make that determination. `parent_name` and `process_cmdline` aren't available as facets as far as I know, so the only way to narrow search results using exclusions based on those fields is to first review what's available to filter on.
In my testing (as per the example above) the UI reported 3,413 results but only showed a subset of 516 results. The same query via the API returned 527 results. REST API responses for the Process Search v3 API do indeed include `num_found` and `num_available` fields indicating how many results were found vs how many can be retrieved, but how do I access this information using the Python API? Running a process query via the Cb ThreatHunter process query functionality gives me an iterator on which I can call `len()`, but that value seems to align with the smaller 516 results from the UI test, not the larger 3,413 results.
https://github.com/carbonblack/cbapi-python/issues/229 also makes things worse, as Process search results are duplicated for each watchlist hit instead of having a single Process object with a `watchlist_hit` property containing one list of all the watchlists and reports that matched. Because those processes are returned as duplicates, each with a single watchlist match, the 516 or 527 results (depending on UI vs API) contain several entries with exactly the same data except for the watchlist hit information. This effectively reduces the number of unique process results by a factor of 3 or 4, depending on the number of watchlist hits for each process. In Cb Response we had a `group_by()` option in the Python API (and the UI). This is not available in Cb ThreatHunter.
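In the absence of a server-side `group_by()`, the duplicated rows can be collapsed client-side. A minimal sketch, assuming each result is a dict carrying the `process_guid` and `watchlist_hit` fields discussed in this thread (adjust field names to the actual schema your instance returns):

```python
def merge_watchlist_hits(processes):
    # Collapse duplicated rows (one per watchlist hit) into a single
    # record per process_guid with a combined "watchlist_hit" list.
    merged = {}
    for proc in processes:
        guid = proc["process_guid"]
        entry = merged.setdefault(guid, {**proc, "watchlist_hit": []})
        hit = proc.get("watchlist_hit")
        if hit is not None and hit not in entry["watchlist_hit"]:
            entry["watchlist_hit"].append(hit)
    return list(merged.values())
```

This gives one record per unique process, with all watchlist hits gathered into a single list, which is roughly what a `watchlist_hit` list property on the Process object would look like.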
To give some context for why `num_found` differs from `num_available`: each searcher holds a unique slice of the data and has a maximum number of results it can fetch. If, say, 1.3k matching results are heavily concentrated on a single searcher, that searcher will max out its capacity and the remainder will cause `num_found` to be greater than `num_available`.
As for the `watchlist_hit` duplication and the lack of a `group_by`, we have passed this to our backend team for further investigation.
Hi @crahan. Sorry for the late response here, but a couple of things:

…`rows` when you create the search job.

Thanks for clarifying @jrhackett. What I'm noticing when using the v2 search updates that were just merged in the development branch is that the default maximum number of rows returned is 500 (probably the default value when `rows` is absent). When testing directly with the REST API it appears that, while v1 used `rows` and `start` to allow getting the next batch of results until `num_available` was reached, v2 always sets `num_available` equal to the value of `rows` that was requested.
So when I make a request that specifies `rows` as 100, I get 100 results back (as expected), but `num_available` is also 100 (even if there are more available). The way it should work is that `num_found` reports how many are in the system, `num_available` is the maximum number that can be retrieved (currently capped at 10000), and the number of processes returned equals the value requested via `rows`.
Here's what I get back when I request `process_name:cmd.exe` without a value for `rows` and there are more than 10000 results on the server:
"num_found":44539, "num_available":500, "contacted":14, "completed":14
The expected outcome (in my mind) should be:
"num_found":44539, "num_available":10000, "contacted":14, "completed":14
And then 500 processes in the results. In the second case I know to keep making API calls for the next batch of 500 results until I've retrieved all 10000 indicated by `num_available`. If the server sets `num_available` to match the value I select for `rows`, there's no way to retrieve 10000 results in batches of whatever size the user prefers, and we're forced to set `rows` in our requests to the maximum value of 10000 by default (to get the most value).
I think the issue may be where you're providing `rows`. You need to provide the 10k when creating the search job, and then you use `start`/`rows` when getting results to page. For example:
Provide 10k for this route: https://developer.carbonblack.com/reference/carbon-black-cloud/cb-threathunter/latest/process-search-v2/#start-a-process-search-job
Then page through the results with whatever start/rows combo you want via this route: https://developer.carbonblack.com/reference/carbon-black-cloud/cb-threathunter/latest/process-search-v2/#get-process-search-results
If you don't supply rows to the first route, it'll default to 500.
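The paging half of the flow described above can be sketched as a small loop that keeps requesting batches until `num_available` is exhausted. To keep the sketch self-contained and testable, the HTTP call to the get-results route is abstracted behind a `fetch(start, rows)` callable (an assumption, not part of the actual API client); in practice that callable would issue the GET against the process search results endpoint:

```python
def page_results(fetch, batch=500):
    # Page through a completed search job until everything the server
    # will hand back (num_available) has been collected. `fetch` is any
    # callable taking (start, rows) and returning the decoded JSON of
    # the get-results route, e.g. {"num_available": N, "results": [...]}.
    start, collected = 0, []
    while True:
        resp = fetch(start, batch)
        rows = resp.get("results", [])
        collected.extend(rows)
        start += len(rows)
        if not rows or start >= resp["num_available"]:
            break
    return collected
```

Because the job was created with `rows` set to 10k, `num_available` reflects the true retrievable count, and the batch size used here is purely the caller's preference.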
I'll make a PR to set the default value for `rows` in the `_submit()` function to 10000 (`args['rows'] = 10000` after line 417):
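The cbapi internals aren't reproduced here, but as a standalone illustration of the proposed default (function and constant names are hypothetical, not cbapi's), the idea is simply to fill in the server's maximum when the caller omits `rows` from the job-creation body:

```python
MAX_ROWS = 10000  # server-side cap on retrievable results per search job

def build_search_body(query, rows=None):
    # Default "rows" to the cap when the caller omits it, so the server
    # doesn't silently limit num_available to its own 500 default.
    return {"query": query, "rows": rows if rows is not None else MAX_ROWS}
```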
Initial testing shows that this caps the maximum available results at 10000, as expected. Looking at the API calls made by the UI, this seems to be how search queries are submitted there as well. The retrieval of the search results can remain in batches of 10, I think (because those are parameters on the search result retrieval endpoint, as you already mentioned):
Cool, I'll leave @avanbrunt-cb and/or the rest of the developer relations team to review that change. In the meantime, I think we can close this issue. Feel free to reopen if necessary.
During testing in both the Cb ThreatHunter UI and Python API, we're running into cases where the UI notes 'X results. Showing Y. Refine search to view additional results'.
In one example the UI says '3,413 results. Showing 516. Refine search to view additional results' for the query `(process_name:cmd.exe) AND device_timestamp:[2020-05-09T18:42:44 TO 2020-06-08T18:42:44] -enriched:True`. Running the same query through the Python API returns 527 Process objects when looping over the iterator. This appears to indicate that the API is also not returning the full result set.
The limitation of 527 and 516 results seems strange, as a previous query (on a different Cb ThreatHunter instance) yielded the message '12,213,072 results. Showing 18,500. Refine search to view additional results' and a lot more results via the API as well. If the platform is able to return 18,500 results in that particular case, why is it limiting far smaller result sets, like 3,413 results, to only 516?
My question is: how can we be sure, when running a process search via the Cb ThreatHunter Python API, that we're receiving all available results and not just a small subset? Our use case is that we want to retrieve things like all PowerShell download activity and then group those results based on properties which are not available as facets in the platform (e.g. `parent_name`, `process_cmdline`, etc.). Unless we can be sure we're receiving all results within the requested timeframe, the data isn't telling us the full story.
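Once the full result set has been retrieved, the grouping step can be done client-side. A minimal sketch, assuming each result is a dict-like record (the field names `parent_name` and `process_cmdline` come from this thread; everything else is illustrative):

```python
from collections import Counter

def count_by_field(processes, field):
    # Client-side stand-in for the missing group_by(): tally the full
    # result set by any field, e.g. "parent_name" or "process_cmdline".
    return Counter(p.get(field, "<unknown>") for p in processes)
```

This only gives trustworthy numbers if the search genuinely returned every match in the timeframe, which is exactly why the `num_found` vs `num_available` behavior above matters.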