Netflix-Skunkworks / Scumblr

Web framework that allows performing periodic syncs of data sources and performing analysis on the identified results
Apache License 2.0
2.64k stars, 319 forks

Feature questions - Bulk queries, API, data analytics #4

Open ianshefferman opened 10 years ago

ianshefferman commented 10 years ago

Hi, my team is interested in using Scumblr, but we have a few questions:

  1. Are there plans to add support for bulk queries? For example, we may want to add a new search provider that will do a user search on Twitter. We may want to "follow" multiple users, and add them all in one bulk query, and then filter them as necessary on the results page. So basically some sort of class attribute that would make certain search providers turn the Query box into a multi-line text field, and do a separate search for each of those lines. On the search page it would ideally show up as one entry, but would just be a "multivalue search".

Depending on if my team agrees to go forward with this, I could possibly try and add that functionality myself and submit a pull request.

  2. Is there a supported way to get search results out of the database, perhaps through an API of some kind, or are there plans for such a feature? I could throw up something like Sandman or just write a script to pull data out of the SQLite database directly, but I'm wondering if you guys have any suggestions. An automated CSV export feature (appending all search results to a CSV each time a search completes) would probably do the job as well.

Basically, we'd like to use Scumblr more for the standardized data scraping and searching aspect, and then offload the results to a more powerful analytics platform (like Splunk) for further research.

Thanks.

ahoernecke commented 10 years ago

Makes sense. For question 1, that's not currently possible but I'll add it to the feature request list. Currently each search is essentially one query.

For your second question, there is very rudimentary support now... If you hit /results.json or /results/search.json it will return the results in JSON format instead of the entire web page. Similarly, /results/RESULT_ID.json will give the same information for a single result. Additionally, you should be able to do something like /results/search.json?saved_filter_id=ID to get the results from an existing saved filter.
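A minimal sketch of pulling a saved filter's results from the JSON endpoint mentioned above, for handing off to an external tool. The base URL and the helper names (`saved_filter_uri`, `fetch_results`) are assumptions for illustration. Authentication is not handled here, and the shape of the returned JSON is not specified in this thread, so the parsed value is returned as-is.

```ruby
require "net/http"
require "json"
require "uri"

# Assumption: where your Scumblr instance is reachable.
SCUMBLR_BASE = "http://localhost:3000"

# Build the URL for an existing saved filter (path taken from the comment above).
def saved_filter_uri(filter_id, base = SCUMBLR_BASE)
  uri = URI.join(base, "/results/search.json")
  uri.query = URI.encode_www_form(saved_filter_id: filter_id)
  uri
end

# Fetch and parse the response. No login/session handling is attempted;
# a real deployment will most likely need it.
def fetch_results(uri)
  response = Net::HTTP.get_response(uri)
  raise "Request failed with HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  JSON.parse(response.body)
end

# Example usage (requires a running instance):
#   results = fetch_results(saved_filter_uri(1))
```

A script like this could run on a schedule and forward the parsed results to an analytics platform.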

If you have any specific recommendations regarding what would be useful from an API perspective let me know. If you'd rather have some type of export at the end of a search let me know and I can think of some ways you might be able to easily do that...

ianshefferman commented 10 years ago

Thanks for the response.

For bulk queries, I know adding those would kind of break the current model. There are probably a few different ways they could be structured. Do you have more than one active developer working on this project at the moment, and do you know if you could add a feature like this in the near future?

If not, I could try my hand at adding that feature and submitting a pull request. How I would probably do it: add multi as a boolean field for Search, and break Search#perform_search into 2 separate methods, performing a new search for each line in the query if it's a multi search. The newline character could be converted to some other delimiter in the database, maybe |. Might also be useful to add an extra multi_index integer field for Result so you can see exactly which query is tied to that result for multi searches, and access it by something like search.query.split(delimiter)[multi_index].
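The proposal above could be sketched roughly as follows. This is plain Ruby using hashes rather than Scumblr's actual models, so everything except the `multi` and `multi_index` names and the `|` delimiter is hypothetical, and the block passed to `perform_search` stands in for whatever a search provider really does.

```ruby
DELIMITER = "|"

# Convert the multi-line query box into the delimited form stored in the database.
def encode_multi_query(text)
  text.lines.map(&:strip).reject(&:empty?).join(DELIMITER)
end

# A non-multi search is treated as a single sub-query.
def sub_queries(search)
  return [search[:query]] unless search[:multi]

  search[:query].split(DELIMITER)
end

# Run the given block once per sub-query and tag each result with the
# index of the sub-query that produced it (nil for ordinary searches).
def perform_search(search)
  sub_queries(search).each_with_index.flat_map do |query, index|
    yield(query).map do |result|
      result.merge(multi_index: search[:multi] ? index : nil)
    end
  end
end

# Example usage with a stand-in provider block:
#   search = { query: encode_multi_query("alice\nbob"), multi: true }
#   perform_search(search) { |q| [{ title: "hit for #{q}" }] }
```

One caveat with this design: a query that itself contains the delimiter would be split incorrectly, so the delimiter would need escaping or a character that cannot appear in queries.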

The "cleaner" way of doing it would be to just create one new search object for every line in the query, but that would pollute the searches page with a ton of rows, unfortunately.

For getting data out, the JSON endpoints for each page mostly satisfy our requirements.

ahoernecke commented 10 years ago

What exactly is your motivation for having multiple queries under one search? I'm thinking that a simpler way to accomplish most of what you've mentioned is:

  1. An easier way to create multiple searches at the same time that would be identical except for the query used. So a fast way to make 10 different Twitter searches for these 10 strings.
  2. An enhanced search list page (with some basic sorting/filtering)
  3. Applying tags to the searches as appropriate. (This can already be done)
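As a rough illustration of the first and third points, one bulk form submission could be expanded into one search definition per query line, with shared options and tags. The helper name, the hash keys, and the provider and tag strings below are all hypothetical and do not reflect Scumblr's actual schema.

```ruby
# Hypothetical helper: turn one provider plus a block of query lines into
# one search definition per distinct, non-empty line.
def expand_bulk_searches(provider:, queries_text:, tags: [], options: {})
  queries = queries_text.lines.map(&:strip).reject(&:empty?).uniq

  queries.map do |query|
    {
      name: "#{provider}: #{query}",
      provider: provider,
      query: query,
      tags: tags.dup,       # each search gets its own copy of the shared tags
      options: options.dup  # and of the shared provider options
    }
  end
end

# Example usage:
#   expand_bulk_searches(provider: "TwitterSearch",
#                        queries_text: "foo\nbar",
#                        tags: ["watchlist"])
```

Because each line becomes an ordinary search, nothing in the existing one-query-per-search model has to change; the cost is more rows on the search list page, which is where sorting and filtering would help.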

The other way I could imagine this being done "easily", although not necessarily the most elegant solution, would be overloading the search provider(s) and having the search provider itself split the query and run multiple searches.

ianshefferman commented 10 years ago

What you suggested is definitely viable and will probably work well for what we want as long as the search page allows easy filtering. Could also automatically add a default "bulk" tag to any bulk search that is made.

It's not necessary for us to have multiple queries under one search, I was just concerned about viewing and editing searches being a mess after doing a bulk search that may look for 50 or more keywords. An enhanced search page would do the job. I also recommend showing the search tags next to each search on the search list page.

ahoernecke commented 10 years ago

Sounds doable, thanks!

AtJofo commented 9 years ago

Figured I'd add my two cents to this issue rather than opening another, since it falls in line with the "bulk" search option. My use case (security company with clients) is likely atypical compared with a user tracking a single organization. Ideally, I'd like to have a few different lists of items as search criteria, and only trigger an alert if some degree of union is met. For example:

Company/Acronym list: Bank of Madeup, BoMu; Netflix, NFLX; Amalgamated Widgets Inc., AM, AMI; etc

"Attack" keyword list: DDoS, hack, own, owned, pwn, pwned, loic, deface, defaced, etc

Attack Group: YourAnonGlobal, YourAnonNews, UG, TheRedHack, etc

The unions that I'd care about would be: Company + Keyword, Company + Group, and Company + Keyword + Group, with the last one being of highest confidence.
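A minimal sketch of that matching logic, assuming the text of each result is available as a string. The term lists are abbreviated from the examples above, the function names and returned labels are made up for illustration, and matching is whole-word and case-insensitive, which is one possible choice rather than a requirement stated in this thread.

```ruby
COMPANIES = ["Bank of Madeup", "BoMu", "Netflix", "NFLX"].freeze
KEYWORDS  = %w[DDoS hack pwn pwned deface defaced].freeze
GROUPS    = %w[YourAnonGlobal YourAnonNews TheRedHack].freeze

# True if any term appears in the text as a whole word (case-insensitive).
def mentions_any?(text, terms)
  terms.any? do |term|
    text.match?(/(?<!\w)#{Regexp.escape(term)}(?!\w)/i)
  end
end

# Returns a label for the union that matched, or nil when no company is
# mentioned or the company appears without a keyword or group.
def classify(text)
  return nil unless mentions_any?(text, COMPANIES)

  keyword = mentions_any?(text, KEYWORDS)
  group   = mentions_any?(text, GROUPS)

  if keyword && group
    :company_keyword_group # described above as the highest-confidence case
  elsif keyword
    :company_keyword
  elsif group
    :company_group
  end
end
```

Whole-word matching is why the keyword list carries variants such as "pwn" and "pwned": a term like "hack" will not match inside "TheRedHack" or "hacked".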

We're achieving this right now by following the various Twitter accounts and scraping the tweets with in-house code, but it's not very pretty.

ahoernecke commented 9 years ago

Hi @AtJofo,

I think this is a somewhat unique use-case. It would probably be relatively easy to build a custom search provider to do this though: https://github.com/Netflix/Scumblr/wiki/Extending-Scumblr#search-providers

Andy

bensmith83 commented 9 years ago

I'd second this use case. Would be very useful to be able to see unions of hits from various keyword lists.