csirtgadgets / massive-octo-spice

DEPRECATED - USE v3 (bearded-avenger)
https://github.com/csirtgadgets/bearded-avenger-deploymentkit/wiki
GNU Lesser General Public License v3.0

enhancements #481

Closed Nibor62 closed 7 years ago

Nibor62 commented 7 years ago

Hi (again :) ),

I am seriously thinking about using CIF for a big project that will create filtering rules based on tags. However, there are some points that, I think, are missing or could be improved. I am posting this here because I am still experimenting with CIFv2, but the final goal would be to move to v3 and therefore to port any enhancements there.

Can you give me your thoughts on these?

wesyoung commented 7 years ago

have you read and tried this?

https://github.com/csirtgadgets/massive-octo-spice/wiki/where-do-i-start-feeds

Nibor62 commented 7 years ago

I have been reading those docs for a while, and I am currently using the default cif client (which seems to be the Perl SDK).

Nibor62 commented 7 years ago

I was maybe a bit unclear. I think it would be useful to be able to derive tags from the source's own data. For example, AlienVault offers more precise data about the kind of malicious activity of a host than just 'suspicious'. It would then be nice to be able to map these source-specific tag names onto a unified list (the CIF default tag list, for example). It could be an optional block like this added to the source's YAML configuration:

datamap:
  fieldNameToMap:
    valueFromTheSource: valueToMap

which could give for Alienvault source:

datamap:
  tags:
    'Scanning Host': scanner
    'C&C': botnet
    'Malware Domain': malware
    'Malware distribution': malware
    ...
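
For illustration, here is a rough Python sketch of how such a datamap could be applied to a parsed row. The helper and field names are hypothetical (they are not part of CIF or csirtg-smrt); it is only meant to show the mapping behaviour I have in mind:

# Hypothetical sketch of the proposed 'datamap' -- not part of CIF/csirtg-smrt.
# It remaps provider-specific tag values onto a unified tag list and keeps
# unknown values as-is so nothing is silently dropped.

DATAMAP = {
    'tags': {
        'Scanning Host': 'scanner',
        'C&C': 'botnet',
        'Malware Domain': 'malware',
        'Malware distribution': 'malware',
    }
}

def apply_datamap(record, datamap=DATAMAP):
    mapped = dict(record)
    for field, value_map in datamap.items():
        values = mapped.get(field, [])
        if isinstance(values, str):
            values = [values]
        mapped[field] = [value_map.get(v, v) for v in values]
    return mapped

# Example: an AlienVault-style row carrying several raw tags.
row = {'indicator': '192.0.2.1', 'tags': ['Scanning Host', 'C&C']}
print(apply_datamap(row))
# {'indicator': '192.0.2.1', 'tags': ['scanner', 'botnet']}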
wesyoung commented 7 years ago

i see what you mean. traditionally what we've done is to "make multiple passes" at each of those tags and do the mapping manually, mostly because each of those tags means something different, and sometimes their tag indicates they don't really know what they're talking about and thus we have to lower the confidence a little bit for each tag.

we've tended to err on the side of "more duplicate config" rather than "trying to make the config too complex to understand", but i'm not opposed to testing out the idea, with those caveats in mind.

check out:

https://github.com/csirtgadgets/bearded-avenger/blob/master/rules/default/bambenek.yml

as an example of "multiple passes, changing the confidence and tags"
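
roughly, the idea in python terms (the pass definitions below are only illustrative, not the actual csirtg-smrt rule format; the bambenek.yml above is the real reference):

# Illustration only: each "pass" reads the same source and stamps its own
# tags/confidence onto the rows it matches. Not the real rule format.
PASSES = [
    {'match': 'Scanning Host', 'tags': ['scanner'], 'confidence': 65},
    {'match': 'C&C', 'tags': ['botnet'], 'confidence': 85},
    {'match': 'Malware Domain', 'tags': ['malware'], 'confidence': 75},
]

def run_passes(rows, passes=PASSES):
    out = []
    for p in passes:
        for row in rows:
            if p['match'] in row['raw']:
                out.append({'indicator': row['indicator'],
                            'tags': p['tags'],
                            'confidence': p['confidence']})
    return out

rows = [{'indicator': '192.0.2.1', 'raw': 'Scanning Host;C&C'}]
print(run_passes(rows))
# two records for the same indicator, one per matching pass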

if you want to test that idea out; i'd do it here first:

https://github.com/csirtgadgets/csirtg-smrt-py

since that's getting closer to "beta". then if there's still a desire to do this sorta thing in v2, it can be back-ported.

also; the '--feed' flag on the client is worth messing around with. it de-dups, applies a whitelist, and many of our users use this to push [large] feeds directly into their network infrastructure for detection and blocking (it's the reason we built CIF in the first place).

Nibor62 commented 7 years ago

True enough, it may make the config file bloated, but on the other hand it would let you provide default values or parse multiple values in the same field (typically AlienVault, which sometimes gives multiple tags for a single entry) and avoid configuration redundancy.

I did a POC which works for AlienVault; however, it is in v2. I can send you a patch or a PR in a branch. If it seems OK, I could try to port it to v3.

I have already tested the '--feed' flag. My only issue is with its aggregation/deduplication part, as what I would like is a feed of IPs but with all their tags (regardless of which provider issued them). I thought about doing the aggregation on my side, but I would need a way to "whitelist filter" a query. Since for now that is only done inside the '--feed' path, I think it could be interesting to add a '--whitelist' option so "whitelist filtering" can be used anywhere it's needed.

wesyoung commented 7 years ago

yea, send the PR we can bat it around.

re: tags.. ah. you want "a tag rollup" per indicator:

https://github.com/csirtgadgets/bearded-avenger/issues/247

which i'm not so sure about yet (at least as the default). the problem with that is: not all tags carry the same confidence level. so while you might have one with a tag of "phishing" at confidence "85" and another (same indicator, different provider) with a tag of "scanner" and "confidence: 65", things can get misleading.

i understand why some would want to do this sort of thing, but in my experience i don't think it's good practice. instead of relying on math and probabilities for "why we blocked a thing", we get into more of an echo chamber problem: "we blocked it cause ... fake news!" (look! lots of tags! bad!), which can cause more problems and lead to a false sense of "confidence" when used improperly.

but that's also why we logged it as something to think about in v3, as "provide a 'tag rollup' field in the api", so you can choose if that's something you want.

Nibor62 commented 7 years ago

Eeyup, it is a rollup. In my idea, I was thinking of filtering by confidence before doing the rollup, so you keep control over what you retrieve. Tags mostly act as classifiers (as an IP may be doing both phishing and scanning); it is not the goal at all to filter on the number of tags (that doesn't look like a really good idea).
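
Something like this minimal sketch, assuming records shaped roughly like CIF results ({'indicator': ..., 'tags': [...], 'confidence': ...}), which is an assumption on my side:

# Minimal sketch of a confidence-filtered tag rollup (illustration only).
from collections import defaultdict

def rollup_tags(records, min_confidence=65):
    # Union all tags per indicator, keeping only records above a confidence floor.
    rollup = defaultdict(set)
    for rec in records:
        if rec.get('confidence', 0) < min_confidence:
            continue
        rollup[rec['indicator']].update(rec.get('tags', []))
    return {indicator: sorted(tags) for indicator, tags in rollup.items()}

records = [
    {'indicator': '192.0.2.1', 'tags': ['phishing'], 'confidence': 85},
    {'indicator': '192.0.2.1', 'tags': ['scanner'], 'confidence': 65},
    {'indicator': '192.0.2.1', 'tags': ['botnet'], 'confidence': 40},  # filtered out
]
print(rollup_tags(records))
# {'192.0.2.1': ['phishing', 'scanner']}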

wesyoung commented 7 years ago

the longer term goal: we're building some machine learning hunters with scikit-learn, numpy, etc. to help create new feeds of stuff with higher confidence as [at least my] long-term solution around this. to me, tags are just tags of what we thought it was doing, confidence is the confidence we have in that tag, and based on timestamps that stuff decays over time.

longer term is around probability and time decay, which i think is much more accurate. it's not to suggest that your thinking in this is wrong, just that i have a slightly different bent on the problem that i'm working towards [and why you might not see a lot of dev cycles around a simple roll-up].
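
as a rough sketch of what i mean by decay (an exponential half-life is just one illustration here, not the actual algorithm we use):

# Illustration of time-decayed confidence using an exponential half-life.
# The half-life value and the formula are assumptions, not CIF's real math.
import math
from datetime import datetime, timezone

def decayed_confidence(confidence, reporttime, half_life_days=30.0, now=None):
    now = now or datetime.now(timezone.utc)
    age_days = (now - reporttime).total_seconds() / 86400.0
    return confidence * math.exp(-math.log(2) * age_days / half_life_days)

seen = datetime(2017, 1, 1, tzinfo=timezone.utc)
print(round(decayed_confidence(85, seen, now=datetime(2017, 1, 31, tzinfo=timezone.utc)), 1))
# ~42.5 after one 30-day half-life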

at scale, i don't really care all that much why something is suspicious, just that, if the math checks out, my infra blocks, throttles, whatever. as long as my models are 'correct', tags almost mean nothing to me (other than to convey some semblance of what the data provider thought they observed, but even then, we've seen that it can be somewhat misleading).

hope that makes more sense. not opposed to implementing the idea (in an un-intrusive way), just not a goal of mine given what we can do with machine learning these days...

Nibor62 commented 7 years ago

That makes sense, but it also makes me quite curious (the domain is fascinating). What would you like to implement with machine learning? Some way to derive higher confidence from the data?

Also, would you be opposed to a '--whitelist' tag, and a '--callback' option to execute a command at the end of an smrt update?

wesyoung commented 7 years ago

re ML: yes. started by trying to ID which domains/URLs look phishy based on some simple models. IPs are a bit tougher, but once you get a bit of the ASN data in there, less hard.

there is already a whitelist tag (see the alexa examples and why you'd probably want to turn cif-worker back on, as well as enable metadata to help generate those whitelists ;)).

https://github.com/csirtgadgets/massive-octo-spice/wiki/Whitelist

--callback, i'm not really opposed to it.. if it's useful.

Nibor62 commented 7 years ago

What you want to achieve reminds me a bit of Notos.

I know there is the whitelist; I was more thinking of providing a way to apply it outside the '--feed' option, as you may want to do whitelist-filtered requests/searches.

As for the callback, I think it can be handy when you want to notify client apps that data has been updated.

wesyoung commented 7 years ago

re: notos- yes. similar ideas. not a new concept by any means, just operationalizing it in the open source space.

whitelists: ah. yea, i guess that's never come up because, outside of --feed, we've usually wanted that whitelisted data to always show up in a search. it's about forensics and understanding everything about an indicator's history (esp if it appeared on a google network). otherwise you might be making a poor decision about a piece of data because you intentionally blinded yourself. :)

i could see where it may be useful as a flag though in some narrow situations.. similar to how --feed works, just add a --filter-whitelist flag that does what --feed does but applies that whitelist to your --search results.. easy enough in the sdk.

Nibor62 commented 7 years ago

Not sure I get what you mean by "easy enough in the SDK" (I'm not a native English speaker). Is it already feasible?

wesyoung commented 7 years ago

check out:

https://github.com/csirtgadgets/cif-sdk-py/blob/master/cifsdk/client.py#L455

it's a bit messy; something we're building into the v3 REST api:

https://github.com/csirtgadgets/bearded-avenger/blob/3.0.0a17/cif/httpd/views/feed/__init__.py#L47

if you were to follow the same logic for a normal -q with an additional flag, you'd get a similar effect. (it basically pulls the feed, then pulls the whitelist version of that feed, aggregates and strips things out of the data-set that are in the whitelist).

so the "pattern" exists in both the python SDK as well as the v3 REST API, if you wanted to apply that same pattern else-ware you at-least have some guidance.

Nibor62 commented 7 years ago

Thanks for the explanations.

Which leads me to my last issue: if a source goes offline (for hours or days), is there any way to prevent data from that source from disappearing from the CIF feeds (without overextending the feed period)?

Nibor62 commented 7 years ago

And also, do you describe anywhere how your confidence algorithm works (like when you lower it and by how much, ...)?

Nibor62 commented 7 years ago

Hi, any update here?

wesyoung commented 7 years ago

re: feeds, no. since it's based on lasttime/reporttime, you need to extend the period. for --feed it's generally 30 days, so you're usually OK.

re: confidence

we're starting here https://github.com/csirtgadgets/bearded-avenger-deploymentkit/wiki/Where-do-I-start-Confidence

but we haven't started documenting the newer models we're working on (with python sklearn, etc.). we're gearing towards the more obvious random forest models with a target of ~90%, which is really only a data problem (how many 'good' things can you throw at it vs bad). those models are pretty well documented [generally]. the feature-sets are less so, but you should see those start bleeding in as we tune the models for things like entropy, distance from other urls/domains, etc.
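
as a toy sketch of that kind of model (the feature-set here -- length, character entropy, digit ratio -- and the tiny example domains are placeholders for illustration, not our actual features or training data):

# Toy random-forest sketch with scikit-learn; placeholder features and data.
import math
from collections import Counter
from sklearn.ensemble import RandomForestClassifier

def entropy(s):
    # Shannon entropy of the characters in a string.
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def features(domain):
    digits = sum(ch.isdigit() for ch in domain)
    return [len(domain), entropy(domain), digits / len(domain)]

# Placeholder examples: 1 = suspicious, 0 = benign.
domains = ['paypa1-login-verify.example', 'example.org', 'xk3j9q2z8v.example', 'wikipedia.org']
labels = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit([features(d) for d in domains], labels)
print(clf.predict([features('secure-upd4te-account.example')]))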

the ip-address confidence algos are a bit different, something we haven't dug too far into yet.