certtools / intelmq

IntelMQ is a solution for IT security teams for collecting and processing security feeds using a message queuing protocol.
https://docs.intelmq.org/latest/
GNU Affero General Public License v3.0
949 stars 296 forks source link

fast "Malware Name Mapping" (accelerate modify bot?) #1501

Open bernhardreiter opened 4 years ago

bernhardreiter commented 4 years ago

https://github.com/certtools/malware_name_mapping can be used to map malware names with the IntelMQ modify bot. However the 500 regexp rules are tried sequentially and this is quite resource intensive.

(One installlation uses a handful of parallel modify bots to spread the load.)

Can this be done faster? (To prove that we've reached an improvement, we should measure the throughput for a relevant test case before hand.)

Ideas

Check if the current modify bot can be made faster.

Python's regular expression implementation is slow for more complicated cases. Are the regexp precompiled once?

What does profiling tell us where the time is spend?

Use a different / improved algorithm to do the mapping

Maybe not using regular expressions.

aaronkaplan commented 4 years ago

You can take a look if the regexpes may be combined (that's a well researched problem in theoretical CS / compiler construction). Think: DFA.

In general, matching by regexpes can be made very fast via proper DFAs.

The second option would be parallelisation of course.

update: the more I think about it, the more I think a DFA approach should be tried first. i.e. yes, did we ever try to compile the regexpes to start with?

ghost commented 4 years ago

IMO also more generic performance improvements should be taken into consideration.

You can take a look if the regexpes may be combined (that's a well researched problem in theoretical CS / compiler construction). Think: DFA.

I guess this will impact readability of the rules negatively. Only 3% of the rules result in the same family name:

$ grep identifier /opt/intelmq/var/lib/bots/modify/malware-family-names.conf | sort | uniq -c | sort -n | grep '   1    ' | wc -l
1614
$ grep identifier /opt/intelmq/var/lib/bots/modify/malware-family-names.conf | sort | uniq -c | sort -n | grep -v '   1    ' | wc -l
50

Are the regexp precompiled once?

i.e. yes, did we ever try to compile the regexpes to start with?

Since version 2.0.1 it's there: https://github.com/certtools/intelmq/blob/ec0e911834a71c2c02c2c25ca0161758cf5b6a29/intelmq/bots/experts/modify/expert.py#L58-L64

What could work for this case is to generate the possible matching values for the regular expression like done here: https://github.com/certtools/malware_name_mapping/blob/4483d7daddefbfa78fd0480974eae352cf57dca9/scripts/tools.py#L20-L27 And then do lookups with a dictionary (hash table lookup). That could be faster.

bernhardreiter commented 4 years ago

@aaronkaplan @wagner-certat thanks for your hints.

I guess one of the next steps would be to to measure how fast the bot is and see how much the compilation of regexps already accelerates it.

bernhardreiter commented 4 years ago

When looking at the patterns another algorithmic approach seems good: We could use a prefix tree (also called trie) for the first two characters and then match the regular expressions. For this we could expand some of the patterns into several ones so that the first two characters are fixed. And then we use a jump table (or a python list) to find the remaining regular expression to match.

Using a fast DFA algorithm does probably not help us much as we have many patterns, and we need to know which one matches.

Calculating all possible match strings and use a dictionary is probably fast, in my guess using a trie or even a radix trie has to potential to be even a bit faster. ;)

All new algorithmic approaches basically mean that a specific bot is constructed.