Open danielmoder opened 1 year ago
$mz at 0 would be transformed into a representation of uint16(0) == 0x5a4d. This wouldn't be the case for, e.g. $mz in (0..100), IIRC.
A quick trial shows a significant difference in runtime between "$mz at 0" and "uint16(0) == 0x5a4d", which suggests this isn't the case (at least for v4.2.3). The data is obviously not representative of normal files, but it highlights that the former still searches the whole file for "MZ", which is what I thought the rule was meant to show as the biggest performance hit.
Does this align with your understanding as well? I hadn't heard anything about this sort of post-processing/transformation, but I'm curious to hear more if you remember where you saw it.
import yara
yara.YARA_VERSION
> '4.2.3'
# Not realistic, just meant to highlight differences
sample = "MZ" * 10000
ruleset_fixed_string = """
rule fixed_string
{
strings:
$mz = "MZ"
$foo = "foo"
condition:
$mz at 0 and $foo
}
"""
ruleset_uint16 = """
rule uint16_string
{
strings:
$foo = "foo"
condition:
uint16(0) == 0x5A4D and $foo
}
"""
rule_fixed_string = yara.compile(source=ruleset_fixed_string)
rule_uint16 = yara.compile(source=ruleset_uint16)
%timeit matches = rule_fixed_string.match(data=sample)
> 76.1 µs ± 1.34 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit matches = rule_uint16.match(data=sample)
> 25.8 µs ± 920 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
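To make the suspected difference concrete, here is a rough pure-Python analogy of the two strategies. It is not YARA's implementation: the function names are made up for illustration, and it simply assumes that a string pattern is searched across the whole buffer before "at 0" is evaluated, whereas uint16(0) reads two bytes at a fixed offset. It also shows where 0x5A4D comes from: uint16 reads little-endian, so the bytes 4D 5A ("MZ") become 0x5A4D.

```python
import struct


def matches_at_zero_by_search(data: bytes, needle: bytes) -> bool:
    """Analogy for `$mz at 0`: collect every occurrence, then test for offset 0."""
    offsets = []
    start = data.find(needle)
    while start != -1:
        offsets.append(start)
        start = data.find(needle, start + 1)
    return 0 in offsets


def matches_at_zero_by_read(data: bytes) -> bool:
    """Analogy for `uint16(0) == 0x5A4D`: read two bytes at offset 0 only."""
    if len(data) < 2:
        return False
    (value,) = struct.unpack_from("<H", data, 0)  # little-endian uint16
    return value == 0x5A4D


# Same unrealistic sample as above, as bytes
sample = b"MZ" * 10000

by_search = matches_at_zero_by_search(sample, b"MZ")
by_read = matches_at_zero_by_read(sample)
print(by_search, by_read)
```

Both functions agree on the result, but the first walks the entire buffer while the second touches only two bytes, which is the kind of gap the timings above seem to point at.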
Someone once told me that, but I don't remember who it was.
The performance impact also depends on the scanned data, and "performance impact" has many shades: additional CPU cycles, additional memory usage ...
Using a short atom like "MZ" could have less impact than { 00 00 00 00 00 00 00 }.
In our tests $regex2 had an impressive performance impact.
BTW: we also found out just recently that the malloc() used in libmusl doesn't work well with YARA's PE module, and its use causes a lot of overhead. Using our own malloc() reduced scan duration by 30-30%.
What I mean is that measurements depend on many different input variables.
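As a rough illustration of how much the input alone can move a measurement, here is a minimal timing sketch. The scan function is a stand-in (a plain bytes.count), not YARA, and the input names and sizes are arbitrary assumptions; to measure real rules you would pass something like a compiled rule's match(data=...) call instead.

```python
import timeit


def time_scan(scan, data, repeats=5, number=200):
    """Return the best per-call time in seconds for scan(data)."""
    runs = timeit.repeat(lambda: scan(data), repeat=repeats, number=number)
    return min(runs) / number


def stand_in_scan(data: bytes) -> int:
    """Hypothetical stand-in for a real scan: count occurrences of b"MZ"."""
    return data.count(b"MZ")


# Hypothetical inputs of equal size but very different content
inputs = {
    "all MZ": b"MZ" * 10000,
    "all zeros": b"\x00" * 20000,
    "no MZ": b"AB" * 10000,
}

results = {name: time_scan(stand_in_scan, data) for name, data in inputs.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} µs per scan")
```

The absolute numbers mean little; the point is to compare the same scan across several inputs before drawing conclusions from any single one.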
On line 25 of bad_rule.yar, the comment '// short atom and not fixed with e.g. "$mz at 0" in the condition' is confusing, as I was under the impression that the engine would still search the whole file for that string regardless of any fixed location specified in the condition. Can you clarify why adding the condition "$mz at 0" would improve performance?