VirusTotal / yara

The pattern matching swiss knife
https://virustotal.github.io/yara/
BSD 3-Clause "New" or "Revised" License

YARA Scan Performance for Large Files is poor - Challenging assumptions #1665

gmanlan commented 2 years ago

I have spent considerable time understanding how YARA works, including how the Aho-Corasick automaton is created and traversed, as well as how the built-in modules (PE, DOTNET, etc.) work and how the YARA virtual machine (YVM) is involved. I have also reviewed several similar issues reported by other users, such as the popular discussion that is often referred to: https://github.com/VirusTotal/yara/issues/1415.

And still, I'm not convinced YARA should settle for poor performance when scanning large files.

I have seen other users trying to convey this message, but perhaps because a concrete use case was missing, they were dismissed.

Here is one use case that, in my experience and honest opinion, is quite prevalent and not a corner case:

You have thousands of enterprise systems where you want to silently deploy and run YARA to spot malware, vulnerabilities, or any other kind of artifact/IoC you are interested in. I believe this is, so far, the main scenario YARA was created for.

Over the years, the SOC team of such an organization has created hundreds of thousands of rules, all of which are equally important. The problem arises when you run the scan over systems that contain thousands of large files. Nowadays, it's reasonable to expect users to have large files such as bulky installers, big-data CSV files, high-resolution photos, heavy CAD files, or virtually anything beyond 100MB, right?

The problem is that when YARA faces a file larger than 100MB with a large set of rules (e.g. 100k ~ 300k), the individual file scan time is between 10 and 60 seconds (and this is measured on an i9 processor, so imagine an entry-level system). If the file is 1.5GB, the scan can take several minutes. Multiply this by the number of files a 1TB system may have and you quickly realize how slow the process is, which may defeat time-sensitive tasks (especially around detecting fresh vulnerabilities or active attacks).

It is also worth noting that every optimization trick has already been applied to the large set of rules (e.g., https://github.com/Neo23x0/YARA-Performance-Guidelines/), and this doesn't yield any significant performance difference, mainly because the problem is not the rules.

It is also important to mention that the SOC team is very capable of crafting super-specific rules that short-circuit as soon as possible. However, this doesn't seem to matter to YARA, because the way it works introduces performance penalties much bigger than any possible rule short-circuit optimization.

Consider for a moment a rule like the following one:

rule example {
  strings:
     $patternA = { 50 51 68 43 C4 AA 79 }
  condition:
     filesize < 100KB and $patternA in (0..0xff)
}

You may think that because the rule checks the filesize and the existence of a few bytes (one atom) at the very beginning of the file, it will be quite fast and exit right away most of the time, whenever a file fails either condition (you can argue about which condition goes first, but it doesn't matter in this example). However, what actually happens is that YARA still takes a minute or more to scan a large file (think 1GB), so your rule optimization efforts are useless here.
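To make this concrete, here is a minimal timing sketch (not part of the original report) using the yara-python bindings; the file path is a placeholder and the rule name was added so the rule compiles. It times the rule above against a large file, illustrating the reported behavior that elapsed time grows with file size even though the condition could short-circuit:

import time
import yara  # pip install yara-python

# The short-circuit rule from above (rule name added so it compiles).
rules = yara.compile(source='''
rule example {
  strings:
     $patternA = { 50 51 68 43 C4 AA 79 }
  condition:
     filesize < 100KB and $patternA in (0..0xff)
}
''')

path = '/path/to/large_file.bin'  # placeholder: any file well over 100MB

start = time.perf_counter()
matches = rules.match(path)
elapsed = time.perf_counter() - start

# The pattern-search pass runs before conditions are evaluated, so the
# elapsed time tracks the file size despite the filesize guard.
print(f'{len(matches)} matches in {elapsed:.2f}s')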

This happens (as far as I understand) for several reasons, in order of importance:

1) The file is always scanned entirely, regardless of whether any rule can potentially match it. If the file is large, that's a lot of CPU-bound time. The file is fully scanned because the automaton wants to find all the matching atoms from all the rules in a single pass. Up to a point this is understandable: if the file is small, the cost of scanning it entirely is lower than the gain introduced by the Aho-Corasick search mechanism. But if the file is large, and considering that MOST OF THE TIME you will NOT match any rule (that's how the real world behaves; you rarely match on malware, right?), you are wasting precious time going over GBs of data.

2) If you imported a module such as the PE module, that module will also parse the entire file regardless of how much of it your rules actually need. Again, if you have several rules using multiple aspects of the PE module, this makes sense. However, if the PE is large (150MB) and all your rules are "decent enough" to check basic properties present in the header as their first condition (e.g., pe.is_64bit(), pe.is_dll(), etc.), then you are likely parsing the entire file unnecessarily most of the time (e.g., if I never use imports in any rule, why do I pay the cost of parsing imports for every single file?). If your directory contains 100 EXE files and 1 DLL, you will do unnecessary work on 99 EXE files if your rules are only looking at 64-bit DLLs. Maybe a lazy-loading scheme would be better?

3) If you imported two modules (e.g. PE and DOTNET), then because modules can't reuse other modules (or data from other modules), these two modules will parse the entire file twice, independently. This essentially duplicates the amount of work. In reality, you would expect the DOTNET module to be able to reuse or extend the PE module, for example. The more modules you add, the more performance degradation you perceive (a rough timing harness probing this point follows after this list).
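Point 3 is easy to probe from the outside. The sketch below is my own illustration, not a rigorous benchmark: the path is a placeholder and the conditions are deliberately trivial, so any timing difference should reflect module parsing cost rather than rule evaluation:

import time
import yara  # pip install yara-python

def timed_scan(source, path):
    # Compile outside the timed region so only the scan is measured.
    rules = yara.compile(source=source)
    start = time.perf_counter()
    rules.match(path)
    return time.perf_counter() - start

# Trivial conditions: almost all of the measured cost is file I/O and
# module parsing, not rule evaluation.
PE_ONLY = '''
import "pe"
rule t { condition: pe.is_dll() }
'''

PE_AND_DOTNET = '''
import "pe"
import "dotnet"
rule t { condition: pe.is_dll() and dotnet.is_dotnet == 0 }
'''

path = '/path/to/large.exe'  # placeholder: any large PE file
print('pe only     :', timed_scan(PE_ONLY, path))
print('pe + dotnet :', timed_scan(PE_AND_DOTNET, path))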

At this point you may argue that there will be a rule that eventually requires the full file scan anyway (justification @ issue/1415). E.g. a rule like the following one:

rule unoptimized {
  strings:
     $stringA = "YaraIsGreat"
     $stringB = "YaraWillBeAmazing"
  condition:
     all of ($string*)
}

And because of that, it makes no difference if you exit right away (without scanning the entire file), because the "not optimized" rule above penalizes you anyway, as it requires the full pass. While this is true and a valid argument, it also means the rule developer never gets a chance to optimize, even when he/she wants to!

If I don't care about performance and I want to pay the price of running a rule that simply checks for 2 strings anywhere in a 10GB file, that's my loss. But if I want to do better and optimize my rules to short-circuit as soon as possible so that my performance is good, YARA won't let me! And I think this is where you can see how important this is in real-world scenarios.

Unfortunately I haven't been able to devise an easy solution to this problem yet, and it most likely requires fundamental changes to YARA. However, I'm interested in knowing:

1) Is this a valid concern and use case that the community acknowledges?
2) If so, are there any workarounds or not-so-breaking changes we could implement (either within YARA or outside YARA) to help improve the performance on large files?

Kudos to the authors and contributors for the great work on YARA; it's a very powerful creation. 🥇

plusvic commented 2 years ago

I understand your concerns; they are completely valid. But I would like to understand your case better. It seems like you have a lot of rules (e.g. 100k ~ 300k), but all of them are carefully crafted so that they don't need to fully scan such large files. What technique are you using for that? Do you include a filesize < something and ... condition in all of your rules? Do you enforce it somehow?

If that's the case, the approach could be implementing logic that inspects a rule's condition and tells us whether it has some limit on filesize, and what that limit is (assuming the limit is infinite for rules that don't use filesize < something). If all the rules are limited like that, we can completely discard any file that is larger than the highest limit found in the rules.

The issue is that this optimization only works as long as all your rules specify a maximum file size in the condition; if even one rule doesn't, it won't work at all.
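For what it's worth, that inspection can be approximated outside YARA today with a crude source-level scan. The sketch below is hypothetical: it only understands the literal filesize < N convention, and it deliberately gives up (returning None) if any rule appears to lack a bound, mirroring the caveat above:

import os
import re
import yara  # pip install yara-python

SIZE_RE = re.compile(r'filesize\s*<\s*(\d+)\s*(KB|MB)?')
UNITS = {'': 1, 'KB': 1024, 'MB': 1024 ** 2}

def max_filesize_bound(source):
    # Crude heuristic: expects one literal 'filesize < N' clause per rule
    # and returns None (no safe skipping) if any rule appears to lack one.
    bounds = [int(n) * UNITS[unit] for n, unit in SIZE_RE.findall(source)]
    rule_count = len(re.findall(r'\brule\s+\w+', source))
    return max(bounds) if bounds and len(bounds) >= rule_count else None

with open('rules.yar') as f:  # placeholder ruleset path
    source = f.read()
rules = yara.compile(source=source)
limit = max_filesize_bound(source)

def scan(path):
    # If every rule bounds filesize, files above the largest bound can be
    # discarded without scanning them at all.
    if limit is not None and os.path.getsize(path) > limit:
        return []
    return rules.match(path)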

gmanlan commented 2 years ago

Thanks for providing your input!

To elaborate on how rules are being crafted, it is really an assorted collection of conditions designed to "exit as soon as possible". This includes, but is not limited to:

- file type checks based on a few magic bytes
- specific bytes at a known address (e.g. uint16be(0x0) == 0xXXXX)
- filesize bounds
- basic PE properties that live in the header (e.g. pe.is_dll(), pe.is_64bit())

As you can see, each one of these (placed at the beginning of conditions for a short-circuit effect) virtually eliminates the need to unnecessarily parse an entire file most of the time. It is still OK if one or two files are completely parsed eventually, as long as the majority of files are skipped as soon as possible. The goal (in the enterprise example explained above) is to make YARA run decently on high-volume scans, not perfectly on each individual file. So when you check the file type, that usually requires a very simple inspection of a few bytes (we can argue how robust that is, but most of the time it's OK). The bytes-at-address condition provides both the content and the location, so it doesn't require any file traversal. The filesize check also provides a nice scope, as it is rare that you would expect to find, e.g., a family of malware in a 20GB file. The PE properties are a nice way of narrowing down the set of potential targets, and since the PE format places most of these properties in, say, the initial 50KB or less, you can quickly parse just that portion of a 2GB file. (A hypothetical rule in this style is sketched below.)
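For concreteness, here is a hypothetical rule in that style (the rule name and string are invented for illustration): it front-loads magic bytes, a size cap, and a header-only PE property before any string test:

import yara  # pip install yara-python

rules = yara.compile(source='''
import "pe"

rule dll_marker_example {
  strings:
     $a = "suspicious_marker"  // hypothetical indicator
  condition:
     uint16(0) == 0x5A4D and   // 'MZ' magic: two bytes at offset 0
     filesize < 2MB and        // scope out bulky files immediately
     pe.is_dll() and           // header-only PE property
     $a                        // the string test comes last
}
''')

As the rest of this thread explains, though, this ordering only helps condition evaluation; the pattern-search pass still reads the whole file first.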

Regarding your question on enforcement: yes, we can enforce this. It is really nothing fancy, but in the same way the compiler warns you about inefficient rules, we can simply check whether a rule is as naive as "string1 and string2" and warn the author about the consequences of such a broad rule. I understand that if YARA were used for something like "finding emails in files", then none of these optimizations would apply and you would have to pay the penalty. But for the most popular use case of spotting malware/malicious activity/vulnerabilities, the scope is quite narrow and you would never want to go over all the files in the same way, which is why I feel something is missing for this case.

Again, to provide a super clear example: if the systems I'm scanning contain hundreds of CAD, PSD, CSV, PNG and even large EXE files, the penalty of scanning these is quite high and the return on investment is almost zero, because a CAD file will never be an object of interest for most or all of my rules. Even in the case of the large EXE file, since all the rules will have short-circuit checks that can be resolved by just reading a few bytes from the file, I would expect to quickly skip this file after 1 or 2 conditions (that are not string-based).

Following your suggestion of doing a pre-flight check on the filesize (it could be something else as well), do you envision that being done within YARA (core) or outside YARA, like a wrapper? Is "rule condition inspection" something one could interface with, or would that require modifying the YARA architecture?

plusvic commented 2 years ago

Now I have a little more context about your case, but I still can't come up with an easy solution that would guarantee fast scanning of large files.

The main issue here has to do with reading the whole file and searching for all patterns before conditions can be evaluated. As I've mentioned before, the reason for that decision is that in most cases (especially when you have large files and/or a large number of rules) it's better to search for all the patterns at once in a single pass. I think we both agree that this is a very important requirement: if you have a 10GB file, you don't want to read the whole thing twice, or even worse, read the whole file once per rule or pattern.

But this design comes with an important drawback: you can't prevent YARA from scanning the whole file by using predicates in your condition that don't rely on patterns, but on other things that can be evaluated right away without reading the whole file.

Now let's think about what we can do to solve that issue without violating the single-pass requirement. Suppose that you have these two rules:

rule foo {
  strings: 
     $a = "foo"
  condition:
     uint16be(0x0) == 0x0000 and $a
}

rule bar {
  strings: 
     $a = "bar"
  condition:
     uint16be(0x0) == 0x1111 and $a
}

The author of rule foo may be thinking "this should be fast; it should search for "foo" all over the file only if the file starts with 0x0000", and the author of bar may think the same thing. But a given file starts with either 0x0000 or 0x1111, not both, so what should YARA do in this case?

Both rules look like they should be fast when considered independently, because it's easy to tell which condition the file must meet. But when both rules are considered together, things start getting tricky. You may say: "well, it's not that complicated; if the file starts with 0x0000 or 0x1111, then you search for the strings "foo" and "bar"". That's easier said than done, because it implies that YARA would need to consider all the conditions together and somehow determine the global requirements for a file to be fully scanned or not. In this case YARA should be able to realize that the file must be fully scanned if uint16be(0x0) == 0x0000 or uint16be(0x0) == 0x1111. Now think about the same problem with 100K different and more complicated conditions. You can't decide whether to skip a file based on individual conditions; you need to take all the conditions into account at once and extract from them a "supercondition" that tells you whether it's safe to skip the file.
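Doing this by hand for the two toy rules shows what such a "supercondition" looks like as an external pre-check. The constants come from the example above; everything else in this sketch is my own illustration:

import struct
import yara  # pip install yara-python

rules = yara.compile(source='''
rule foo { strings: $a = "foo" condition: uint16be(0x0) == 0x0000 and $a }
rule bar { strings: $a = "bar" condition: uint16be(0x0) == 0x1111 and $a }
''')

# The hand-extracted "supercondition": a full scan is only worthwhile if
# the file starts with 0x0000 OR 0x1111.
HEADERS = {0x0000, 0x1111}

def scan(path):
    with open(path, 'rb') as f:
        head = f.read(2)
    if len(head) < 2 or struct.unpack('>H', head)[0] not in HEADERS:
        return []  # neither rule can match; skip the expensive pass
    return rules.match(path)

Extracting this by hand from two toy conditions is trivial; doing it automatically and soundly across 100K arbitrary conditions is the hard part described above.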

The bottom line is that every trade-off has its drawbacks, and having the best of all worlds is sometimes impossible, or comes at the price of incredibly high complexity.

gmanlan commented 2 years ago

Yes, the driving criteria behind the current YARA architecture are reasonable and make sense for the cases where rules are "simple" and files are "not too big". The problem is that because we need to support the use case of a rule saying "find string X anywhere in the file", we pay a huge penalty when scanning large files even with rules that are "smarter" than that. Both requirements are valid but unfortunately colliding.

If we could do two passes, I believe we would get a significant performance gain. It doesn't seem impossible, but I understand we may not be comfortable with the extra complexity.

If you do a pre-flight pass where you evaluate all the non-string/atom conditions first, and only do a second (full) pass if there are still rules alive (meaning their pre-flight conditions were satisfied), then you would have skipped ~50% of your files (statistically speaking, based on how the rules are crafted). I'm not that familiar with the details of the orchestration of the YARA execution workflow, but based on what I have seen, maybe it's feasible to simply evaluate all the rules first with some kind of "ignore" clause that bypasses anything that has to do with atoms. Even for a PoC we could, say, comment out / ignore any condition that is different from filesize or uint16be(0x0) == 0xXXXX as a starting point, and grow from there. If that is achievable, then when you evaluate all the rules (before scanning the full file) you will know where you stand: do you need to continue scanning the entire file (and re-evaluate), or was that enough and you can abort? (A rough sketch of this idea follows below.)
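A PoC along those lines can even be approximated outside YARA core today by compiling a second, stripped-down ruleset and only running the full ruleset when the pre-flight one matches. The example rule and its hand-stripped pre-flight counterpart below are mine, purely for illustration:

import yara  # pip install yara-python

# Full ruleset (one hypothetical rule for illustration).
full = yara.compile(source='''
rule example {
  strings:
     $a = "marker"
  condition:
     uint16be(0x0) == 0x1111 and filesize < 500MB and $a
}
''')

# Pre-flight version of the same rule: the string clause is stripped by
# hand, leaving only conditions that don't need the pattern-search pass.
preflight = yara.compile(source='''
rule example_preflight {
  condition:
     uint16be(0x0) == 0x1111 and filesize < 500MB
}
''')

def scan(path):
    # Pass 1: with no strings defined, the search phase has nothing to
    # look for, so the expectation is that this evaluates quickly.
    if not preflight.match(path):
        return []  # no rule survived the pre-flight: skip the file
    return full.match(path)  # pass 2: full scan, strings included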

Yes, you may still need to fully scan a few large files from time to time, but the goal is to skip as much as possible so that the overall scan time in a high-volume scenario is more scalable and applicable to real-world needs.

On a set of thousands of rules, the example would be:

File A (1MB) --> First pre-flight pass --> 560 rules are left --> Second (full) pass --> Match --> 18 ms
File B (500MB) --> First pre-flight pass --> No rules are left --> Skip the file --> 112 ms
File C (233MB) --> First pre-flight pass --> 90 rules are left --> Second (full) pass --> No match --> 1.5 seconds
File D (1.5GB) --> First pre-flight pass --> No rules are left --> Skip the file --> 120 ms

Total scan time: 1.75 seconds (today this actually takes 20~60 seconds, depending on hardware capabilities)

Again, I understand it's not a simple change, but I believe YARA will eventually need to face this challenge, especially with today's files getting bigger, not smaller.

plusvic commented 2 years ago

Something along those lines could be done. I've been thinking carefully about how to do it, and the easiest solution would be generating a separate code stream for YARA's virtual machine that evaluates the conditions before searching for the strings. The purpose of that evaluation is finding the rules that are either true or false independently of whether the patterns are found in the file. If all rules are either true or false in this first evaluation, we don't need to scan the file looking for patterns; but if there's at least one rule that can't be decided, we need to proceed with the pattern-searching phase and then execute the final code.

My first idea was executing the same code twice, before and after the searching phase, but this is not possible because in some cases we need to skip whole statements, like "for" loops. For example, if we have:

for all i in (0..#a) : ( ... )

We don't know the value of #a before searching for $a, and therefore the result of the whole loop is unknown. The loop statement must be considered undefined, as it's neither true nor false. But the full code has no way of skipping this loop.

This is the "easiest" solution, but it requires a lot of work. So far it seems possible, unless I find some ramifications that I'm not taking into account now.

wxsBSD commented 2 years ago

It is my understanding that global rules are evaluated after the scanning phase but before any non-global rules. What if we had a prescan rule modifier that would be evaluated before the scanning phase, and the scanning would only happen if that rule was true? The rule would look like a normal rule, but strings would not be allowed, and conditions could use modules so you can do things like filesize > 1000 and pe.is_dll(). If all of the prescan rules return true, then the scanning process will happen, followed by the normal rule evaluation process.

gmanlan commented 2 years ago

Thanks for the suggestions. That seems to be a feasible pathway. I guess the only change would be: "If at least 1 of the prescan rules returns true, then the scanning process will happen, followed by the normal rule evaluation process", or in other words, "If no prescan rule returns true, then abort".

wxsBSD commented 2 years ago

> Thanks for the suggestions. That seems to be a feasible pathway. I guess the only change would be: "If at least 1 of the prescan rules returns true, then the scanning process will happen, followed by the normal rule evaluation process", or in other words, "If no prescan rule returns true, then abort".

Yeah, your idea of "at least one" is probably better. Either way, it may be a feasible path forward. I'd be curious to hear what Victor thinks.

plusvic commented 2 years ago

That's an alternative, but the drawback is that existing rules won't benefit from the change. It also puts the burden on the user, who needs to understand how YARA works internally in order to fully comprehend what this feature is for.

I lean towards solving the issue in a more general way, by automatically generating one of those "prescan" conditions for each rule.