I had a single YARA file with more than 10,000 rules in it that was taking a very long time to process with the validator. I suspect this may also be affecting the AL YARA service's general ability to ingest new rule updates, and potentially even to scan files with large rulesets. Using 'alive_bar' (not in this PR) to track progress and troubleshoot, I zeroed in on the loop here:
Within this loop, I noticed that the 'scheme' and 'yara_config' elements of the YaraValidator class require their corresponding files to be read in every single time a rule is processed. This is pretty I/O intensive and is likely the root cause of the long validation times on large rule files. As you can see, it took ~22m on my Ubuntu VM.
time yara_validator big_rule_file.yar -v
... truncated for brevity ...
----------------------------------------------------------------------------
All .yara Rule files found have been passed through the CCCS Yara Validator:
Total Yara Rule Files to Analyze: 1
Total Valid CCCS Yara Rule Files: 0
Total Warning CCCS Yara Rule Files: 1
Total Invalid CCCS Yara Rule Files: 0
---------------------------------------------------------------------------
real 22m26.453s
user 20m37.731s
sys 0m4.802s
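To make the cost concrete, here is a minimal sketch of the pattern described above. It is not the actual validator code: `PerRuleValidator`, `validate_all`, `demo`, and the JSON config files are hypothetical stand-ins (JSON is used only to keep the example stdlib-only). It simply counts how many config reads happen when a validator is constructed inside the loop.

```python
import json
import os
import tempfile


class PerRuleValidator:
    """Hypothetical stand-in for a validator that loads its config in __init__."""

    reads = 0  # class-level counter of config file reads

    def __init__(self, scheme_path, config_path):
        self.scheme = self._load(scheme_path)
        self.yara_config = self._load(config_path)

    @classmethod
    def _load(cls, path):
        cls.reads += 1
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)

    def validate(self, rule):
        # A rule passes if every required field is present.
        return all(field in rule for field in self.scheme["required"])


def validate_all(rules, scheme_path, config_path):
    results = []
    for rule in rules:
        # A fresh validator per rule re-reads both files on every iteration.
        validator = PerRuleValidator(scheme_path, config_path)
        results.append(validator.validate(rule))
    return results


def demo(n_rules=1000):
    """Write two throwaway config files, validate n_rules rules, count reads."""
    with tempfile.TemporaryDirectory() as tmp:
        scheme_path = os.path.join(tmp, "scheme.json")
        config_path = os.path.join(tmp, "config.json")
        with open(scheme_path, "w", encoding="utf-8") as handle:
            json.dump({"required": ["id", "author"]}, handle)
        with open(config_path, "w", encoding="utf-8") as handle:
            json.dump({}, handle)
        rules = [{"id": i, "author": "x"} for i in range(n_rules)]
        PerRuleValidator.reads = 0
        results = validate_all(rules, scheme_path, config_path)
        return PerRuleValidator.reads, results
```

In this sketch the number of reads grows as two per rule, so a file with 10,000 rules would trigger 20,000 reads and parses of files whose contents never change during the run.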
A proposed solution is to move the creation of the YaraValidator object, along with the ingest of the config files, out of the loop. The for loop can then proceed as normal; we just need to make sure the 'count' for required fields is reset for each rule, I believe, since the same object is now reused while the counts are still meant to be evaluated on a 'per rule' basis. As far as I can tell, that is how the counts should be evaluated.
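As a rough illustration of why the reset matters once the object is shared, here is a small sketch. `ReusableValidator`, `validate_rules`, and the "each required field appears exactly once" check are assumptions made for the example, not the real YaraValidator API or its actual counting logic.

```python
class ReusableValidator:
    """Hypothetical validator that is built once and reused for every rule."""

    def __init__(self, required_fields):
        # In the real tool this information would come from the config files,
        # which are now loaded a single time.
        self.required_fields = tuple(required_fields)
        self.counts = {}
        self.reset_counts()

    def reset_counts(self):
        # Start every rule with a clean tally of required fields.
        self.counts = {field: 0 for field in self.required_fields}

    def validate(self, rule_metadata, reset=True):
        if reset:
            self.reset_counts()
        for key in rule_metadata:
            if key in self.counts:
                self.counts[key] += 1
        # Assumed check: each required field must appear exactly once.
        return all(count == 1 for count in self.counts.values())


def validate_rules(rules, required_fields, reset=True):
    validator = ReusableValidator(required_fields)  # hoisted out of the loop
    return [validator.validate(rule, reset=reset) for rule in rules]


rules = [
    ["id", "author"],  # complete
    ["id"],            # missing author
    ["id", "author"],  # complete
]
print(validate_rules(rules, ["id", "author"]))               # [True, False, True]
print(validate_rules(rules, ["id", "author"], reset=False))  # [True, False, False]
```

Without the reset, the tallies from earlier rules leak into later ones, so the third rule is wrongly rejected even though it is complete.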
The perf improvement is pretty significant, going from 22m to ~9s on my little machine.
Run after changes (I added -w to ignore warnings; with warnings shown the timing is about the same, plus a few secs for stdout :) ).
time yara_validator big_rule_file.yar -vw
... truncated for brevity ...
----------------------------------------------------------------------------
All .yara Rule files found have been passed through the CCCS Yara Validator:
Total Yara Rule Files to Analyze: 1
Total Valid CCCS Yara Rule Files: 1
Total Warning CCCS Yara Rule Files: 0
Total Invalid CCCS Yara Rule Files: 0
---------------------------------------------------------------------------
real 0m9.682s
user 0m8.711s
sys 0m0.239s
[@cccs-rs]
Perf Improvement on 'validator.py':