Split YR_MAX_ATOM_LENGTH for string and re, and optimize the extract algorithm of regular atom to speed up the scan.

X3h1n commented 2 months ago

Split YR_MAX_ATOM_LENGTH to YR_MAX_ATOM_RE_LENGTH and YR_MAX_ATOM_LENGTH will speed up the scan. For string, its length could increase from 4 to more large e.g. 8, and this change can speed up the scan. For re, if we just increase the length of re atom, the number atoms may increases exponentially because of wildcards and the memory will explode , so we optimized re atom extract algorithm. On the one hand, increase string YR_MAX_ATOM_LENGTH to speed up the scan. On the other hand, control the selected re atom contains at most one wildcard. After finishing atom extract on line 1164 of atoms.c , we count the number of wildcards in the re atom. If the count above 1, we will further extract from atom, the following is the explain.

find best child atom in every length from the atom, the range of length is [1, YR_MAX_ATOM_RE_LENGTH]. Every best child atom must meet 2 requirements, first it is the best quality, second the number of wildcards is below 1.
select one highest quality atom that dost not contain any wildcard (no-wildcard atom) and save it in best_atom[0].
select one highest quality atom that contains half wildcard ( half-wildcard atom) and save in best_atom[1]. For example, {01 0? 03} contains half wildcard.
select one highest quality atom that contains one wildcard (one-wildcard) and save in best_atom[2].
To avoid no-wildcard atom and half-wildcard atom length is too short, we set YR_MIN_VALID_ATOM_LENGTH. If no-wildcard atom length greater than or equal to YR_MIN_VALID_ATOM_LENGTH, we will select it as the final best atom and stop the select algorithm. Or else if the half-wildcard atom length greater than or equal to YR_MIN_VALID_ATOM_LENGTH+1, we will select it as the final best atom and stop the select algorithm. Or else we will choose the longest atom as the best atom.

By following the above optimization, the scanning speed was increased by 6 times in our test.

google-cla[bot] commented 2 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

plusvic commented 2 months ago

Did you measure the performance gain after these changes? In experiments I've done in the past I've concluded that increasing the maximum atom length beyond 4 or maybe 6 bytes doesn't increase, but the memory grows considerably.

Can you share details about the tests you did?

X3h1n commented 2 months ago

The rules used in the test consist of strings and opcode, most of which are strings, and there are about 10,000 rules. About 1.3 million samples were tested. Before modification, the average scan time was 0.594. With YR_MAX_ATOM_LENGTH set to 8 and YR_MAX_ATOM_RE_LENGTH set to 8, the average scan time is 0.095. In our tests, we found that increasing YR_MAX_ATOM_LENGTH has no significant effect on the string atom. But increasing the length of re atom increases memory dramatically. Therefore, we have modified the extraction algorithm of re atom to limit the extracted atom to contain a maximum of one wildcard. Thus, the length of re atom does not have to be YR_MAX_ATOM_RE_LENGTH, but it certainly does not have a significant impact on memory.

X3h1n commented 2 months ago

Because YR_MAX_ATOM_LENGTH limits the length of strings and re, if only YR_MAX_ATOM_LENGTH is increased, the re atom may contain multiple wildcards, and the expansion of the re atom may reduce the gain in scan speed, while the memory will increase substantially. In our experiment, we're actually just increasing the length of the string atom. For re atoms, only the length of re atoms that meet the requirements (contains a maximum of one wildcard) is increased, otherwise there is no change compared with before modification.

plusvic commented 2 months ago

I still have some questions. You said that you have tested with a corpus of 10.000 rules, but most of them are plain strings and are not really affected by this change. Therefore there must be some regexps in that corpus which are the really slow ones, and those may have improved with you change.

I would like to take a look at those specific regexps that are slow with the current implementation and improve with your change, in order to understand better which type of regexps are benefitting from this change.

Also, notice that this change has broken some test cases.

X3h1n commented 1 month ago

Maybe my previous expression was not accurate. We increased the atom length of the string（YR_MAX_ATOM_LENGTH）, so scan speed of the string rules will increase. Because most of my test rules are string rules, my scanning performance is relatively good. However, our changes also take into account the possibility of re atom exploding, so the re atom selection algorithm has been optimized to accommodate the increased length of re atom.

VirusTotal / yara

Split YR_MAX_ATOM_LENGTH for string and re, and optimize the extract algorithm of regular atom to speed up the scan. #2107