VirusTotal / yara

The pattern matching swiss knife
https://virustotal.github.io/yara/
BSD 3-Clause "New" or "Revised" License
8.33k stars 1.45k forks source link

Split YR_MAX_ATOM_LENGTH for string and re, and optimize the extract algorithm of regular atom to speed up the scan. #2106

Closed X3h1n closed 2 months ago

X3h1n commented 2 months ago

Split YR_MAX_ATOM_LENGTH to YR_MAX_ATOM_RE_LENGTH and YR_MAX_ATOM_LENGTH will speed up the scan. For string, its length could increase from 4 to more large e.g. 8, and this change can speed up the scan. For re, if we just increase the length of re atom, the number atoms may increases exponentially because of wildcards and the memory will explode , so we optimized re atom extract algorithm. On the one hand, increase string YR_MAX_ATOM_LENGTH to speed up the scan. On the other hand, control the selected re atom contains at most one wildcard. After finishing atom extract on line 1164 of atoms.c , we count the number of wildcards in the re atom. If the count above 1, we will further extract from atom, the following is the explain.

  1. find best child atom in every length from the atom, the range of length is [1, YR_MAX_ATOM_RE_LENGTH]. Every best child atom must meet 2 requirements, first it is the best quality, second the number of wildcards is below 1.
  2. select one highest quality atom that dost not contain any wildcard (no-wildcard atom) and save it in best_atom[0].
  3. select one highest quality atom that contains half wildcard ( half-wildcard atom) and save in best_atom[1]. For example, {01 0? 03} contains half wildcard.
  4. select one highest quality atom that contains one wildcard (one-wildcard) and save in best_atom[2].
  5. To avoid no-wildcard atom and half-wildcard atom length is too short, we set YR_MIN_VALID_ATOM_LENGTH. If no-wildcard atom length greater than or equal to YR_MIN_VALID_ATOM_LENGTH, we will select it as the final best atom and stop the select algorithm. Or else if the half-wildcard atom length greater than or equal to YR_MIN_VALID_ATOM_LENGTH+1, we will select it as the final best atom and stop the select algorithm. Or else we will choose the longest atom as the best atom. By following the above optimization, the scanning speed was increased by 6 times in our test.
google-cla[bot] commented 2 months ago

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.