X3h1n opened this pull request 2 months ago
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
View this failed invocation of the CLA check for more information.
For the most up to date status, view the checks section at the bottom of the pull request.
Did you measure the performance gain after these changes? In experiments I've done in the past, I concluded that increasing the maximum atom length beyond 4 or maybe 6 bytes doesn't increase performance, but memory usage grows considerably.
Can you share details about the tests you did?
The rules used in the test consist of strings and opcodes, and most of them are string rules; there are about 10,000 rules in total. About 1.3 million samples were tested. Before the modification, the average scan time was 0.594; with YR_MAX_ATOM_LENGTH set to 8 and YR_MAX_ATOM_RE_LENGTH set to 8, the average scan time is 0.095. In our tests we found that increasing YR_MAX_ATOM_LENGTH has no significant effect on string atoms, but increasing the length of re atoms increases memory dramatically. Therefore, we modified the re atom extraction algorithm to limit each extracted atom to at most one wildcard. As a result, a re atom does not always reach YR_MAX_ATOM_RE_LENGTH bytes, but it no longer has a significant impact on memory.
Because YR_MAX_ATOM_LENGTH limits the length of both string and re atoms, increasing only YR_MAX_ATOM_LENGTH would allow re atoms to contain multiple wildcards; expanding those atoms may cancel out the gain in scan speed while memory grows substantially. In our experiment, we effectively only increase the length of string atoms. For re atoms, the length is increased only when the atom meets the requirement (it contains at most one wildcard); otherwise, nothing changes compared with the previous behavior.
I still have some questions. You said that you have tested with a corpus of 10,000 rules, but most of them are plain strings and are not really affected by this change. Therefore there must be some regexps in that corpus which are the really slow ones, and those may have improved with your change.
I would like to take a look at those specific regexps that are slow with the current implementation and improve with your change, in order to better understand which types of regexps benefit from it.
Also, notice that this change has broken some test cases.
Maybe my previous explanation was not accurate. We increased the atom length for strings (YR_MAX_ATOM_LENGTH), so the scan speed of string rules improves. Because most of my test rules are string rules, the overall scanning performance gain is large. However, our changes also take into account the possibility of re atoms exploding, so the re atom selection algorithm has been optimized to accommodate the increased atom length.
Splitting YR_MAX_ATOM_LENGTH into YR_MAX_ATOM_RE_LENGTH and YR_MAX_ATOM_LENGTH speeds up the scan. For strings, the atom length can grow from 4 to something larger, e.g. 8, which speeds up scanning. For regexps, simply increasing the atom length could make the number of atoms grow exponentially because of wildcards, and memory would explode, so we optimized the re atom extraction algorithm. On the one hand, we increase the string YR_MAX_ATOM_LENGTH to speed up the scan; on the other hand, we ensure that the selected re atom contains at most one wildcard. After atom extraction finishes on line 1164 of atoms.c, we count the number of wildcards in the re atom. If the count is above 1, we extract further from that atom, as explained below.
With the above optimization, scanning speed increased about 6x in our tests.