mstarzinger opened this issue 3 years ago
Thanks for your input! Very unfortunate that you have to deal with those bugs.

You are right: `n` for the mean computation has to be the number of counts. No idea why it was programmed that way. And the ordering of the scores should be identical for both means (less is always better). That bug is on me 😄.
Anyway, now that this is fixed, it's time to talk :)
I think it's hard to get good results for analysing the scoring algorithm with your code example. Keep in mind that we compute the scores based on all memory of the process, the registers, and the program counter, so there are a lot of (pseudo-)nondeterministic values in there.

To get a discussion going solely based on the different variants of combining the counts into a score, I made a little playground to compare both implemented means (https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fcde915148d6240bb2a0bb51566f7372). It tries to show a case where it actually makes a difference whether we compute the harmonic or the arithmetic mean.
The basic idea behind all of that is: for the arithmetic mean it's irrelevant whether 2 states have 5 equal values or 5 states have 2 equal values, right? In both situations we would get 5 as the score for every state. But because 1/x is a non-linear function, we get an interesting effect when computing the harmonic mean: it is essentially much worse to have values shared with many states (2 / (1/5 + 1/5) = 5), and better to have many values shared with few states (5 / (1/2 + 1/2 + 1/2 + 1/2 + 1/2) = 2).
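One way to see the effect in isolation (a standalone sketch of the two situations above, not the actual playground code): the raw counts sum to 10 in both cases, while the harmonic mean separates them.

```rust
/// Harmonic mean of the counter values: n / sum(1/c).
fn harmonic_mean(counts: &[u64]) -> f64 {
    counts.len() as f64 / counts.iter().map(|&c| 1.0 / c as f64).sum::<f64>()
}

fn main() {
    // 2 values, each shared with 5 states:
    let few_values_many_sharers = [5u64, 5];
    // 5 values, each shared with only 2 states:
    let many_values_few_sharers = [2u64, 2, 2, 2, 2];

    // The raw counts sum to 10 in both situations, so any score that just
    // adds up (or linearly rescales) the counts cannot tell them apart:
    assert_eq!(few_values_many_sharers.iter().sum::<u64>(), 10);
    assert_eq!(many_values_few_sharers.iter().sum::<u64>(), 10);

    // The harmonic mean does tell them apart (less is better):
    println!("{}", harmonic_mean(&few_values_many_sharers)); // ≈ 5 = 2 / (1/5 + 1/5)
    println!("{}", harmonic_mean(&many_values_few_sharers)); // = 2 = 5 / (5 * 1/2)
}
```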
On your idea with `sum`: interesting, but I think it's essentially the same thing as the arithmetic mean, only worse in a way. Given that all states have an equal number of values to score, the arithmetic mean and the sum are basically the same thing (a linear function with a different constant slope). If states have a different number of values to score, then the states with more values probably have a higher chance of getting bad (high) scores. And this point makes it worse for me, because I do not want to punish states with more values (= more memory allocated/touched).
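A standalone sketch of that point (counts made up; here both scores are over the inverse counters, higher = rarer): with equal state sizes the sum is just `n` times the mean, so the ranking is identical, but with unequal sizes the two can rank states differently.

```rust
/// Sum of inverse counter values (higher = rarer), hypothetical helper.
fn sum_score(counts: &[u64]) -> f64 {
    counts.iter().map(|&c| 1.0 / c as f64).sum()
}

/// Arithmetic mean of the inverse counter values (higher = rarer).
fn mean_score(counts: &[u64]) -> f64 {
    sum_score(counts) / counts.len() as f64
}

fn main() {
    // With equal sizes the sum is just n times the mean (same ranking):
    let a = [2u64, 4];
    let b = [3u64, 3];
    assert_eq!(sum_score(&a), 2.0 * mean_score(&a));
    assert_eq!(sum_score(&b), 2.0 * mean_score(&b));

    // A small state with one fairly rare value ...
    let small = [2u64];
    // ... and a large state with four moderately rare values:
    let large = [4u64, 4, 4, 4];

    // The sum prefers the large state (1.0 > 0.5) ...
    assert!(sum_score(&large) > sum_score(&small));
    // ... while the normalized mean prefers the small one (0.5 > 0.25):
    assert!(mean_score(&small) > mean_score(&large));
}
```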
That's my reasoning behind all of that. Hope that makes sense. 😅
@ckirsch do you also have some thoughts on that? Also Marcell @Blazefrost: Ideas? Comments?
We are competing with traditional fuzzers and symbolic execution (SE) engines. Keep that in mind.
We are probably able to explore more paths than SE but need guidance, which in turn needs to be better than what fuzzers are doing. Considering whole machine states is the promise here.
At this stage of the project I suggest we implement what others propose (rarity) as well as other metrics that you think make sense at the source-code level.
Ultimately, we need to demonstrate that we can find more bugs faster than any fuzzer or SE engine.
@ChristianMoesl: Thanks for the various fixes! Awesome! Those actually take care of most of the original issues I stumbled over in the scoring function. What remains is just the discussion of the semantic details.
For the sake of completeness I have added my proposed scoring functions to your playground. Please see this extended playground for the implementation. I want to emphasize again that I believe `sum` to be a faithful, unaltered version of the scoring function described in the paper by Brayton et al., which states: "Now each next state reached can be characterized by a weight equal to the sum of inverse values of the counters corresponding to specific flop group values appearing in this state."

Yes, I totally agree that the non-linear nature of the inverse (1/x) makes a big difference and makes the `harmonic` mean interesting. But the same argument holds for taking the inverse of the counters in general, as the paper seems to suggest. This way the "sum of inverse counter values" already exhibits the same interesting effect you described for `harmonic` above. One could say that this effect is built into `sum` and `arithmetic*` as well.
And yes, I agree that `sum` and `arithmetic*` are virtually equivalent when the number of available counters is the same in all states. All that `arithmetic*` does is normalize larger states with respect to smaller states in case state sizes differ. I do not (yet) have an opinion/intuition on how important such a normalization will be.
To try to summarize my argument: as long as we agree that `sum` is indeed a faithful implementation of what is described in the paper, I do believe we should add it to our portfolio, just so that we are able to compare against it.
One side-note I forgot to mention: the following two equivalences hold:

`arithmetic*(_) == 1 / harmonic(_)`
`harmonic*(_) == 1 / arithmetic(_)`
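These can be checked numerically. A small standalone sketch (with `n` taken as the number of counters on both sides, and hypothetical function names):

```rust
/// Means over the plain counter values (n = number of counters):
fn arithmetic(counts: &[u64]) -> f64 {
    counts.iter().sum::<u64>() as f64 / counts.len() as f64
}

fn harmonic(counts: &[u64]) -> f64 {
    counts.len() as f64 / counts.iter().map(|&c| 1.0 / c as f64).sum::<f64>()
}

/// Means over the inverse counter values (again n = number of counters):
fn arithmetic_star(counts: &[u64]) -> f64 {
    counts.iter().map(|&c| 1.0 / c as f64).sum::<f64>() / counts.len() as f64
}

fn harmonic_star(counts: &[u64]) -> f64 {
    // Harmonic mean of the inverses works out to n / sum(counts).
    counts.len() as f64 / counts.iter().sum::<u64>() as f64
}

fn main() {
    let counts = [1u64, 2, 3, 5, 8];

    // arithmetic*(_) == 1 / harmonic(_)
    assert!((arithmetic_star(&counts) - 1.0 / harmonic(&counts)).abs() < 1e-12);
    // harmonic*(_) == 1 / arithmetic(_)
    assert!((harmonic_star(&counts) - 1.0 / arithmetic(&counts)).abs() < 1e-12);
}
```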
So in essence my proposal includes the two scoring functions already implemented; it does not alter their relative ordering either, it just inverts the interpretation of the ordering. But the proposed `sum` is new in the sense that we don't have an implementation of it yet.
@mstarzinger Got it! You are right that we can implement that and measure it. But we should probably have a good idea of why it would be interesting.

I think what you propose with `sum` is actually a denormalised version of the harmonic mean, right? We could therefore also implement a denormalised version of the arithmetic mean, which is the sum of all counts.

In general I like the idea of biasing scores based on the amount of counts. But I would intuitively like to positively bias the states with more counts (more memory touched), because if more memory is touched, the program is probably much deeper in the execution path, right? The total amount of words touched is monotonically increasing along one execution path.
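If I read the proposed `sum` (sum of inverse counters, higher = rarer) right, it already has exactly this positive bias: every additional counter adds a positive term, so a state that touches more memory never scores lower. A tiny standalone sketch (counts made up):

```rust
/// Sum of inverse counter values (higher = rarer/better).
fn sum_score(counts: &[u64]) -> f64 {
    counts.iter().map(|&c| 1.0 / c as f64).sum()
}

fn main() {
    // Two snapshots on the same execution path; the later state has
    // touched two more words, so it has two additional counters:
    let earlier = [4u64, 4];
    let later = [4u64, 4, 8, 8];

    // Every counter contributes a positive term 1/c, so touching more
    // memory can never lower the denormalised score:
    assert!(sum_score(&later) > sum_score(&earlier));
}
```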
The current version of the scoring function `compute_scores` seems to weigh non-rare states higher than rare ones. Instead of selecting states that explored rare value combinations, it prefers the ones where the value/address count is higher. Let us consider the following example input.

In this example I expect the simulation to fall into the true-branch in about 10% of all runs. Hence incrementing the memory location of `i` will be rarer than incrementing `j` with high probability. I'd expect rarity simulation to select states that are looping in the true-branch. Note that the looping value of 150 was chosen so that a state needs to survive and be selected for exactly one iteration to reach an exit path (assuming the default step size of 1000).

This however does not happen (when using the default `harmonic` mean), see the following run. I also dump the scores vector for debugging purposes and to prove that at least one state did fall into the true-branch (two in this case, the ones with the lower score):

When running the same example with the non-default `arithmetic` mean, the rarer state is preferred, because `score_with_mean` returns `Ordering::Less` in this case. See the following run as an example. Again notice the single state with a lower score (which incidentally does get selected this time):

Even though the `arithmetic` mean works out in this example, I still believe both variants to be somewhat faulty. Here is my understanding of which scoring functions are currently implemented (sketched in code right after this list):

- `harmonic`: The harmonic mean over all (non-inverse) counters for a state, but with `n` being the number of states instead of the number of counters in the state.
- `arithmetic`: The arithmetic mean over all (non-inverse) counters for a state, but again with `n` being the number of states instead of the number of counters in the state. This time using `Ordering::Less`, which somewhat hides the issue.
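If that reading is right, the current semantics boil down to something like this standalone sketch (my own simplification with made-up numbers, not the actual `compute_scores` code):

```rust
// My reading of the current scoring: counter values are NOT inverted,
// and n is the number of states rather than the number of counters.
fn harmonic(counts: &[u64], number_of_states: usize) -> f64 {
    number_of_states as f64 / counts.iter().map(|&c| 1.0 / c as f64).sum::<f64>()
}

fn arithmetic(counts: &[u64], number_of_states: usize) -> f64 {
    // number_of_states is the same for every state, so this is just the
    // plain sum of the counts rescaled by a constant:
    counts.iter().sum::<u64>() as f64 / number_of_states as f64
}

fn main() {
    let number_of_states = 10;

    // A rare state (values shared with 1 other state) scores lower than
    // a common one (values shared with 4 other states) under both means:
    println!("{}", harmonic(&[2, 2], number_of_states));   // 10
    println!("{}", harmonic(&[5, 5], number_of_states));   // ≈ 25
    println!("{}", arithmetic(&[2, 2], number_of_states)); // 0.4
    println!("{}", arithmetic(&[5, 5], number_of_states)); // 1
}
```

Under this reading the rarer state always gets the lower score, so only the `Ordering` decides whether it wins, which matches the two runs described above.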
Here is what I think would be a better alternative for what we could try (a runnable sketch follows the list):
- `sum`: The sum of inverse(!) counter values, no mean calculation. I do believe this is the scoring function originally described in the paper; we should at least add it to our portfolio, even if we decide not to make it the default.
- `arithmetic*`: The arithmetic mean of the inverse(!) counter values, but importantly with `n` being the number of counters in the given state; otherwise we are just dividing by a constant and not normalizing with respect to the number of counters (i.e. the number of memory locations with a concrete value) in each individual state.
- `harmonic*`: The harmonic mean of the inverse(!) counter values, again with `n` set as above for the same reason.
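To make the proposal concrete, here is a standalone sketch of all three scores, including selection with the intuitive higher-is-rarer ordering (helper names and the example counts are mine, not from the playground):

```rust
use std::cmp::Ordering;

/// `sum`: the sum of inverse counter values, no mean (higher = rarer).
fn sum(counts: &[u64]) -> f64 {
    counts.iter().map(|&c| 1.0 / c as f64).sum()
}

/// `arithmetic*`: arithmetic mean of the inverse counter values,
/// with n = the number of counters in this state.
fn arithmetic_star(counts: &[u64]) -> f64 {
    sum(counts) / counts.len() as f64
}

/// `harmonic*`: harmonic mean of the inverse counter values,
/// which works out to n / sum(counts).
fn harmonic_star(counts: &[u64]) -> f64 {
    counts.len() as f64 / counts.iter().sum::<u64>() as f64
}

/// Select the rarest state using Ordering::Greater semantics
/// (the highest score wins).
fn rarest(states: &[Vec<u64>], score: fn(&[u64]) -> f64) -> usize {
    (0..states.len())
        .max_by(|&a, &b| {
            score(&states[a])
                .partial_cmp(&score(&states[b]))
                .unwrap_or(Ordering::Equal)
        })
        .expect("at least one state")
}

fn main() {
    // State 0 holds two values shared with only one other state each;
    // states 1 and 2 share all their values with two other states:
    let states = vec![vec![2, 2, 3], vec![3, 3, 3], vec![3, 3, 3]];

    // All three proposed scores agree that state 0 is the rarest:
    assert_eq!(rarest(&states, sum), 0);
    assert_eq!(rarest(&states, arithmetic_star), 0);
    assert_eq!(rarest(&states, harmonic_star), 0);
}
```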
Note that all three proposed scores use the intuitive `Ordering::Greater`, because higher scores represent rarer states in all three cases. The inverse was taken on the individual counter values before the mean calculation, not by flipping the interpretation of the mean. I do believe this makes a difference in semantics. And it also makes the interpretation of the score uniform: higher is rarer, higher is better.

WDYT about this alternative? Or am I completely off base here? All comments welcome!