cksystemsgroup / monster

Monster is a symbolic execution engine for 64-bit RISC-U code
https://cksystemsgroup.github.io/monster
MIT License

Scoring function of rarity simulation is borked #114

Open mstarzinger opened 3 years ago

mstarzinger commented 3 years ago

The current version of the scoring function compute_scores seems to weigh non-rare states higher than rare ones. Instead of selecting states that explored rare value combinations, it prefers the ones where the value/address count is higher. Let us consider the following example input.

uint64_t main() {
  uint64_t  i;
  uint64_t  j;
  uint64_t* x;

  i = 0;
  j = 0;
  x = malloc(8);

  *x = 0;

  read(0, x, 1);

  if (*x < 25) {
    *x = 0;
    while (i < 150) {
      i = i + 1;
    }
    return 1;
  } else {
    *x = 0;
    while (j < 150) {
      j = j + 1;
    }
    return 0;
  }
}

In this example I expect the simulation to fall into the true branch in about 10% of all runs (a single input byte is read, and 25 of the 256 possible values satisfy *x < 25). Hence, with high probability, incrementing the memory location of i will be rarer than incrementing j. I would expect rarity simulation to select the states that are looping in the true branch. Note that the loop bound of 150 was chosen so that a state needs to survive and be selected for exactly one iteration to reach an exit path (assuming the default step size of 1000).

This, however, does not happen when using the default harmonic mean; see the following run. I also dump the scores vector, both for debugging purposes and to show that at least one state did fall into the true branch (two in this case, the ones with the lower score):

$ time ./target/debug/monster rarity --states 10 --selection 3 --iterations 2 ./examples/select-rare.m
Running rarity simulation round 1...
Creating 10 new states (0 copied, 10 new)
Running engines (took 23.706977ms)
Remove 0 successfully exited states from selection
Scoring states (took 402.021139ms)
scores: [0.14409221902017258, 0.14727540500736333, 0.14727540500736333, 0.14409221902017258, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333, 0.14727540500736333]
selecting rarest 3 states
Running rarity simulation round 2...
Creating 7 new states (4 copied, 3 new)
Running engines (took 14.214393ms)
Remove 7 successfully exited states from selection
Scoring states (took 126.964981ms)
scores: [0.013235294117646955, 0.013235294117646955, 0.01311953352769669]
selecting rarest 3 states
no bug found in binary

real    0m1,181s
user    0m1,091s
sys 0m0,084s

When running the same example with the non-default arithmetic mean, the rarer state is preferred, because score_with_mean returns Ordering::Less in this case. See the following run as an example. Again, notice the single state with a lower score (which incidentally does get selected this time):

$ time ./target/debug/monster rarity --states 10 --selection 3 --iterations 2 --mean arithmetic ./examples/select-rare.m
Running rarity simulation round 1...
Creating 10 new states (0 copied, 10 new)
Running engines (took 24.153033ms)
Remove 0 successfully exited states from selection
Scoring states (took 403.984164ms)
scores: [1065.6, 1065.6, 1065.6, 1062.4, 1065.6, 1065.6, 1065.6, 1065.6, 1065.6, 1065.6]
selecting rarest 3 states
Running rarity simulation round 2...
Creating 7 new states (4 copied, 3 new)
bug found:
exit code greater than zero
concrete inputs read: [21]
pc: 0x1005c

real    0m1,050s
user    0m0,971s
sys 0m0,076s

Even though the arithmetic mean works out in this example, I still believe both variants to be somewhat faulty. Here is my understanding of the scoring functions that are currently implemented: both compute a mean (arithmetic or harmonic) directly over the raw counter values, where a counter c_i records how many states share the i-th (address, value) pair of a state; the n used in the mean computation is not the number of counters; and the two means do not even use the same ordering for selection.

Here is what I think would be a better alternative for what we could try:

- sum: the sum of the inverse counter values, Σ 1/c_i
- arithmetic*: the arithmetic mean of the inverse counter values, (Σ 1/c_i) / n
- harmonic*: the harmonic mean of the inverse counter values, which simplifies to n / Σ c_i

Note that all three proposed scores use the intuitive Ordering::Greater, because higher scores represent rarer states in all three cases. The inverse is taken on the individual counter values before the mean is computed, not by flipping the interpretation of the mean. I do believe this makes a difference in semantics. It also makes the interpretation of the score uniform: higher is rarer, higher is better.
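
For concreteness, here is a minimal sketch of the three proposed variants (the function names and the counts slice are made up for illustration; this is not the actual compute_scores code in monster):

```rust
// counts[i] is assumed to hold the counter value c_i, i.e. how many
// states share the i-th (address, value) pair of the scored state.

/// sum: the plain sum of inverse counter values (the paper's weight).
fn score_sum(counts: &[u64]) -> f64 {
    counts.iter().map(|&c| 1.0 / c as f64).sum()
}

/// arithmetic*: the arithmetic mean of the inverse counter values.
fn score_arithmetic(counts: &[u64]) -> f64 {
    score_sum(counts) / counts.len() as f64
}

/// harmonic*: the harmonic mean of the inverse counter values,
/// which simplifies to n divided by the sum of the raw counters.
fn score_harmonic(counts: &[u64]) -> f64 {
    counts.len() as f64 / counts.iter().map(|&c| c as f64).sum::<f64>()
}

fn main() {
    let rare = [2u64, 2, 2, 2];   // values shared by only 2 states each
    let common = [8u64, 8, 8, 8]; // values shared by 8 states each

    // Higher score = rarer state, uniformly for all three variants.
    println!("sum:        {} vs {}", score_sum(&rare), score_sum(&common));               // 2 vs 0.5
    println!("arithmetic: {} vs {}", score_arithmetic(&rare), score_arithmetic(&common)); // 0.5 vs 0.125
    println!("harmonic:   {} vs {}", score_harmonic(&rare), score_harmonic(&common));     // 0.5 vs 0.125
}
```

With any of the three, selection simply keeps the states with the largest scores.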

WDYT about this alternative? Or am I completely off base here? All comments welcome!

ChristianMoesl commented 3 years ago

Thanks for your input! It is very unfortunate that you had to deal with those bugs.

You are right, n for the mean computation has to be the number of counts. No idea why it was programmed that way. And the ordering of the scores should be identical for both means (less is always better). That bug is on me 😄.

Anyway, now that this is fixed, it's time to talk :)

I think it is hard to get meaningful results when analysing the scoring algorithm with your code example. Keep in mind that we compute the scores based on all memory of the process, the registers, and the program counter. So there are a lot of (pseudo-)nondeterministic values in there.

To get a discussion going that is solely about the different variants of combining the counts into a score, I made a little playground to compare both implemented means (https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=fcde915148d6240bb2a0bb51566f7372). I tried to show a case where it actually makes a difference whether we compute the harmonic or the arithmetic mean.

The basic idea behind all of that is: for the arithmetic mean it is irrelevant whether 2 states have 5 equal values or 5 states have 2 equal values, right? In both situations we would get 5 as the score for every state.

But on the other hand, because 1/x is a non-linear function, we get an interesting effect when computing the harmonic mean. It is essentially much worse to have values shared with many states (2 / (1/5 + 1/5) = 5), and better to have many values shared with few states (5 / (1/2 + 1/2 + 1/2 + 1/2 + 1/2) = 2).
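
To illustrate that non-linearity with a toy counter vector (made-up numbers, separate from the linked playground):

```rust
fn main() {
    // One rare (address, value) pair shared by a single state,
    // next to four common pairs shared by five states each.
    let counts = [1.0f64, 5.0, 5.0, 5.0, 5.0];
    let n = counts.len() as f64;

    let arithmetic = counts.iter().sum::<f64>() / n;
    let harmonic = n / counts.iter().map(|&c| 1.0 / c).sum::<f64>();

    println!("arithmetic mean: {}", arithmetic); // 4.2
    println!("harmonic mean:   {}", harmonic);   // ~2.78
}
```

The single rare counter pulls the harmonic mean down far more than the arithmetic mean, which is exactly the effect of the non-linear 1/x.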

On your idea with sum: interesting, but I think it is essentially the same thing as the arithmetic mean, only worse in a way. Given that all states have an equal number of values to score, the arithmetic mean and the sum are basically the same thing (the same linear function up to a constant factor). If states have a different number of values to score, then those states with more values probably have a higher chance of getting bad (high) scores. And that makes it worse for me, because I do not want to punish states with more values (= more memory allocated/touched).
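
To make that concrete with made-up counter vectors (again just a toy sketch, not the playground code): the sum of inverse counters scales with how many values a state has to score, while the arithmetic mean of the inverses does not:

```rust
fn sum_of_inverses(counts: &[f64]) -> f64 {
    counts.iter().map(|&c| 1.0 / c).sum()
}

fn main() {
    // Two states whose values are each shared by two states,
    // but which touched different amounts of memory.
    let small = [2.0, 2.0];
    let large = [2.0, 2.0, 2.0, 2.0];

    // sum of inverse counters: 1 vs 2 -- grows with the number of values
    println!("sum:  {} vs {}", sum_of_inverses(&small), sum_of_inverses(&large));

    // arithmetic mean of inverse counters: 0.5 vs 0.5 -- size normalized away
    println!("mean: {} vs {}",
        sum_of_inverses(&small) / small.len() as f64,
        sum_of_inverses(&large) / large.len() as f64);
}
```

Under the higher-is-rarer reading, sum therefore favors the state with more counters, whereas the mean variant treats both states identically.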

That's my reasoning behind all of that. Hope that makes sense. 😅

@ckirsch do you also have some thoughts on that? Also Marcell @Blazefrost: Ideas? Comments?

ckirsch commented 3 years ago

We are competing with traditional fuzzers and symbolic execution (SE) engines. Keep that in mind.

We are probably able to explore more paths than SE but need guidance, which in turn needs to be better than what fuzzers are doing. Considering whole machine states is the promise here.

At this stage of the project I suggest we implement what others propose (rarity) as well as other metrics that you think make sense on the source code level.

Ultimately, we need to demonstrate that we can find more bugs faster than any fuzzer or SE engine.

mstarzinger commented 3 years ago

@ChristianMoesl: Thanks for the various fixes! Awesome! Those actually take care of most of the original issues I stumbled over in the scoring function. What remains is just the discussion of the semantic details.

For the sake of completeness I have added my proposed scoring functions to your playground; please see this extended playground for the implementation. I want to emphasize again that I believe sum to be a faithful, unaltered version of the scoring function described in the paper by Brayton et al., which states: "Now each next state reached can be characterized by a weight equal to the sum of inverse values of the counters corresponding to specific flop group values appearing in this state."

Yes, I totally agree that the non-linear nature of the inverse (1/x) makes a big difference and makes the harmonic mean interesting. But the same argument holds for taking the inverse of the counters in general, as the paper seems to suggest. This way, the plain "sum of inverse counter values" already exhibits the same interesting effect you described for harmonic above. One could say that this effect is already built into sum and arithmetic* as well.

And yes, I agree that sum and arithmetic* are virtually equivalent when the number of available counters is the same in all states. All that arithmetic* does is normalize larger states with respect to smaller states in case the state sizes differ. I do not (yet) have an opinion or intuition on how important such a normalization will be.

To try to summarize my argument: As long as we agree that sum is indeed a faithful implementation of what is described in the paper, I do believe we should add it to our portfolio just so that we are able to compare against it.

mstarzinger commented 3 years ago

One side note I forgot to mention is the following pair of equivalences:

- the arithmetic mean of the inverse counter values is exactly 1 / (harmonic mean of the raw counter values), and
- the harmonic mean of the inverse counter values is exactly 1 / (arithmetic mean of the raw counter values).

So in essence my proposal includes the two scoring functions that are already implemented; it does not alter their relative ordering either, it just inverts the interpretation of the ordering. But the proposed sum is new in the sense that we do not have an implementation of it yet.
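
A quick numeric sanity check of those two equivalences (toy counter values, independent of the actual implementation):

```rust
fn main() {
    let counts = [1.0f64, 3.0, 5.0];
    let n = counts.len() as f64;

    let arithmetic_of_counts = counts.iter().sum::<f64>() / n;
    let harmonic_of_counts = n / counts.iter().map(|&c| 1.0 / c).sum::<f64>();

    let arithmetic_of_inverses = counts.iter().map(|&c| 1.0 / c).sum::<f64>() / n;
    let harmonic_of_inverses = n / counts.iter().sum::<f64>();

    // arithmetic mean of the inverse counters == 1 / harmonic mean of the counters
    assert!((arithmetic_of_inverses - 1.0 / harmonic_of_counts).abs() < 1e-12);
    // harmonic mean of the inverse counters == 1 / arithmetic mean of the counters
    assert!((harmonic_of_inverses - 1.0 / arithmetic_of_counts).abs() < 1e-12);

    println!("equivalences hold");
}
```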

ChristianMoesl commented 3 years ago

@mstarzinger I got it! You are right that we can implement and measure that. But we should probably have a good idea of why it would be interesting.

I think what you propose with sum is actually a denormalised version of the harmonic mean, right? We could therefore also implement a denormalised version of the arithmetic mean, which is the sum of all counts.

In general I like the idea of biasing scores based on the number of counts. But I would intuitively like to positively bias the states with more counts (more memory touched). Because I think that if more memory is touched, the program is probably much deeper in its execution path, right? The total number of words touched is monotonically increasing along one execution path.