Dongdongshe / K-Scheduler

A universal seed scheduler for fuzzers (LibFuzzer and AFL havoc mode) and concolic execution engine (qsym).
MIT License

The problem of edge coverage result deviation #11

Closed. egoterm closed this issue 1 year ago.

egoterm commented 1 year ago

Hi. I successfully ran k-scheduler on my server and did some preliminary experiments, but I have a question about the results.

1) Results: The coverage of k-scheduler on some programs is not very good. As shown in the figure below, I measured the edge coverage obtained by running AFL and k-scheduler on nm-new and readelf for 24 hours, and I repeated the experiment 20 times. On nm-new, the results of k-scheduler and AFL are comparable, but on readelf, the results obtained by k-scheduler are quite different from those of AFL.

[Figure: edge coverage]

2) Question: What is the reason for this phenomenon? I know that AFL itself is already quite advanced, so no fuzzer can be expected to outperform AFL on every program or in every scenario. However, nm-new and readelf are compiled from the same version of binutils, so I don't understand why the edge coverage of k-scheduler on readelf differs so much from its edge coverage on other programs. I also tested some other programs, and on some of them the edge coverage of k-scheduler is higher than that of AFL, but I cannot explain the result of k-scheduler on readelf.

3) Guess: At first I suspected that this large deviation was caused by a mistake on my part. However, I used the same k-scheduler configuration and command options for all 16 target programs that I fuzzed. My commands and configuration are as follows:

command: 

afl-fuzz command:

./afl-fuzz_kscheduler -i AFL/testcases/others/elf/ -o res/24/readelf/kscheduler_20 -t 2000 -m none -d  testcases/kscheduler/binutils-2.38/binutils/readelf -a @@
./afl-fuzz_kscheduler -i AFL/testcases/others/elf/ -o res/24/nm-new/kscheduler_20 -t 2000 -m none -d  testcases/kscheduler/binutils-2.38/binutils/nm-new -C @@

python command: 

python3 gen_dyn_weight.py

The folder of my k-scheduler is shown in the figure below. Each folder has an afl-fuzz-kscheduler and gen_dyn_weight.py, in addition to the tested target program and the image file of the target program.

[Figure: folder]

My server has a total of 100 logical cores, and I allocated 40 cores to the fuzzing task. When I run k-scheduler, I first start the gen_dyn_weight script, and then start the fuzz process:

[Figure: python]

So I would like to ask: have you encountered similar problems during your experiments? If so, how did you solve them? What causes this problem? Is it a problem with my configuration? Or does the system environment, such as the number of file descriptors already open on the system, affect the results of k-scheduler?

Dongdongshe commented 1 year ago

Thanks for your interest in K-Scheduler. I have been busy with interviews recently, and I will be available around mid-January.

There seems to be a misconfiguration of the K-Scheduler in your experiments. We evaluated the k-scheduler on readelf and nm before, and the results were good.

  1. From your evaluation results, AFL havoc mode achieved more than 10k edges within the first hour, but K-Scheduler + AFL havoc mode only reached 6369 edges after 24 hours. I am guessing you might have used two different readelf binaries. To ensure a fair and scientific comparison, please run vanilla AFL havoc mode on the same binary that was compiled for K-Scheduler.
  2. Could you provide the Binutils source code, llvm version, wllvm version, and your final binary, along with graph info data, so I can reproduce your results and debug the configuration? I'll make sure to get back to you when I am free.
  3. Could you also provide your baseline setting, including vanilla AFL, binary, and seed corpora?
  4. You added an additional parameter, "-t 2000", to K-Scheduler. Could you explain why you added that timeout parameter, and what happens if you remove it?
  5. You mentioned you tested 16 programs. How many programs have a similar issue (i.e., K-Scheduler significantly underperforms baseline), and how many programs have expected results (i.e., K-Scheduler can beat vanilla AFL havoc)?

Best, Dongdong

Dongdongshe commented 1 year ago

Resolved. I confirmed with egoterm through email that the wrong configuration (two different binaries) caused this issue.
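
For anyone who runs into the same symptom, one quick way to rule out this cause is to check that the baseline fuzzer and K-Scheduler are pointed at byte-identical target binaries. The following is only a minimal sketch and not part of K-Scheduler: the helper names `sha256_of` and `same_binary` are hypothetical, and it assumes a content hash is a sufficient check for "same binary".

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large binaries are not loaded at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_binary(path_a, path_b):
    """Return True if the two files have identical contents (by SHA-256)."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Calling `same_binary(...)` with the readelf path given to the vanilla AFL run and the one given to K-Scheduler (for example `testcases/kscheduler/binutils-2.38/binutils/readelf` from the command above) should return `True` before the two coverage curves are compared. The baseline path is whatever your own setup uses.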