Clarification on Evaluation Data Discrepancies in FuzzBench

asanrocks commented 1 year ago

I am writing to discuss some challenges I have encountered while attempting to reproduce the results presented in your insightful paper. Specifically, I am facing difficulties in replicating the performance of existing fuzzers, as outlined in Table 3 of the original publication.

In your paper, a minor discrepancy of 0.9% is mentioned for the code coverage of AFL and AFL++ when evaluating the program libxml2. However, during my own experimentation, I have observed a significant advantage of AFL++ over AFL, with a difference of up to 5% in terms of code coverage. Additionally, I noticed a noteworthy disparity in the raw coverage data. The coverage obtained from Clang's source-based coverage analysis exceeds 10,000 branches, whereas the data provided in your paper indicates a coverage of approximately 6,000 branches.

Given that I followed the setup and methodology described in FuzzBench (see also https://www.fuzzbench.com/reports/2023-05-06-sample/index.html), I suspect that I might have misconfigured the compiler flags or encountered issues with edge measurement. I was wondering if you could kindly provide me with some clarification regarding the disparities in the evaluation data between your paper and the FuzzBench framework, for example:

The compiler version and flags used to compile the programs.
The command line arguments when invoking the fuzzer.
The measurement of the edge coverage.

I genuinely appreciate your expertise and would be grateful for any assistance you can offer. Thank you very much for your time and attention to this matter.

Tricker-z commented 1 year ago

Hi, I'm glad you are using the CoFuzz fuzzing tool. Here are the answers to your questions.

The compiler version and flags used to compile the programs.

Please follow the instructions in the provided script to run readelf. We use wllvm to generate the bitcode of the whole program for the subsequent instrumentation. Note that wllvm can affect the edge coverage results because the AFL instrumentation differs, which likely explains the gap between 10,000 branches and 6,000 branches for libxml2.

The command line arguments when invoking the fuzzer.

We enable parallel mode with two AFL instances. The master fuzzer runs the original AFL, while the slave fuzzer enables only the havoc mutation stage. Previous work has shown the power of the havoc stage, and AFL++ runs the havoc stage by default. You can run AFL with -d to enable only the havoc stage and validate the performance.
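For readers who want to try the two-instance setup described above, here is a minimal sketch that only assembles the afl-fuzz command lines (it does not launch anything). The directory names, instance names, and the `./xmllint @@` target are hypothetical placeholders, not the values used in the paper. It relies on the standard AFL options -i, -o, -M, -S, and -d, and assumes, as the AFL parallel-fuzzing notes describe, that an -S instance skips the deterministic stages on its own.

```python
from typing import List, Sequence


def afl_cmd(in_dir: str, out_dir: str, target: Sequence[str],
            sync_flag: str = "", name: str = "",
            havoc_only: bool = False) -> List[str]:
    """Build an afl-fuzz argv list (hypothetical helper for illustration)."""
    cmd = ["afl-fuzz", "-i", in_dir, "-o", out_dir]
    if sync_flag:
        # -M <name> marks the master instance, -S <name> a slave instance.
        cmd += [sync_flag, name]
    if havoc_only:
        # -d skips the deterministic stages in a standalone run.
        cmd.append("-d")
    return cmd + ["--", *target]


# Placeholder target and directories (assumptions, not from the paper).
target = ["./xmllint", "@@"]
master = afl_cmd("seeds", "sync_dir", target, "-M", "master")
slave = afl_cmd("seeds", "sync_dir", target, "-S", "slave")
# Single-instance variant suggested in the reply: plain AFL with -d.
single = afl_cmd("seeds", "out", target, havoc_only=True)

for argv in (master, slave, single):
    print(" ".join(argv))
```

Both parallel instances share the same output directory so they can synchronize their queues; the single -d run uses its own directory.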

The measurement of the edge coverage.

We adopt two methods, which give the same edge coverage results: 1) use the fuzzing bitmap, where the edge coverage equals the number of bytes < 255; 2) use the AFL built-in tool afl-showmap to compute the edge coverage of each individual seed, then combine the results to get the total edge coverage.
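The two counting methods can be sketched roughly as follows. This is an illustration under assumptions, not the authors' script: it assumes the bitmap is a raw byte array in which 255 marks an untouched entry, and that afl-showmap output is in its default text form with one `edge_id:hit_count` pair per line. The function names and the in-memory sample data are made up for the example.

```python
from typing import Iterable, Set


def count_edges_in_bitmap(bitmap: bytes) -> int:
    """Method 1: count bitmap bytes whose value is below 255."""
    return sum(1 for b in bitmap if b < 255)


def parse_showmap(text: str) -> Set[int]:
    """Parse afl-showmap text output into a set of edge ids."""
    edges: Set[int] = set()
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        edge_id, _, _hits = line.partition(":")
        edges.add(int(edge_id))
    return edges


def merge_edges(per_seed: Iterable[Set[int]]) -> Set[int]:
    """Method 2: union the per-seed edge sets to get total coverage."""
    total: Set[int] = set()
    for edges in per_seed:
        total |= edges
    return total


# Tiny made-up inputs; real data would come from the fuzzer output
# directory and from one afl-showmap run per seed.
sample_bitmap = bytes([255, 255, 254, 0, 255, 127])
bitmap_edges = count_edges_in_bitmap(sample_bitmap)

seed_outputs = ["000001:1\n000007:3\n", "000007:1\n000042:2\n"]
merged = merge_edges(parse_showmap(t) for t in seed_outputs)

print(bitmap_edges, len(merged))
```

Note that an edge hit by several seeds is counted once in the merged set, which is why the per-seed results are combined by union instead of summed.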

We're also attempting to integrate CoFuzz into FuzzBench for better evaluation, but there are some obstacles.

Please let me know if you have any further questions :)

asanrocks commented 1 year ago

Thank you for your response! I appreciate the information you provided, as it greatly helps in comprehending your setup and reproducing the results. Nevertheless, I still have a few minor concerns regarding the baseline experiment (not the CoFuzz experiment). Specifically, I am interested in understanding the comparison between existing symbolic execution techniques and the popular fuzzers.

Given the inherently complex nature of fuzzing research, where a fuzzer's performance can be influenced by numerous variables, I would greatly appreciate it if you could provide me with further details regarding the specific flags used during the compilation of the binary and the execution of the AFL++ experiment. This includes the instrumentation mode, optimization level, and any special environment flags employed.

Thank you once again for your assistance!
