dvyukov / go-fuzz

Randomized testing for Go
Apache License 2.0

Add -dict option (like AFL's -x) to replace low-signal string literal list #315

Open roddux opened 3 years ago

roddux commented 3 years ago

In recent testing I've found that the ROData.strLits list of literals can fill up with useless noise: strings collected from places such as error messages, e.g.:

$ unzip project-fuzz.zip metadata
$ cat metadata | jq .Literals | rg 'invalid|error' | head -n5
    "Val": "crypto/aes: invalid key size ",
    "Val": "Error reading socket: %v",
    "Val": "http2: Transport conn %p received error from processing frame %v: %v",
    "Val": "invalid metric name",
    "Val": "RCodeNameError",
[...]

This list of literals is used directly by go-fuzz in the mutation logic, i.e.: https://github.com/dvyukov/go-fuzz/blob/6a8e9d1f2415cf672ddbe864c2d4092287b33a21/go-fuzz/mutator.go#L346-L367

Having lots of noise in strLits can therefore result in some fairly useless test cases, particularly for syntax-aware programs.

I propose this small change to add a -dict option, so that the user can manually supply a list of useful tokens to go-fuzz. This replaces the ROData.strLits tokens (built from the list in the metadata file) with a high-signal list that the user supplies.
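
As a rough illustration of the idea (not the code from this change), a line-oriented dictionary could be loaded as sketched below. The helper name parseDict and the one-token-per-line format are assumptions made for this example; in go-fuzz the text would come from the file named by -dict, and the resulting list would stand in for ROData.strLits.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseDict is a hypothetical helper: it splits dictionary text into tokens,
// one per line. Empty lines are skipped; every other line is kept as-is
// (bufio.Scanner strips the line terminator).
func parseDict(data string) []string {
	var tokens []string
	sc := bufio.NewScanner(strings.NewReader(data))
	for sc.Scan() {
		line := sc.Text()
		if line == "" {
			continue
		}
		tokens = append(tokens, line)
	}
	return tokens
}

func main() {
	// Stand-in for the contents of a user-supplied dictionary file.
	for _, tok := range parseDict("SELECT\nFROM\n\nWHERE\n") {
		fmt.Printf("%q\n", tok)
	}
}
```

A format this simple cannot express tokens that contain newlines or arbitrary bytes.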


Other thoughts

The signal of the built-in token list could perhaps be improved by modifying the code to skip string literals passed to functions such as log.Fatal or fmt.Print, etc. https://github.com/dvyukov/go-fuzz/blob/6a8e9d1f2415cf672ddbe864c2d4092287b33a21/go-fuzz-build/cover.go#L394
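
A minimal sketch of that filtering idea, using go/ast directly rather than go-fuzz-build's own literal collector. The function name collectLits and the contents of the noisy list are assumptions for illustration, and the match is purely syntactic (it compares the identifier before the dot and does not resolve imports).

```go
package main

import (
	"fmt"
	"go/ast"
	"go/parser"
	"go/token"
	"strconv"
)

// noisy lists calls whose string arguments are assumed to be messages rather
// than input tokens. The list is illustrative, not taken from go-fuzz.
var noisy = map[string]bool{
	"log.Fatal": true, "log.Fatalf": true, "log.Printf": true,
	"fmt.Print": true, "fmt.Printf": true, "fmt.Println": true,
	"fmt.Errorf": true, "errors.New": true,
}

// collectLits returns the string literals in src, skipping import paths and
// the whole argument subtree of any call listed in noisy.
func collectLits(src string) ([]string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "src.go", src, 0)
	if err != nil {
		return nil, err
	}
	var lits []string
	ast.Inspect(f, func(n ast.Node) bool {
		switch n := n.(type) {
		case *ast.ImportSpec:
			return false // import paths are not useful tokens
		case *ast.CallExpr:
			if sel, ok := n.Fun.(*ast.SelectorExpr); ok {
				if pkg, ok := sel.X.(*ast.Ident); ok && noisy[pkg.Name+"."+sel.Sel.Name] {
					return false // do not descend into this call
				}
			}
		case *ast.BasicLit:
			if n.Kind == token.STRING {
				if s, err := strconv.Unquote(n.Value); err == nil {
					lits = append(lits, s)
				}
			}
		}
		return true
	})
	return lits, nil
}

func main() {
	src := "package p\n\nimport \"log\"\n\nfunc f(s string) {\n" +
		"\tif s == \"SELECT\" {\n\t\tlog.Fatal(\"invalid query\")\n\t}\n}\n"
	lits, err := collectLits(src)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%q\n", lits)
}
```

Skipping the whole call subtree is a blunt choice: a useful literal nested inside a logging call's arguments would be dropped as well.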

roddux commented 3 years ago

Looking at this again, I think this also addresses #174.

dvyukov commented 3 years ago

Overall I am a fan of codifying expert knowledge and making it available to all users out of the box, rather than shifting the hard work onto every user. We could do better static analysis as you noted, intercept byte/string comparisons at runtime to build a dynamic dictionary, etc. But as Josh noted, the simplicity of this change is appealing, so I guess I don't mind.

dvyukov commented 3 years ago

> I'm torn on the format; I like how it's simple, but it's not hard to imagine newline characters being useful in literals. One sloppy option is to stay line-oriented, but apply strconv.Unquote if possible and if not, accept as-is. Then you can use a quoted string to get any literal you want in (including a literal that looks like a quoted string), while still having a simple form for everything else. What do you think?

Good point. Strictly speaking, the input format may be binary and one may want to include some magic binary sequences. Opportunistically trying strconv.Unquote may lead to some surprises, e.g.:

aaa
bbb
"foo"

where I literally want foo with quotes, but they will be silently stripped with no feedback...
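
The surprise can be reproduced with a few lines of Go. This is only a sketch of the "unquote if possible, else accept as-is" rule under discussion; parseToken is a hypothetical name, not something in go-fuzz.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseToken applies the opportunistic rule: if the line is a valid Go quoted
// literal, return its unquoted value; otherwise return the line unchanged.
func parseToken(line string) string {
	if s, err := strconv.Unquote(line); err == nil {
		return s
	}
	return line
}

func main() {
	lines := []string{
		`aaa`,         // not a quoted literal: kept as-is
		`"foo"`,       // quotes are silently stripped
		`"\"foo\""`,   // explicit way to get foo with its quotes
		`"line1\nl2"`, // escapes allow newlines in a token
		`"\x89PNG"`,   // and arbitrary bytes
		`'a'`,         // strconv.Unquote also accepts character literals
	}
	for _, line := range lines {
		fmt.Printf("%s -> %q\n", line, parseToken(line))
	}
}
```

Since strconv.Unquote also accepts single-quoted character literals and backquoted raw strings, lines of those shapes would be rewritten too, which widens the set of inputs that change silently.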

I can think of always using strconv.Unquote (somewhat cumbersome for users), or supporting either the current format or a JSON-encoded []string for better control. Is there any prior art in other fuzzers (AFL, libFuzzer, honggfuzz)?
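
For comparison, here is a sketch of the JSON-encoded []string option using encoding/json. The function name parseJSONDict is made up for this example, and nothing here reflects a decided format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// parseJSONDict is a hypothetical helper that decodes a dictionary stored as
// a JSON array of strings.
func parseJSONDict(data []byte) ([]string, error) {
	var tokens []string
	if err := json.Unmarshal(data, &tokens); err != nil {
		return nil, err
	}
	return tokens, nil
}

func main() {
	// The raw string keeps the backslashes so that JSON sees \" and \n.
	tokens, err := parseJSONDict([]byte(`["aaa", "\"foo\"", "line1\nline2"]`))
	if err != nil {
		fmt.Println("bad dictionary:", err)
		return
	}
	for _, tok := range tokens {
		fmt.Printf("%q\n", tok)
	}
}
```

Quoting is unambiguous in this form, so foo with quotes has to be written explicitly as "\"foo\"". One caveat for binary formats: JSON strings are Unicode text, so an escape such as \u0089 decodes to the two-byte UTF-8 encoding of U+0089 rather than the single byte 0x89.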

roddux commented 3 years ago

Thank you both for the feedback! I haven't forgotten about this PR - I'll find the time to work on this soon (hopefully within the next couple weeks).

josharian commented 3 years ago

Take all the time you need. :)