StringMatching can contain invalid UTF-8 byte sequences

elliot-u410 commented 1 year ago

This is either a bug or a feature. If you view a string generator as SliceOf(Byte).Transform(string) then it's correct and probably finds strange edge cases in functions that should be fixed. If instead you view the string generator as SliceOf(Rune).Transform(string) then it's producing strings that are not UTF-8 compatible. I haven't dug too deeply into this. I just know that I had to start building my own string generators for this reason.

elliot-u410 commented 1 year ago

As an example, this is what I ended up writing to generate random file paths:

func genFilePath() *rapid.Generator[string] {
    pathCharacter := rapid.Rune().Filter(func(r rune) bool { return !lo.Contains([]rune{filepath.Separator, 0}, r) })
    pathComponent := rapid.Transform(rapid.SliceOfN(pathCharacter, 0, 50), func(x []rune) string { return string(x) })
    return rapid.Transform(rapid.SliceOfN(pathComponent, 0, 20), func(x []string) string { return filepath.Join(x...) })
}

flyingmutant commented 1 year ago

Is there any reason you have not used String, StringOf or their variants? They both should always generate valid UTF-8 -- but valid UTF-8 can contain non-printable characters.

Rune should generate valid Unicode codepoints, too -- but not all of them are printable, of course.

elliot-u410 commented 1 year ago

I only encountered this with StringMatching. I assumed that was built on top of the other String generators but that may not be true.

elliot-u410 commented 1 year ago

Ok I constructed a test case that shows the differences I'm seeing. I still don't understand why this is happening and it's quite possible an error on my part:

func TestStringMatching(testing *testing.T) {
    rapid.Check(testing, func(t *rapid.T) {
        pathCharacter := rapid.Rune().Filter(func(r rune) bool { return !lo.Contains([]rune{filepath.Separator, 0}, r) })
        str := rapid.Transform(rapid.SliceOfN(pathCharacter, 0, 50), func(x []rune) string { return string(x) }).Draw(t, "file")
        assert.Assert(t, utf8.ValidString(str))
        dir := testing.TempDir()
        _, err := os.OpenFile(filepath.Join(dir, str), os.O_CREATE, 0700)
        assert.NilError(t, err)
    })

    rapid.Check(testing, func(t *rapid.T) {
        str := rapid.StringMatching("[^/\x00]+").Draw(t, "string")
        assert.Assert(t, utf8.ValidString(str))
        dir := testing.TempDir()
        _, err := os.OpenFile(filepath.Join(dir, str), os.O_CREATE, 0700)
        assert.NilError(t, err)
    })
}

This fails in the second scenario with an error like

assertion failed: error is not nil: open /var/folders/c6/zs7j5_n171g9vhw3sgkm0pzm0000gn/T/TestStringMatching955838286/1673/࢒: illegal byte sequence

flyingmutant commented 1 year ago

We can verify that StringMatching is behaving as expected:

func TestStringMatchingIssue47(t *testing.T) {
    Check(t, func(t *T) {
        s := StringMatching("[^/\x00]+").Draw(t, "s")
        if !utf8.ValidString(s) {
            t.Fatalf("not a valid UTF-8 string: %q", s)
        }
        if strings.ContainsAny(s, "/\x00") {
            t.Fatalf("%q contains / or NUL", s)
        }
    })
}

> go test -v -rapid.checks=10_000_000 -run=47
=== RUN   TestStringMatchingIssue47
    strings_external_test.go:22: [rapid] OK, passed 10000000 tests (21.841782396s)
--- PASS: TestStringMatchingIssue47 (21.84s)
PASS
ok      pgregory.net/rapid  21.846s

It is probably the case that os.OpenFile on your operating system has more restrictions on file names than any non-NUL non-slash UTF-8 bytes. If you run your first test long enough, it will probably fail, too -- I see no major difference in what they should generate.

elliot-u410 commented 1 year ago

Through some testing I can confirm that StringMatch has a much higher probability of producing unique samples than does the filtered-rune approach. Sampling 10,000,000 runes, the filtered-rune approach only produced 100,000 unique examples whereas StringMatch produced 800,000.

flyingmutant commented 1 year ago

Did you find out what are the restrictions on the valid filenames on your OS? Would be interesting to know.

elliot-u410 commented 1 year ago

I am on APFS and according to my research it allows all Unicode symbols (even / and NUL). According to some sources, some kernel APIs do not allow file names with those two symbols, hence why I filtered them out. I did try running as many tests against the filtered-rune approach with OpenFile. It never found a failure case but it crashed after about 30 minutes for an unrelated reason. However, based on my above test, it may be that I am just testing the same 100,000 runes over and over and not reaching the ones that would cause failure.

flyingmutant / rapid

StringMatching can contain invalid UTF-8 byte sequences #47