BurntSushi / quickcheck

Automated property based testing for Rust (with shrinking).
The Unlicense
2.4k stars 149 forks source link

Document that String is ASCII alphanumerics-only, maybe have a way of generating more Unicode? #77

Closed shepmaster closed 9 years ago

shepmaster commented 9 years ago

Right now, Strings only have ASCII alphanumeric characters. I tried to use it to generate strings with spaces in it and was surprised when my code was never called :smiley_cat:

It would be cool if there were more of the Unicode set included. Even awesomer might be if we had some tuning knobs where we could say "more punctuation, less letters", perhaps using the Unicode character classes?

Thanks for the fun-to-use library!

BurntSushi commented 9 years ago

Yeah, this definitely needs to be fixed. I think the ASCII limitation has been there and nobody has had cause enough to fix it.

There's basically two concerns here. The first is ergonomics. It's really nice when your witnesses are limited to characters that are likely to be meaningful in your terminal (i.e., printable characters). The second is that, of course, strings outside the ASCII range should be tested. When I first wrote quickcheck, I didn't know how to balance these so I just chose to address the first concern.

It seems like the principle of least surprise should apply here: String should be fuzzed with the set of Unicode scalar values and some other type (perhaps defined in quickcheck, i.e., AsciiString) should be more limited when you want a guarantee of nicer witnesses. I think it could deref to &String, which should mostly provide the ergonomics of a plain String.

shepmaster commented 9 years ago

It seems like shrinking could also come into play. I could conceive of a string with only ascii as being "smaller" than one with ascii + ascii punctuation. Then it could "grow" to include more common unicode, then "grow" towards uncommon. Just thinking outloud, really.

bluss commented 9 years ago

Has this changed? I think I saw wildly different results than this.

Just as a data point, this is what I ended up using for a string search testing / fuzzing:

#[derive(Clone, Debug)]
struct Text(String);

//static ALPHABET: &'static str = "ABCD";
static ALPHABET: &'static str = "\0\u{1}\u{2}\u{3}";

impl Arbitrary for Text {
    fn arbitrary<G: qc::Gen>(g: &mut G) -> Self {
        let len = u16::arbitrary(g);
        let mut s = String::with_capacity(len as usize);
        for _ in 0..len {
            let i = usize::arbitrary(g);
            let i = i % ALPHABET.len();
            s.push(ALPHABET.as_bytes()[i] as char)
        }
        Text(s)
    }
    fn shrink(&self) -> Box<Iterator<Item=Self>> {
        Box::new(self.0.shrink().map(Text))
    }
}

Since the algorithm was working on just the byte slice, I switched to an alphabet of 0-4 for even easier to read intermediate debug outputs..

shepmaster commented 9 years ago

@bluss it appears so! This could be considered closed via 848acee19b180b44f7cf1d74ac431bd68efd8667.

BurntSushi commented 9 years ago

Errm, right, I forgot to update this issue, but I grew tired of the ASCII-only behavior so I just changed it.

ASCII-only is still useful though, so we might want an AsciiString type or something.