cksac / fake-rs

A library for generating fake data in Rust.
Apache License 2.0

Getting same words or names most of the time #122

Closed: ghmendonca closed this 7 months ago

ghmendonca commented 1 year ago

I noticed that I get the same word or the same name almost every time. How can I make this much more random?

I'm running some tests against my API, which connects to a real database holding 52 documents. Pretty much every time I run the tests, the insert is rejected, because some fields must be unique and the generated words or names are already in the database.

tgross35 commented 7 months ago

One option is to add a Unique<T> wrapper, such as Unique<Username<..>>, that contains a HashSet of all produced values and rerolls the inner generator whenever it produces a duplicate. The HashSet would either need to be in a RefCell/Mutex, or the trait signatures would need to be updated to take &mut. Unique versions of fake::vec! and others could also be added that do this faster.
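
A minimal sketch of the reroll idea, assuming a standalone type rather than an integration with the fake crate's Dummy trait. The names Unique, generate, max_attempts and next_u64 are hypothetical, and the tiny LCG only stands in for a real RNG so the example needs no external crates. The set sits in a RefCell so that generation can take &self, mirroring how a faker config is passed by shared reference.

```rust
use std::cell::RefCell;
use std::collections::HashSet;
use std::hash::Hash;

/// Stand-in RNG step (a 64-bit LCG); a real implementation would use `rand`.
pub fn next_u64(state: &mut u64) -> u64 {
    *state = state
        .wrapping_mul(6364136223846793005)
        .wrapping_add(1442695040888963407);
    *state >> 33
}

/// Hypothetical wrapper: remembers every value it has handed out and
/// rerolls the inner generator when it produces a duplicate.
pub struct Unique<T, F> {
    inner: F,
    seen: RefCell<HashSet<T>>,
    max_attempts: usize,
}

impl<T, F> Unique<T, F>
where
    T: Eq + Hash + Clone,
    F: Fn(&mut u64) -> T,
{
    pub fn new(inner: F, max_attempts: usize) -> Self {
        Self {
            inner,
            seen: RefCell::new(HashSet::new()),
            max_attempts,
        }
    }

    /// Returns a value not produced before, or `None` if `max_attempts`
    /// rerolls in a row all collide (for example, the value space is used up).
    pub fn generate(&self, rng_state: &mut u64) -> Option<T> {
        let mut seen = self.seen.borrow_mut();
        for _ in 0..self.max_attempts {
            let candidate = (self.inner)(&mut *rng_state);
            // `insert` returns false when the value was already present.
            if seen.insert(candidate.clone()) {
                return Some(candidate);
            }
        }
        None
    }
}
```

The attempt cap is a design choice of this sketch: without it, a generator with a small value space would spin forever once every value has been seen. Rerolling also gets slower as the set fills up, which is the performance concern raised above.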

The other option, which is much more performant, is to use a linear congruential generator to select from the dictionary. These generators are pseudorandom, but with suitable parameters they can guarantee that no value repeats until the whole dictionary has been used. This seems like the better option, if it is possible to implement. It could use a different interface, since an RNG is not needed for getting items, only for selecting the initial seed.
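
A rough sketch of the LCG approach, under stated assumptions: LcgIndexer and its constants are hypothetical and not part of the fake crate. The modulus is the next power of two at or above the dictionary length; with an odd increment and a multiplier congruent to 1 mod 4, the recurrence has a full period (the Hull-Dobell conditions), and states that fall outside the dictionary are simply skipped. The constants are chosen for the no-repeat guarantee, not for statistical quality.

```rust
/// Hypothetical helper: yields every index in `0..len` exactly once per
/// cycle, in a scrambled order that depends only on the seed.
pub struct LcgIndexer {
    state: u64,
    modulus: u64, // smallest power of two >= len
    len: u64,
}

impl LcgIndexer {
    const A: u64 = 5; // multiplier, A ≡ 1 (mod 4)
    const C: u64 = 3; // increment, odd so it is coprime with the modulus

    pub fn new(len: usize, seed: u64) -> Self {
        assert!(len > 0, "dictionary must not be empty");
        let modulus = (len as u64).next_power_of_two();
        Self {
            state: seed % modulus,
            modulus,
            len: len as u64,
        }
    }

    /// Advances the generator and returns the next in-range index.
    pub fn next_index(&mut self) -> usize {
        loop {
            let current = self.state;
            // Wrapping arithmetic is safe here: the modulus divides 2^64.
            self.state = Self::A
                .wrapping_mul(self.state)
                .wrapping_add(Self::C)
                % self.modulus;
            // Skip states beyond the dictionary; fewer than half are skipped.
            if current < self.len {
                return current as usize;
            }
        }
    }
}
```

For a dictionary such as a word list, words[indexer.next_index()] returns each word once before any word repeats, with no set of previously seen values to maintain. After len calls the same order starts over, so a caller that needs more values than the dictionary holds still has to combine several dictionaries.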

There is some discussion, with linked issues, for Python's FactoryBoy at https://github.com/FactoryBoy/factory_boy/issues/305. I think their initial implementation uses a set.

proegssilb commented 7 months ago

I'm also doing some DB testing that needs certain fields to be unique. I guess I'll have to generate and load the data in Python for now (since it has a library that can do performant, unique data generation), and benchmark querying in Rust (for sub-ms precision).

Maybe if there's maintainer interest in a particular design for this, I could look into a PR, but for now, the path of least resistance lies elsewhere.

tgross35 commented 7 months ago

Figure it's worth an ask - @cksac do you have any ideas here?

cksac commented 7 months ago

I think it would be better to have a custom faker for the field that is required to be unique, like below:

use fake::{Dummy, Fake};
use once_cell::sync::Lazy;
use std::{collections::HashSet, sync::Mutex};

static ORDER_ID_CACHE: Lazy<Mutex<HashSet<usize>>> = Lazy::new(|| Mutex::new(HashSet::new()));

pub struct OrderIdFaker<U>(pub U);

impl<U> Dummy<OrderIdFaker<U>> for usize
where
    usize: Dummy<U>,
{
    fn dummy_with_rng<R: rand::prelude::Rng + ?Sized>(
        config: &OrderIdFaker<U>,
        rng: &mut R,
    ) -> Self {
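        let faker = &config.0;
        let mut id: usize = faker.fake_with_rng(rng);
        // The process-wide cache remembers every id handed out so far.
        let mut cache = ORDER_ID_CACHE.lock().unwrap();
        // Reroll until the id is new. Note: this never terminates once every
        // value the inner faker can produce (1000 ids for 0..1000) is cached.
        while cache.contains(&id) {
            id = faker.fake_with_rng(rng);
        }
        cache.insert(id);
        id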
    }
}

#[derive(Debug, Dummy)]
pub struct Order {
    #[dummy(faker = "OrderIdFaker(0..1000)")]
    id: usize,
}

fn main() {
    let orders = fake::vec![Order; 1..10];
    println!("{:?}", orders);
}

A Unique<T> wrapper will not work here because:

  1. Dummy can't be implemented for it, since the impl would overlap with implementations within the fake crate.
  2. There is no way to get a global cache of type T inside the dummy_with_rng fn.
  3. It can't support different caches for the same target type.

The wrapper and impl in question:

    pub struct Unique<T>(T);
    impl<U, T> Dummy<Unique<T>> for U where U: Dummy<T> {
    ....
    }

proegssilb commented 7 months ago

I'm not sure I'm on the same page as you regarding the technical limitations, and I'm not sure this note will be helpful, but it is worth mentioning for the record if nothing else.

If I have this schema (Python code):

schema_fun = lambda: {
        "username": field("person.username"),
        "pwd": "password",
        "name": field("full_name"),
        "email": field("person.email", unique=True),
        "created": field("timestamp", fmt=TimestampFormat.POSIX),
        "verified": field("timestamp", fmt=TimestampFormat.POSIX),
        "modified": field("timestamp", fmt=TimestampFormat.POSIX),
    }

Suppose internally, the thing that generates person.username would get re-used between the schema fields username and email. I don't actually need the internally-generated person.username to be unique between those two fields. I just need the same email address to not be generated twice.

(In practice, I actually had to drop the username field because I couldn't find the spot in the docs where mimesis provides unique usernames. Just unique emails.)

All that to say: locally unique outputs are, in fact, a useful start, and would solve problems.

--

Another thought: Suppose we had both (1) UniqueFromArray, that "shuffled" an array, and pulled each item at most once, and (2) UniqueFromArrays, that picked from multiple arrays and combined them according to a lambda (but only returned each combo once). That'd probably be a good start. Locally-unique-only, doesn't support the normal APIs, takes some serious hacking to do, but at least it enables problems to be solved without devising a custom algorithm from scratch to spit out each unique field individually. Like, email address would require manually combining First Name, Last Name, and Free Email Domain (or Lorem Ipsum Word + Lorem Ipsum Word + TLD for additional options). But that's far more approachable than having to write the same set-membership-check every time, or having to devise a custom linear congruential sequence every time.
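
A minimal sketch of the "UniqueFromArrays" idea described above, specialised to email addresses. Everything here is hypothetical (the UniqueEmails name, the fixed "first.last@domain" combiner in place of a caller-supplied lambda, the small LCG standing in for a real RNG). It shuffles the indices of the full product space once and decodes each index into one combination, so every combination comes out at most once. It assumes the input lists contain no duplicates and that the parts contain no characters that would make two different combinations format to the same string.

```rust
/// Hypothetical sketch: yields every first/last/domain combination at most
/// once, in shuffled order, then stops.
pub struct UniqueEmails<'a> {
    firsts: &'a [&'a str],
    lasts: &'a [&'a str],
    domains: &'a [&'a str],
    order: Vec<usize>, // shuffled indices into the product space
}

impl<'a> UniqueEmails<'a> {
    pub fn new(
        firsts: &'a [&'a str],
        lasts: &'a [&'a str],
        domains: &'a [&'a str],
        seed: u64,
    ) -> Self {
        let total = firsts.len() * lasts.len() * domains.len();
        let mut order: Vec<usize> = (0..total).collect();
        // Fisher-Yates shuffle driven by a tiny LCG (stand-in for a real RNG).
        let mut state = seed;
        for i in (1..order.len()).rev() {
            state = state
                .wrapping_mul(6364136223846793005)
                .wrapping_add(1442695040888963407);
            let j = ((state >> 33) % (i as u64 + 1)) as usize;
            order.swap(i, j);
        }
        Self { firsts, lasts, domains, order }
    }
}

impl<'a> Iterator for UniqueEmails<'a> {
    type Item = String;

    fn next(&mut self) -> Option<String> {
        // Each popped index is decoded as a mixed-radix number: one digit
        // per input list, so distinct indices give distinct combinations.
        let mut n = self.order.pop()?;
        let first = self.firsts[n % self.firsts.len()];
        n /= self.firsts.len();
        let last = self.lasts[n % self.lasts.len()];
        n /= self.lasts.len();
        let domain = self.domains[n];
        Some(format!("{first}.{last}@{domain}"))
    }
}
```

The shuffled index vector costs memory proportional to the number of combinations, which is fine for test fixtures but would need a different index source for very large product spaces. Uniqueness is local to one UniqueEmails value, which matches the "locally unique" scope discussed here.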

cksac commented 7 months ago

Hi @proegssilb, that is what I proposed in my previous suggestion. In the example below, the email is unique among the generated user profile instances and is not related to the username. Your proposed approach can be implemented in a different faker if you like.

use fake::faker::internet::en::*;
use fake::{Dummy, Fake};
use once_cell::sync::Lazy;
use std::{collections::HashSet, sync::Mutex};

static EMAIL_CACHE: Lazy<Mutex<HashSet<String>>> = Lazy::new(|| Mutex::new(HashSet::new()));

pub struct UniqueEmailFaker;

impl Dummy<UniqueEmailFaker> for String {
    fn dummy_with_rng<R: rand::prelude::Rng + ?Sized>(
        _config: &UniqueEmailFaker,
        rng: &mut R,
    ) -> Self {
        let mut email: String = FreeEmail().fake_with_rng(rng);
        let mut cache = EMAIL_CACHE.lock().unwrap();
        while cache.contains(&email) {
            email = FreeEmail().fake_with_rng(rng);
        }
        cache.insert(email.clone());
        email
    }
}

#[derive(Debug, Dummy)]
pub struct UserProfile {
    #[dummy(faker = "Username()")]
    pub username: String,
    #[dummy(faker = "UniqueEmailFaker")]
    pub email: String,
}

fn main() {
    let user_set_1 = fake::vec![UserProfile; 1..10];
    println!("{:?}", user_set_1);

    let user_set_2 = fake::vec![UserProfile; 1..10];
    println!("{:?}", user_set_2);

    // no duplicate emails among user_set_1 and user_set_2, unless EMAIL_CACHE is cleared
}
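
The closing comment in main mentions clearing EMAIL_CACHE. One way that could look is sketched below, assuming the standard library's OnceLock in place of once_cell's Lazy so that the example has no external dependencies. The helper names email_cache, claim_email and reset_email_cache are hypothetical and not part of the fake crate.

```rust
use std::collections::HashSet;
use std::sync::{Mutex, OnceLock};

static EMAIL_CACHE: OnceLock<Mutex<HashSet<String>>> = OnceLock::new();

/// Lazily creates the process-wide cache on first use.
fn email_cache() -> &'static Mutex<HashSet<String>> {
    EMAIL_CACHE.get_or_init(|| Mutex::new(HashSet::new()))
}

/// Records `email`; returns false if it had already been handed out.
pub fn claim_email(email: &str) -> bool {
    email_cache().lock().unwrap().insert(email.to_owned())
}

/// Hypothetical reset hook: forgets all emails handed out so far, for
/// example between independent test cases that start from an empty table.
pub fn reset_email_cache() {
    email_cache().lock().unwrap().clear();
}
```

A faker's reroll loop could call claim_email and retry while it returns false. Resetting only makes sense when the data the cache guards is reset as well; otherwise previously issued values can be generated again.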