fakedata-haskell / fakedata

Haskell Library for producing quality fake data
BSD 3-Clause "New" or "Revised" License
149 stars 22 forks source link

Repetitive results when generating #46

Closed sergesku closed 2 years ago

sergesku commented 2 years ago

As a result of the specific control of the internal StdGen, duplicate values can be generated in some cases

λ> generateNonDeterministic ((,,,) <$> pickAny <*> pickAny <*> pickAny <*> pickAny) :: IO (Int, Int, Int, Int)
(-390874233778619447,3346224903095245425,3346224903095245425,3346224903095245425)

generateNonDeterministic $ do
    lst1 <- replicateM 3 pickAny
    lst2 <- replicateM 3 pickAny
    lst3 <- replicateM 3 pickAny
    pure (lst1,lst2,lst3)

( [6327389221239573927,-7430060028677694428,507465317709549893]
, [-7430060028677694428,507465317709549893,5136949682318167298]
, [507465317709549893,5136949682318167298,-6349452383461238050]
)
psibi commented 2 years ago

I would be happy to accept a PR.

ramirez7 commented 2 years ago

generate results in non-random results entirely:

λ generate  ((,,) <$> pickAny @Int <*> pickAny @Int <*> pickAny @Int)
(8630743491499986047,8630743491499986047,8630743491499986047)

Looking at the source, the rng isn't threaded through:

pickAny :: forall a m. (Monad m, Random a) => FakeT m a
pickAny = FakeT (\settings -> let (x, _) = random (getRandomGen settings) in pure x)

So every call to pickAny is always going to result in the same a for deterministic mode.

psibi commented 2 years ago

generate results in non-random results entirely:

By default, the values generated are deterministic.

psibi commented 2 years ago

I think explicity documenting it in Haddock would be a good thing: https://hackage.haskell.org/package/fakedata-1.0.2/docs/Faker.html#v:generate

ramirez7 commented 2 years ago

Should that be what "deterministic" means here though? Fake data that is always the same doesn't seem useful. I don't see why anyone would ever use generate as-is. Even in my example, why would I want all my pickAnys in my applicative composition to be the same? I'd want them to be different - but the same every time I ran with deterministic mode set to "on".

It would still be "deterministic" if it uses the given StdGen (which is based on a seed). But it would be deterministically random (i.e. it generates the same random value for the given seed) instead of not random at all.

So I think the seed should be threaded through (making it an IORef makes this possible).

psibi commented 2 years ago

IIRC, It was made like that for perf reasons. Constructing individual record types didn't need StdGen to be changed as the fields within it were different.

psibi commented 2 years ago

I gave this a much better look today. Using applicative instance doesn't change the underlying stdgen (I think I did it because if not it will break applicative laws, but if someone can prove otherwise - I would be happy to accept a PR in the form of quickcheck-classes)

I will see if I can add tests for all the properties using quickcheck-classes, but my OSS time is limited these days.

So how to produce a different set of numbers using pickAny:

Prelude Faker Faker.Combinators> generateNonDeterministic $ sequence [pickAny @Int, pickAny @Int, pickAny @Int]
[-7259296580907384170,3667397267649137371,4885350613124653216]
Prelude Faker Faker.Combinators> generateNonDeterministic $ sequence [pickAny @Int, pickAny @Int, pickAny @Int]
[-4133709365272433231,-4842076987059227700,6130434414512083199]

And I would write your second version of code like this to produce the intended effect:

lst1 <- generateNonDeterministic $ replicateM 3 pickAny
lst2 <- generateNonDeterministic $ replicateM 3 pickAny
lst3 <- generateNonDeterministic $ replicateM 3 pickAny

Or it can be made even more shorter:

generateNonDeterministic $ sequence [ replicateM 3 pickAny, replicateM 3 pickAny, replicateM 3 pickAny]
sergesku commented 2 years ago

This method will not work to create absolutely different values.

generateNonDeterministic $ sequence [ replicateM 3 pickAny, replicateM 3 pickAny, replicateM 3 pickAny] :: IO [[Int]]
[ [-6647855527820622346,-1896711671625956513,-8882849823799143032]
, [-1896711671625956513,-8882849823799143032,-4477016760604416600]
, [-8882849823799143032,-4477016760604416600,6599046923786404784]]

FakeT is only really a reader internally, so it can't replicate the full power of state.

psibi commented 2 years ago

Ah, right. In that case these are the two options:

lst1 <- generateNonDeterministic $ replicateM 3 pickAny
lst2 <- generateNonDeterministic $ replicateM 3 pickAny
lst3 <- generateNonDeterministic $ replicateM 3 pickAny

Or in this specific case:

generateNonDeterministic $ replicateM 12 pickAny
Lambda-Logan commented 2 years ago

I think this is a good solution for lists: https://github.com/fakedata-haskell/fakedata/pull/47

A more general solution to thread a modified StdGen through a <*> would be very sexy though.

Lambda-Logan commented 2 years ago

How about the following compromise?

Have a separate version of apply: variedApply ... It would be slower but produce varied results. We could also flip the angle brackets 90 degrees to have a corresponding infix: v*v

Then we save on perf when it's not needed and don't break any laws

Lambda-Logan commented 2 years ago

As for the other combinators, we could have a separate twin module: Combinators.Varied

psibi commented 2 years ago

A more general solution to thread a modified StdGen through a <*> would be very sexy though.

I would be happy to take a PR for it with also additional test cases that application laws won't break: https://hackage.haskell.org/package/quickcheck-classes-0.6.5.0/docs/Test-QuickCheck-Classes.html#v:applicativeLaws

psibi commented 2 years ago

I have release a new version in the hackage with improved docs. Also added a section on README that might be helpful: https://github.com/fakedata-haskell/fakedata#deterministic-vs-non-deterministic-values

We also have a new function named generateNonDeterminsicWithFixedSeed in the newer releaese: https://github.com/fakedata-haskell/fakedata/blob/master/ChangeLog.md#103

With that I'm closing the issue. But free feel to re-open if there are any more questions on this.