bchavez / Bogus

:card_index: A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

Duplicates within the Lorem database #395

Closed qubitz closed 2 years ago

qubitz commented 2 years ago

Version Information

Software | Version(s)
Bogus NuGet Package | 33.0.2
.NET | 5.0
Windows OS? | 10, Version 1909
IDE | Rider

What locale are you using with Bogus?

The default locale. I've never specified the locale.

What is the expected behavior?

I expected new Bogus.DataSets.Lorem().Words(15) not to produce duplicates. That said, I didn't see any such guarantee in the documentation, so this is most likely an assumption on my part.

What is the actual behavior?

I am seeing duplicate words produced, sometimes even adjacent to one another. For example, the seed of 1177614182 produces the sequence

facere, sequi, ea, nam, quia, voluptas, voluptas, dicta, recusandae, atque, nemo, sed, ut, consequatur, inventore
                              ^^^^^^^^  ^^^^^^^^

How do you reproduce the issue?

So far I have found two seeds that produce duplicates within the first 15 elements: 1177614182 and 1283823404. I'm sure there are more; these are just the ones I stumbled upon.

Do you have a unit test that can demonstrate the bug?

[Fact]
public void DoesNotProduceDuplicates()
{
    var l = new Bogus.DataSets.Lorem()
    {
        Random = new Randomizer(1177614182),
    };

    l.Words(15).Should().OnlyHaveUniqueItems();
}

If the bug is confirmed, would you be willing to submit a PR?

Yeah, sure, but I would need to be pointed in the right direction.

bchavez commented 2 years ago

Bogus does not guarantee uniqueness. You'll have to figure out what uniqueness means to you and your application based on your specific needs. There are fundamental mathematical limits and characteristics of pseudo-random number generators that make uniqueness almost impossible to attain without some kind of repetition. Also, you can find more information here: https://github.com/bchavez/Bogus/issues/251#issuecomment-526354033
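
To make the point about mathematical limits concrete, here is a rough birthday-problem estimate. It assumes (this is an assumption, not something confirmed in this thread) that each of the k words is drawn independently and uniformly from a list of N words:

$$
P(\text{at least one duplicate}) \;=\; 1 - \prod_{i=0}^{k-1}\left(1 - \frac{i}{N}\right) \;\approx\; 1 - e^{-k(k-1)/(2N)}
$$

For a purely hypothetical N = 200 and k = 15 this comes to roughly 0.4, so under those assumptions a repeated word within 15 draws would be a common outcome rather than a sign of a defect.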

Your only real practical solution in O(n) runtime and space is:

void Main()
{
   var f = new Faker(){ Random = new Randomizer(1177614182) };
   Enumerable.Range(0, 7).Select(_ => f.Lorem.GetUniqueWord()).Dump();
}

public static class ExtensionsForBogus
{
   private static ulong UniqueWordCounter = 0;

   public static string GetUniqueWord(this Lorem dataset)
   {
      return $"{dataset.Word()}{UniqueWordCounter++}";
   }
}

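
If the goal is distinct dictionary words rather than words with a counter suffix, one possible alternative is to keep drawing and discard repeats. The following is only a sketch: UniqueSampling, TakeDistinct and maxAttempts are hypothetical names that are not part of Bogus, and the helper takes a plain Func<string> so it makes no assumptions about the Bogus API beyond the Word() call already used above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of Bogus: collects distinct values from any word source.
public static class UniqueSampling
{
    // Calls nextWord until `count` distinct values have been collected.
    // Gives up after maxAttempts draws so that a source with fewer than
    // `count` distinct values cannot cause an endless loop.
    public static List<string> TakeDistinct(Func<string> nextWord, int count, int maxAttempts = 10_000)
    {
        if (nextWord == null) throw new ArgumentNullException(nameof(nextWord));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        var seen = new HashSet<string>(StringComparer.Ordinal);
        var result = new List<string>(count);

        for (var attempts = 0; result.Count < count; attempts++)
        {
            if (attempts >= maxAttempts)
                throw new InvalidOperationException(
                    $"Only {result.Count} distinct values found after {maxAttempts} draws.");

            var word = nextWord();
            // HashSet.Add returns false for a value already seen, so repeats are skipped.
            if (seen.Add(word)) result.Add(word);
        }

        return result;
    }
}

// Possible usage with Bogus (assumes the Faker setup shown earlier in this thread):
//   var f = new Faker { Random = new Randomizer(1177614182) };
//   var words = UniqueSampling.TakeDistinct(() => f.Lorem.Word(), 15);
```

With a fixed seed the result should still be reproducible, but it will not be the same sequence that Words(15) returns, and the extra draws grow quickly as count approaches the number of distinct words available.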

qubitz commented 2 years ago

I only searched issues for "duplicates" not "unique" 🤦‍♂️. Thanks for the blazing fast response and references.