bchavez / Bogus

:card_index: A simple fake data generator for C#, F#, and VB.NET. Based on and ported from the famed faker.js.

Strategies for relationships and foreign keys? #163

Closed VictorioBerra closed 5 years ago

VictorioBerra commented 5 years ago

I struggle with this a lot. Many times I have relationships which might be one-to-one or one-to-many which means I might have up to three entities in a circular reference.

Usually, EF can figure this out, e.g.:

var dog = new Dog()
{
    Name = "Floof",
    Breed = new Breed() // One to one
    {
        Name = "Poodle"
    }
};

EF Core will create the Breed, set Dog.BreedId, and off we go. With Bogus, just replace the Breed with a faker instance and it will be converted and created. But sometimes this is more difficult with complex relationships.

Here is an example I came up with using the new EF Core HasData seeding stuff.

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
    var numToSeed = 40;

    var breedIds = 1;
    var colorIds = 1;
    var dogIds = 1;
    var catIds = 1;
    var catColorLineIds = 1;

    modelBuilder.Entity<Color>().HasData(
        new Faker<Color>().StrictMode(true)
        .RuleFor(d => d.Id, f => colorIds++)
        .RuleFor(d => d.ColorName, f => f.Internet.Color())
        .Generate(numToSeed).ToArray());

    modelBuilder.Entity<Breed>().HasData(
        new Faker<Breed>().StrictMode(true)
        .RuleFor(d => d.Id, f => breedIds++)
        .RuleFor(d => d.BreedName, f => f.Name.FirstName())
        .Generate(numToSeed).ToArray());

    // Shows a one-to-one relationship with Breed
    // this would probably be better as CatBreed and DogBreed, but this is just an example to show a "lookup table"
    modelBuilder.Entity<Cat>().HasData(
        new Faker<Cat>().StrictMode(false)
        .RuleFor(d => d.Id, f => catIds++)
        .RuleFor(d => d.Name, (f, u) => f.Name.FirstName())
        .RuleFor(d => d.MeowLoudness, f => f.Random.Number(1, 10))
        .RuleFor(d => d.TailLength, f => f.Random.Number(1, 10))
        .RuleFor(d => d.BreedId, f => f.Random.Number(1, breedIds - 1)) // This associates a random cat with a random breed. Pretty neat.
        .Generate(numToSeed).ToArray());

    modelBuilder.Entity<Dog>().HasData(
        new Faker<Dog>().StrictMode(false)
        .RuleFor(d => d.Id, f => dogIds++)
        .RuleFor(d => d.Name, (f, u) => f.Internet.UserName())
        .RuleFor(d => d.BarkLoudness, f => f.Random.Number(1, 10))
        .RuleFor(d => d.TailLength, f => f.Random.Number(1, 10))
        .RuleFor(d => d.BreedId, f => f.Random.Number(1, breedIds - 1))
        .Generate(numToSeed).ToArray());

    // Shows a many-to-many : cat-cat_color_line-color
    modelBuilder.Entity<ColorCatLine>().HasData(
        new Faker<ColorCatLine>().StrictMode(false)
        .RuleFor(d => d.Id, f => catColorLineIds++)
        .RuleFor(d => d.CatId, f => f.Random.Number(1, catIds - 1))
        .RuleFor(d => d.ColorId, f => f.Random.Number(1, colorIds - 1))
        .Generate(numToSeed).ToArray());

}

Hope this helps someone and I would love to see how other people are solving this issue. Some thoughts on my mind on improvements:

How can we use HasData in our unit/integration tests? Typically you may want to do something like verify a specific cat comes back when you query for a specific breed. But with a bulk random data insert like this, you lose fine-tuned control over knowing exactly what is in your DB. I see a strategy all over where people have a static seed class like MigrateAndSeed.SeedCats(). I started with this and things quickly got complicated: I was forced to do MigrateAndSeed.SeedCatsAndBreedsAndColors().

bchavez commented 5 years ago

Hi Victorio,

Thank you for your issue. I'm sorry you're having trouble. I will try to help.

However, I'm not sure I fully understand your issue. Also, I don't have much experience using EF so I'm not familiar with HasData(). I'm more of an NHibernate :bear:/NoSQL:no_entry_sign: fan. :smiley_cat: If you could post a real compilable example that clearly demonstrates where circular dependencies are appearing that would really help. Also, if you could be a little more clear as to which sections of code are giving you trouble that would help too. As far as I can tell, the code you've posted looks like it would behave exactly as it was written.

The general sense of the problem I get from your issue is that you're looking for a way to generate deterministic outputs with Bogus so that you can use the fake data that Bogus generates in your unit tests and database.

I think the best I can do right now without a better understanding of your issue is to hopefully provide some useful guidelines for using Bogus with EF. They are as follows:

1. Generate Sequentially and Upfront

As a best practice and general rule of thumb, you'll want to generate and precompute your fake data sequentially and upfront before it touches the database or your unit tests. Here's an example:

public static class FakeData
{
   public static Faker<ColorCatLine> ColorCatLineFaker;
   public static Faker<Dog> DogFaker;
   public static Faker<Cat> CatFaker;
   public static Faker<Breed> BreedFaker;
   public static Faker<Color> ColorFaker;

   public static void Init()
   {
      const int numToSeed = 40;

      var colorIds = 1;
      ColorFaker = new Faker<Color>()
         .StrictMode(true)
         .UseSeed(1122)
         .RuleFor(d => d.Id, f => colorIds++)
         .RuleFor(d => d.ColorName, f => f.Internet.Color());

      Colors = ColorFaker.Generate(numToSeed);

      var breedIds = 1;
      BreedFaker = new Faker<Breed>()
         .StrictMode(true)
         .UseSeed(3344)
         .RuleFor(d => d.Id, f => breedIds++)
         .RuleFor(d => d.BreedName, f => f.Name.FirstName());

      Breeds = BreedFaker.Generate(numToSeed);

      var catIds = 1;
      CatFaker = new Faker<Cat>()
         .StrictMode(false)
         .UseSeed(5566)
         .RuleFor(d => d.Id, f => catIds++)
         .RuleFor(d => d.Name, f => f.Name.FirstName())
         .RuleFor(d => d.MeowLoudness, f => f.Random.Number(1, 10))
         .RuleFor(d => d.TailLength, f => f.Random.Number(1, 10))
         .RuleFor(d => d.BreedId, f => f.PickRandom(Breeds).Id);

      Cats = CatFaker.Generate(numToSeed);

      var dogIds = 1;
      DogFaker = new Faker<Dog>()
         .StrictMode(false)
         .UseSeed(7788)
         .RuleFor(d => d.Id, f => dogIds++)
         .RuleFor(d => d.Name, f => f.Name.FirstName())
         .RuleFor(d => d.BarkLoudness, f => f.Random.Number(1, 10))
         .RuleFor(d => d.TailLength, f => f.Random.Number(1, 10))
         .RuleFor(d => d.BreedId, f => f.PickRandom(Breeds).Id);

      Dogs = DogFaker.Generate(numToSeed);

      var catColorLineIds = 1;
      ColorCatLineFaker = new Faker<ColorCatLine>()
         .StrictMode(false)
         .UseSeed(9900)
         .RuleFor(d => d.Id, f => catColorLineIds++)
         .RuleFor(d => d.CatId, f => f.PickRandom(Cats).Id)
         .RuleFor(d => d.ColorId, f => f.PickRandom(Colors).Id);

      ColorCatLines = ColorCatLineFaker.Generate(numToSeed);
   }

   public static List<ColorCatLine> ColorCatLines { get; set; }
   public static List<Dog> Dogs { get; set; }
   public static List<Cat> Cats { get; set; }
   public static List<Breed> Breeds { get; set; }
   public static List<Color> Colors { get; set; }
}

Calling FakeData.Init() queues up the fake data in the static List<T> properties. I assume the accompanying EF code would translate into:

protected override void OnModelCreating(ModelBuilder modelBuilder)
{
   FakeData.Init();
   modelBuilder.Entity<Color>().HasData(FakeData.Colors);
   modelBuilder.Entity<Breed>().HasData(FakeData.Breeds);
   modelBuilder.Entity<Cat>().HasData(FakeData.Cats);
   modelBuilder.Entity<Dog>().HasData(FakeData.Dogs);
   modelBuilder.Entity<ColorCatLine>().HasData(FakeData.ColorCatLines);
}

As shown in the example above, FakeData.Cats can be used in your unit tests alone or be used to seed your database.

Notice: I'm using the .UseSeed(n) method in each Faker<T> setup. It's a good idea to set a specific seed for each Faker<T> object. It can be the same seed or a different one; it doesn't matter. What is important is that .UseSeed() sets up a localized random generator for each faker, so that when Init() is called we get a repeatable and deterministic sequence of fake data that is independent of the other Faker<T> objects.

2. Avoid hard coding unit tests to specific values

Typically you may want to do something like verify a specific cat comes back when you query for a specific breed.

Indeed, it does seem typical to want to do this. However, if you also want to keep your Bogus NuGet package up to date, you should avoid checking for specific hard-coded values that Bogus generates. This kinda goes back to suggestion No.1 above. When checking that a specific cat comes back when you query for a specific breed, you should be checking against FakeData.Cats[x] and FakeData.Breeds[y] instead of a hard-coded cat name like "Kody".
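To make the idea concrete, here is a minimal, self-contained sketch that takes its expectation from the precomputed data instead of a literal. It deliberately avoids Bogus and EF: the seeded System.Random stands in for FakeData.Init(), and CatIdsForBreed is a hypothetical stand-in for whatever query is under test. Treat it as an illustration of the testing style, not as the library's API.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Stand-in for FakeData.Init(): precompute seeded data once, up front.
var random = new Random(5566);
var breeds = Enumerable.Range(1, 5)
    .Select(id => (Id: id, Name: $"Breed{id}"))
    .ToList();
var cats = Enumerable.Range(1, 40)
    .Select(id => (Id: id, BreedId: breeds[random.Next(breeds.Count)].Id))
    .ToList();

// Hypothetical code under test: find the cats that belong to a breed.
static List<int> CatIdsForBreed(IEnumerable<(int Id, int BreedId)> source, int breedId) =>
    source.Where(c => c.BreedId == breedId).Select(c => c.Id).ToList();

// The expectation comes from the generated data (cats[7]), not from a
// hard-coded name or ID, so it survives a change in the generated values.
var someCat = cats[7];
var result = CatIdsForBreed(cats, someCat.BreedId);
Console.WriteLine(result.Contains(someCat.Id)); // True
```

If the generated values ever change (for example after a major version upgrade), the assertion still holds because both sides are derived from the same list.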

Bogus goes to great lengths to maintain deterministic behavior but there are some major ways that deterministic behavior can be "changed" or "disrupted". They are:

Bogus does consider deterministic sequence changes to be breaking changes, and they are usually indicated by major semantic versioning number increments like Bogus v22.*.* vs Bogus v23.*.*. So it's something to be aware of.

Lastly, No.1 and No.2 can be taken to the next level to avoid the side effects of adding a "new property" to a domain object. When you add a "new property" to an object, you're usually adding a new .RuleFor(x => x.NewProperty, ...) and, as a result, you are also changing the number of calls to the random generator, which offsets the entire sequence chain of pseudo-random numbers by one (or more) for every object after the first object is created. To avoid this issue and achieve even better deterministic behavior with Bogus, you can go a level deeper and set a seed value for each object being generated, not just at the Faker<T> level; see #104 for more info on how that can be done.
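The sequence shift described above can be illustrated without Bogus at all. The following is a minimal sketch using only System.Random; the helper names (TailLengthsShared, TailLengthsPerObject) and the baseSeed + index scheme are made up for illustration and are not the technique from #104 or any Bogus API.

```csharp
using System;
using System.Linq;

// One shared generator for all objects, like a single seeded faker.
// extraProperty simulates adding one more rule (one more draw per object).
static int[] TailLengthsShared(int seed, int count, bool extraProperty)
{
    var random = new Random(seed);
    return Enumerable.Range(0, count).Select(_ =>
    {
        var tail = random.Next(1, 11);          // TailLength in 1..10
        if (extraProperty) random.Next(1, 11);  // the hypothetical new property
        return tail;
    }).ToArray();
}

// A fresh generator per object (hypothetical baseSeed + index scheme).
static int[] TailLengthsPerObject(int baseSeed, int count, bool extraProperty)
{
    return Enumerable.Range(0, count).Select(i =>
    {
        var random = new Random(baseSeed + i);
        var tail = random.Next(1, 11);
        if (extraProperty) random.Next(1, 11);
        return tail;
    }).ToArray();
}

var before = TailLengthsShared(5566, 5, extraProperty: false);
var after = TailLengthsShared(5566, 5, extraProperty: true);
Console.WriteLine(string.Join(", ", before));
Console.WriteLine(string.Join(", ", after)); // only the first value is guaranteed to match

var stableBefore = TailLengthsPerObject(5566, 5, extraProperty: false);
var stableAfter = TailLengthsPerObject(5566, 5, extraProperty: true);
Console.WriteLine(stableBefore.SequenceEqual(stableAfter)); // True
```

With the shared generator, every object after the first draws from a shifted position once the extra property exists. With one generator per object, the extra draw only affects values generated later within that same object.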

I hope this helps. Feel free to reach out and continue the discussion if you have any more questions.

Thanks, Brian

:crown: :gem: "I know everything that shine ain't always gold..."

VictorioBerra commented 5 years ago

This is a huge help, thank you!

bchavez commented 5 years ago

Hi Victorio,

No problem. Please feel free to continue posting here and let me know if you get stuck again. I'm very interested in understanding the pain points or any friction you encounter while using Bogus and seeing if there's anything we can do to make Bogus easier for everyone.

Thanks, Brian

:beach_umbrella: :trumpet: Beach Boys - Good Vibrations (Nick Warren bootleg)

VictorioBerra commented 5 years ago

A couple random questions,

  1. why 1122, 3344, 5566, ...?
  2. Why does everyone use static classes/methods/properties for their seeding classes? I see this a lot.
  3. Sent you some bitcoin yesterday via the tipjar, thanks for all the help 😄

Tory

bchavez commented 5 years ago

Hi Tory,

Ah, so it was you! :sunglasses: Thank you very much for the BTC! :+1: :smiley_cat:

To answer your questions:

  1. why 1122, 3344, 5566, ...?

No reason in particular, other than it was just a nice pattern to set different seed values. Basically, all we're doing is setting up the internal random generator with a seed value. I don't know if you're familiar with the System.Random object, but check out the following code sample:

[image: linqpad_1911]

You can try writing the same code on your computer and you should get the exact same sequence as I do. So much for randomness, huh? :game_die: hehe. If you re-run your program over and over again, you should get the exact same values over and over again. This is really useful for testing because you have a deterministic and repeatable way to draw randomness in your program.
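For readers without the screenshot, here is a minimal sketch of the same idea using only System.Random. The Draw helper is a name made up for this illustration, and the exact numbers printed are intentionally not listed here because they depend on the generator's implementation; the point is only that two generators created with the same seed agree.

```csharp
using System;
using System.Linq;

// Draw `count` numbers in [1, 100) from a generator created with `seed`.
static int[] Draw(int seed, int count)
{
    var random = new Random(seed);
    return Enumerable.Range(0, count).Select(_ => random.Next(1, 100)).ToArray();
}

var first = Draw(1122, 5);
var second = Draw(1122, 5);

Console.WriteLine(string.Join(", ", first));
Console.WriteLine(string.Join(", ", second));   // same numbers as the line above
Console.WriteLine(first.SequenceEqual(second)); // True
```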

Now, let's extend this concept further with Bogus. When we call new Faker<T>().UseSeed(1122), we are "setting up" the seed value; we are effectively doing the following:

[image: linqpad_1912]

You can write the same exact code on your computer and you should hopefully get the same sequence of names (instead of numbers). If you re-run your Bogus program over and over again, you should get the same exact sequence of values in the same order as shown above and below.

So, when we set seed values for every Faker<T> object, we're mixing it up a bit, ensuring we have a reproducible and deterministic way of generating the same fake data every time our program starts up to create fake data.

Why does everyone use static classes/methods/properties for their seeding classes? I see this a lot.

I'm not sure. I don't know but the feeling I get is statics "fit the bill". I guess the typical case for fake data is you want some singleton (and deterministic) source of truth for your fake data inside your program. Static storage is a quick way to satisfy those requirements. You certainly don't have to use static if you don't want to.

Hope that helps!

Feel free to reach out if you have any more questions. And thanks again for the BTC! :+1:

Thanks, Brian

:black_circle: :dizzy: "Black magic... bla-a-a-ack... black magic..."

VictorioBerra commented 5 years ago

I think I am getting confused by "2. Avoid hard coding unit tests to specific values".

If I know I need to write a test that confirms a null value was successfully updated, do I:

1. Go back to my faker code and push a few lines of that specific value?

FakeData.cs


var catIds = 1;
CatFaker = new Faker<Cat>()
   .StrictMode(false)
   .UseSeed(5566)
   .RuleFor(d => d.Id, f => catIds++)
   .RuleFor(d => d.Name, f => f.Name.FirstName());

NullCatFaker = new Faker<Cat>()
   .StrictMode(false)
   .UseSeed(5566)
   .RuleFor(d => d.Id, f => catIds++)
   .RuleFor(d => d.Name, f => null);

Cats = CatFaker.Generate(numToSeed);
Cats.AddRange(NullCatFaker.Generate(numToSeed));

Tests.cs

[Fact]
public async Task Assign_Cat_Name_Success()
{
      // Arrange
      var client = _factory.CreateDefaultClient();
      CatAPI _api = RestClient.For<CatAPI>(client);

      var nullCat = FakeData.Cats.Single(x => x.Name == null);

      // Act
      var createResponse = await _api.AssignCatNameAsync(nullCat);

      // Assert
      Assert.NotNull(createResponse.Name);
}

2. Seed "live" in my test?

[Fact]
public async Task Assign_Cat_Name_Success()
{
      // Arrange
      var client = _factory.CreateDefaultClient();
      CatAPI _api = RestClient.For<CatAPI>(client);

      var nullCat = default(Cat);

      using (var scope = _factory.Server.Host.Services.CreateScope())
      {
          var context = scope.ServiceProvider.GetRequiredService<CatContext>();

          nullCat = new Cat()
          {
              Name = null
          };

          context.CatEntity.Add(nullCat); // Or Add(FakeData.NullCatFaker)
          context.SaveChanges();
      }

      // Act
      var createResponse = await _api.AssignCatNameAsync(nullCat);

      // Assert
      Assert.NotNull(createResponse.Name);
}

bchavez commented 5 years ago

Hi Tory,

Both are okay, but personally, my preference is toward No.2.

Here's how I would do it:

[Fact]
public async Task Assign_Cat_Name_Success()
{
   // Arrange
   var client = _factory.CreateDefaultClient();
   var api = RestClient.For<CatAPI>(client);

   var nullCat = FakeData.CatFaker.Clone()
      .RuleFor(c => c.Name, f => null)
      .Generate();

   SaveCat(nullCat);

   // Act
   var result = await api.AssignCatNameAsync(nullCat);

   // Assert
   Assert.NotNull(result.Name);
}

internal void SaveCat(Cat cat)
{
   using (var scope = _factory.Server.Host.Services.CreateScope())
   {
      var context = scope.ServiceProvider.GetRequiredService<CatContext>();

      context.CatEntity.Add(cat);
      context.SaveChanges();
   }
}

Notice, in the code above, I'm using FakeData.CatFaker.Clone() to clone the CatFaker into a special case, one-off, faker that can generate a cat with a null name specifically for this test. So, the code as shown below can be completely avoided:

      NullCatFaker = new Faker<Cat>()
         .StrictMode(false)
         .UseSeed(5566)
         .RuleFor(d => d.Id, f => catIds++)
         .RuleFor(d => d.Name, f => null);

For "one-off" special cases like this (such as checking whether a cat with a null name can be assigned one), you don't need to "precompute" anything up front. Just focus on what you want to test with a Faker<T>.Clone().

The main motivation for precomputing fake data up front is mostly for seeding your database with data. For example, if you need to check the retrieval of an object by ID, then:

Hope that helps! Feel free to reach out if you have any more questions. :+1:

Thanks, Brian

:walking_man: :walking_woman: Missing Persons - Walking in L.A.