faker-js / faker

Generate massive amounts of fake data in the browser and node.js
https://fakerjs.dev
Other
12.56k stars 903 forks source link

Deterministic values #1413

Closed KevinMind closed 1 year ago

KevinMind commented 2 years ago

Clear and concise description of the problem

We use fake in several different contexts. It is an amazing tool. However in some contexts, it is valuable to have fully fully idempotent values returned from faker functions. For example:

To be utterly clear, when I say idempotent I mean that the faker function should return the exact same value every time it is called with the same parameters. Essentially we set the randomness to 0

Suggested solution

I could imagine a few solutions, but perhaps the most simple would be to either consider faker to be in one mode or another globally. We could do this with an environment level feature flag, FAKER_IDEMPOTENT_MODE=true.

This flag can be stored in the global faker instance and used to control the logic of modules.

We could implement the underlying logic in a couple of ways:

Using this approach we can also implement idempotent modules over time, we don't need to do it in a big bang. When the flag is enabled we could add a warning log that this feature is under development, and that only certain modules have been implemented with idempotent mode. If a module is not yet implemented, it will return the normal module/method.

Given a file script.js

import {faker} from '@faker-js/faker';

console.log(faker.datatype.number({min:0, max:100});

Running normally you get a random value, in this case 42.

node script.ts // 42

With idempotent mode, given the logic I've defined for how to implement idempotence we should get the average of the min/max values which would always be 50.

FAKER_IDEMPOTENT_MODE=true node script.ts // 50

Alternative

Another solution that is perhaps more elegant, but much more complex would be to introduce a concept of randomness to faker. You could specify on each call, or perhaps globally the level of randomness faker should use.

faker.setRandomness(0.2)

This value could then be used to determine the precise level of randomness to introduce into the underlying methods. Since we don't have true randomness in CS we could theoretically do this, though I imagine it would be very difficult and for my particular use case is well beyond over-engineering.

One final solution to this might be to use faker when you want random values and static values when you want consistent values. This solves the problem at the wrong level of the stack though. Most people who use faker wrap it in functions that generate objects of specific types (it is literally the first example of the docs) and needing to switch between these functions and some other data source on the fly is a lot of complexity that would be much better integrated into faker itself.

Additional context

I would be more than happy to work on this feature as I would definitely benefit from having it!

I'm not 100% familiar with how all of faker is implemented and I imagine there might be some edge cases I'm missing where idempotency might be tricky. For example, how would this feature interact with the concept of seeding? What if you want to mix idempotent and dynamic values, for example when generating a list of Users where the name should be random but the age should be the same?

Shinigami92 commented 2 years ago

I might understand somewhat what you are going for, but anyways: Could you explain the total difference between your proposal and using faker.seed()?

KevinMind commented 2 years ago

@Shinigami92, using the example from the docs,

faker.seed(123);

const firstRandom = faker.datatype.number();

// Setting the seed again resets the sequence.
faker.seed(123);

const secondRandom = faker.datatype.number();

console.log(firstRandom === secondRandom);

In this case I can guarantee that firstRandom and secondRandom are equal to eachother, but I cannot guarantee what the actual value would be.

If my proposal existed, I would be able to do something like

const deterministicNumber = faker.datatype.,number({min: 0, max: 100});

console.log(deterministicNumber === 50);

And I would be able to tell you exactly what the number would be, literally.

To double down on the benefit of such a reality existing, I could write a story like.


function Counter({count}) {
   return <div className="my-stylistic-genius">{count}</div>;
}

export default {
  component: Counter,
}

export const Default = {
   args: {
    count: faker.datatype.number(),
  },
};

I could load this story using normal faker mode and see how the component renders with infinite different numbers, but if I want to run a regression test against changes to the styles, I would want the count to be the same, so that there would be no pixels influenced by the prop. In this case, I can just write the story with normal faker values, and when I need literal consistency, aka idempotence, I can just flip it on.

ST-DDT commented 2 years ago

Why is it important that the value is 50 instead of the "same as last time"?

Please note that even with your suggestion of idempotent mode the result would potentially be different if you update faker, because of updates to the underlying locale data. If you need the exact same values, you have to store them somewhere yourself (you could potentially use some kind of mocking/spying to populate the file and read from it later on). Please note that especially values in the middle are less likely to hit corner cases (such as too short or too long strings). Also generated alpha(numeric) strings might look weird. E.g. wwwwwwwwwww.

KevinMind commented 2 years ago

Great questions and alternatives @ST-DDT.

Why is it important that the value is 50 instead of the "same as last time"?

The same as last time is good if you want to be able to recall the same value within a single "session" spanning the length of the seed. So yeah, seeds are great.

However, I want to be able to basically make a seed that lasts across multiple sessions. I want to be able to run visual testing tools using faker and be able to ensure that if I run the same test in 2 months, it will give me back the same data. I don't want to have to manually write a bunch of static mock data also because faker literally exists to avoid doing that.

I totally understand the drawbacks of this solution. It's not perfect. You will have to pass some values statically sometimes, but with this we can minimize those values to only the critical edge case ones like you mentioned.

Problems with mocking with files

In the real world trying to implement the mocking/spying to populate the file and read from it later on doesn't work very well.

  1. A lot of manual work
  2. prone to error
  3. difficult to enforce type safety on static files (json)
  4. difficult to know when to update the file or not.

I've been looking into this as a solution, but each time I look at it, it seems like just making faker return static values solves the whole problem.

Another alternative

Would another approach be offering a hook into faker itself? Let me define for faker how to create a faker.datatype.number. This gives me the behaviour I'm looking for and means none of the logic for determining what values get returned need to be implemented or maintained within faker.

We could wrap each faker method in a dot method mockImplementation or something like that. This would accept a function with the same signature as the initial function and when defined would replace that methods implementation. Maybe there would also be a clearMock or something. IDK.

I imagine it would look something like...

import {faker} from '@faker-js/faker';

faker.datatype.number.mockImplementation((options) => {
  if (typeof options === 'number') {
    return options / 2;
  }
  if (options?.min && options?.max) {
    const range = options.max - options.min;
    return options.max - (range / 2);
  }
  if (options?.max) {
    return options.max / 2;
  }
  if (options?.min) {
    return options.min * 2;
  }
  return 10;
});

faker.datatype.number(2); // 1
faker.datatype.number({min:0, max: 100}); // 50
faker.datatype.number({min: 10}); // 20
faker.datatype.number({max: 10}); // 5

@ST-DDT can you point out some of your concerns if something like either of these two proposals were implemented? What are the downsides I'm missing?

Shinigami92 commented 2 years ago

if I run the same test in 2 months

I still don't understand why you don't just use faker.seed(1234) and don't update your version of faker :thinking: It will be exact the same values in two month as today (expect you add method calls to faker in between your code lines which obviously calls mersenne under the hood)

We wont implement something like mockImplementation into faker itself, as there are already libraries for mocking out there you can use today

ST-DDT commented 2 years ago

Just for clarification purposes.

faker.seed(1337); // Assuming 1337 to be a hardcoded value
faker.name.firstName(); // => Devyn

Will result in 100% reproducible results (if you use the same faker version e.g. 7.5.0.) Try it out. It does not matter where you run it or when, it will always result in the same result. As long as you execute the faker methods in the same order, you will get the same result.

We have test suits to ensure we always get predictable results: fakerjs test snapshots See also: https://vitest.dev/guide/snapshot.html

You can try them as well:

As for your proposal overwriting the methods. In v8 (development will start soon) we plan to modularize faker some more including the ability to add or omit certain parts to/from faker. You can use that to create your own faker instance that returns the values you need.

https://github.com/faker-js/faker/milestone/11

Does that help you?

KevinMind commented 2 years ago

Hi. So after digging deeper into seed() it seems it is better suited than I had initially though, but still there are problems.

Here is a sample repo I setup: https://github.com/Shopify/faker-seed-test

I added chromatic and storybook to test visual snapshots using faker functions to populate the data. When there is a chromatic=true URL param then the faker seed is set before storybook initializes which should set a global seed. This works if there are no changes to the number or order of faker calls in the whole project

image

However, if you change any of the faker calls, or even their order, you end up changing all of the data, which causes all snapshots to fail.

image image

I think what I need is to set a new faker seed for each story, so that only if the calls to fake in that story change, the data is affected. This is not so easy to implement though it seems as many times the calls to faker are before the story even renders. There are also a lot of different places where a call to faker might happen, and even one change can totally throw off all the data.

Any ideas on how I could work around this issue with existing functionality in faker?

KevinMind commented 2 years ago

Tried some more approaches and maybe found one that works. If I wrap every story definition in a function that sets the seed then I can change the order of stories, and add any number of extraneous faker calls without really breaking the data produced. Seems to work. I will continue trying to break it but it seems indeed that faker.seed() can probably help me achieve what I'm looking to achieve. I'll report back any issues I run into tomorrow.

KevinMind commented 2 years ago

@ST-DDT Can you correct me if I'm wrong, using a seed should result in 100% deterministic data? Does that include for date/time calls? Looks like when calling for dates or git methods that use date-time strings return different values each time. Is this expected?

image image
Shinigami92 commented 2 years ago

[...] Does that include for date/time calls? Looks like when calling for dates or git methods that use date-time strings return different values each time. Is this expected?

Currently yes, you need to pass a refDate otherwise the reference Date would be now.

KevinMind commented 1 year ago

It seems the only method which is inconsolably random is faker.git.commitEntry which includes a random date in the string. If this method allowed for a date method, we would be 100% good to go. I could have a look and open a PR if you are not against such an option?

ST-DDT commented 1 year ago

Yes, the git commit method should have an option to use a fixed ref date in order to allow for reproducible results. Your contribution will be appreciated.


We are also considering adding a faker.fork method to simplify creating deterministic value sets (e.g. multiple persons).

See this comment for details: https://github.com/faker-js/faker/issues/627#issuecomment-1070650803

KevinMind commented 1 year ago

@ST-DDT I've created a branch with the feat/fix (not sure if this should count as a bug fix or a new feature)

But I cannot push to the repo. I didn't read anything in the contributing guide about needing to fork for contributions. Am I missing something? Could you point me in the right direction?

kevinmeinhardt@kevins-mbp faker %     git push --set-upstream origin fix-add-refdate-to-git-commitentry
remote: Permission to faker-js/faker.git denied to KevinMind.
fatal: unable to access 'https://github.com/faker-js/faker.git/': The requested URL returned error: 403
kevinmeinhardt@kevins-mbp faker % 

Is what I get in the terminal. Some kind of auth issue it seems.

ST-DDT commented 1 year ago

But I cannot push to the repo. I didn't read anything in the contributing guide about needing to fork for contributions.

You have to fork the repository in order to propose changes. You can delete the fork after the changes have been merged.

Would you like to create a second PR that extends the contributing guide?

ST-DDT commented 1 year ago

@KevinMind Do you consider this fixed once #1512 is merged, or is here more to do?

KevinMind commented 1 year ago

I'd like to run another check across all the methods to ensure I can get 100% deterministic values with a specific seed set but in principle yes.

KevinMind commented 1 year ago

@ST-DDT I opened a new task as there were a few things I think I could add to the contributing guide and wanted to be able to clarify before opening a PR. https://github.com/faker-js/faker/issues/1513

xDivisionByZerox commented 1 year ago

I'll consider this fixed and close the issue. If you think otherwise feel free to reopen.