faker-js / faker

Generate massive amounts of fake data in the browser and node.js
https://fakerjs.dev

Adding Random probability distribution function #1862

Open maxime4000 opened 1 year ago

maxime4000 commented 1 year ago

Clear and concise description of the problem

I'm seeding a database with faker. Some fields accept an array of a given type. I want to generate many arrays of different sizes: some empty, some with one element, and some with multiple elements.

Most cases will have one element in the array, but I also want to test edge cases, so a way to generate data that follows a random distribution would be nice.

const isEmpty = faker.datatype.boolean(); // ~50%
const isOneElement = faker.datatype.boolean(); // ~25%
const length = faker.datatype.number(100); // ~25%
const array = isEmpty
  ? []
  : isOneElement
    ? [getFakerFunction(field)]
    : Array.from({ length }, () => getFakerFunction(field));
return array;

Let's say I'm faking an array of values and I want some lengths to be more common than others. Arrays of length 1 to 3 are common, but an array of length 100 is very rare. I would like a random probability distribution function for this.

Suggested solution

In my case, I'm looking for a random exponential distribution.

The function would accept an argument like this:

type ExponentialDistributionOptions = {
  min?: number;
  max?: number;
  precision?: number;
  curveSettings: {
    deviation?: number;
    mean?: number;
    // ...
  };
};

It would generate a number using the named distribution. I would expect to call faker.random.exponentialDistribution({min: 0, max: 100, curveSettings: {...}}) and get numbers that are more likely to be close to 0 than to 100. Out of 1000 generated values, only a few would be close to 100.

I wouldn't limit the feature to the exponential distribution; I would also add the Gaussian, Rayleigh, and gamma distributions, etc.
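One way to read the request is as a true exponential distribution truncated to the requested range, sampled with the inverse CDF. The following is only a minimal sketch under that assumption: `truncatedExponential`, its `rate` parameter, and the injectable `random` source are hypothetical names for illustration and are not part of faker's API.

```typescript
/**
 * Hypothetical helper (not a faker function): draws a number from an
 * exponential distribution with the given rate, truncated to [min, max).
 * Higher rates concentrate the results closer to min.
 */
function truncatedExponential(
  min: number,
  max: number,
  rate: number,
  random: () => number = Math.random // assumed to return values in [0, 1)
): number {
  if (rate <= 0) {
    throw new RangeError('rate must be greater than 0');
  }
  if (max < min) {
    throw new RangeError('max must not be less than min');
  }

  const u = random();
  // Probability mass that an untruncated exponential puts on [0, max - min].
  const mass = 1 - Math.exp(-rate * (max - min));
  // Inverse CDF of the truncated distribution, shifted by min.
  return min - Math.log(1 - u * mass) / rate;
}

// Example: an array length between 0 and 100 that is usually small.
const exampleLength = Math.floor(truncatedExponential(0, 101, 0.5));
```

Passing the random source as a parameter keeps the sketch testable; in a faker context one could plausibly pass `() => faker.number.float()` so that seeding still applies.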

Alternative

No response

Additional context

I'm not sure whether this is out of scope for faker, but faker already generates data from random values. Why couldn't faker generate a number based on the probability of that number being generated?

Btw, I'm no mathematician, so my explanation might be incorrect, but I still think faker could add some random probability distribution functions.

ST-DDT commented 1 year ago

Do you mean something like this?

function exponentialDistributionNumber(start = 1, stepScale = 2, stepProbability = 0.5, limit = Number.MAX_SAFE_INTEGER) {
    let max = start;
    while(faker.datatype.boolean(stepProbability) && max < limit) {
        max *= stepScale;
    }
    return faker.number.int({ min: 0, max: Math.min(max, limit) });
}

Result occurrences for 1 million runs of exponentialDistributionNumber(1, 2, 0.5, 100):

0: 367108 1: 368619 2: 117775 3: 34374 4: 34582 5: 9489 6: 9445 7: 9374 8: 9518 9: 2549
10: 2571 11: 2515 12: 2571 13: 2442 14: 2549 15: 2482 16: 2511 17: 661 18: 677 19: 656
20: 660 21: 649 22: 672 23: 684 24: 651 25: 662 26: 612 27: 659 28: 654 29: 641
30: 653 31: 653 32: 692 33: 212 34: 204 35: 195 36: 178 37: 200 38: 178 39: 192
40: 212 41: 201 42: 212 43: 219 44: 189 45: 194 46: 203 47: 203 48: 209 49: 161
50: 210 51: 200 52: 199 53: 189 54: 196 55: 175 56: 196 57: 166 58: 199 59: 188
60: 191 61: 187 62: 192 63: 193 64: 185 65: 78 66: 73 67: 73 68: 90 69: 84
70: 63 71: 83 72: 87 73: 59 74: 73 75: 65 76: 70 77: 83 78: 91 79: 88
80: 72 81: 75 82: 80 83: 61 84: 73 85: 83 86: 78 87: 78 88: 68 89: 60
90: 77 91: 94 92: 82 93: 67 94: 68 95: 79 96: 79 97: 77 98: 76 99: 90
100: 85

[image]

Would something like this suffice or do you need more/something else?
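The stepped shape of those counts follows directly from the doubling loop, and the exact probabilities can be computed without sampling. The sketch below assumes the loop semantics of the snippet above (keep multiplying `max` with probability `stepProbability` while `max < limit`, then pick a uniform integer in `[0, min(max, limit)]`); `stepDistributionPmf` is a hypothetical name, and it assumes integer arguments with `stepScale > 1`.

```typescript
/**
 * Hypothetical helper: exact probability of each result 0..limit for the
 * doubling-based function above, assuming integer start/stepScale/limit.
 */
function stepDistributionPmf(
  start = 1,
  stepScale = 2,
  stepProbability = 0.5,
  limit = 100
): number[] {
  if (start <= 0 || stepScale <= 1) {
    throw new RangeError('start must be > 0 and stepScale must be > 1');
  }

  const pmf: number[] = new Array<number>(limit + 1).fill(0);
  let max = start;
  let reach = 1; // probability that the loop gets to this value of max

  while (true) {
    const canStep = max < limit;
    // The loop stops here if the coin flip fails or max already hit the limit.
    const stopHere = canStep ? reach * (1 - stepProbability) : reach;
    const upper = Math.min(max, limit);
    // The final draw is uniform over the integers 0..upper.
    const share = stopHere / (upper + 1);
    for (let v = 0; v <= upper; v++) {
      pmf[v] = (pmf[v] ?? 0) + share;
    }
    if (!canStep) {
      break;
    }
    reach *= stepProbability;
    max *= stepScale;
  }

  return pmf;
}

const pmf = stepDistributionPmf(1, 2, 0.5, 100);
console.log(pmf[0], pmf[2], pmf[100]);
```

With the arguments used above, the probability of 0 comes out at roughly 0.368 and the probability of 2 at roughly 0.118, which is in line with the 367108 and 117775 occurrences per million reported in the thread.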

maxime4000 commented 1 year ago

Interesting! Yes, something like this would suffice. It would be nice if it were implemented as an API function.

matthewmayer commented 1 year ago

Something similar could also be achieved by having a variant of faker.helpers.arrayElement where each element of the array has a fixed independent probability of being included in the return values

ST-DDT commented 1 year ago

> Something similar could also be achieved by having a variant of faker.helpers.arrayElement where each element of the array has a fixed independent probability of being included in the return values

Like helpers.weightedArrayElement? Well, not really, but close when used for the length.
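To illustrate what "used for the length" could look like, here is a standalone sketch of a weighted pick over candidate array lengths. It does not call faker; `weightedPick`, `WeightedEntry`, and the weights are hypothetical and only mirror the `{ weight, value }` idea of a weighted array element helper.

```typescript
interface WeightedEntry<T> {
  weight: number;
  value: T;
}

/**
 * Hypothetical helper: returns one value, chosen with probability
 * proportional to its weight.
 */
function weightedPick<T>(
  entries: ReadonlyArray<WeightedEntry<T>>,
  random: () => number = Math.random // assumed to return values in [0, 1)
): T {
  const total = entries.reduce((sum, entry) => sum + entry.weight, 0);
  if (total <= 0) {
    throw new RangeError('entries must contain a positive total weight');
  }

  let threshold = random() * total;
  let picked: WeightedEntry<T> | undefined;
  for (const entry of entries) {
    picked = entry; // also serves as a fallback for rounding errors
    threshold -= entry.weight;
    if (threshold < 0) {
      break;
    }
  }

  if (picked === undefined) {
    throw new RangeError('entries must not be empty');
  }
  return picked.value;
}

// Illustrative weights: short arrays are common, very long ones are rare.
const lengthOptions: WeightedEntry<number>[] = [
  { weight: 50, value: 0 },
  { weight: 25, value: 1 },
  { weight: 20, value: 3 },
  { weight: 5, value: 100 },
];

const pickedLength = weightedPick(lengthOptions);
const sampleArray = Array.from({ length: pickedLength }, (_, index) => index);
```

This gives full control over each length's probability, but every length that should be possible has to be listed explicitly, which is where it differs from a continuous distribution.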

xDivisionByZerox commented 1 year ago

Team decision

There is an existing workaround for this problem. We are currently unsure about implementation details regarding the distribution.

If you want/need this feature please upvote this issue.

github-actions[bot] commented 1 year ago

Thank you for your feature proposal.

We marked it as "waiting for user interest" for now to gather some feedback from our community.

ST-DDT commented 10 months ago

Here is an improved version of the function:


import { faker, FakerError } from '@faker-js/faker';

/**
 * Generates a random number between min and max using an exponential distribution.
 * The lower bound is inclusive, but the upper bound is exclusive.
 *
 * @param options The options for generating the number.
 * @param options.min The minimum value to generate. Defaults to `0`.
 * @param options.max The maximum value to generate. Defaults to `1`.
 * @param options.bias The bias of the distribution. Must be greater than 0. Defaults to 1.
 * The lower the bias, the more likely the number will be closer to the min (0-1@0.1 -> avg: ~0.025).
 * A bias of 1 will generate the default exponential distribution (0-1@1 -> avg: ~0.202).
 * The higher the bias, the more likely the number will be closer to the max (0-1@10 -> avg: ~0.691).
 *
 * @throws If bias is less than or equal to 0.
 * @throws If max is less than min.
 */
function exponentialDistributionNumber(
  options:
    | number
    | {
        /**
         * The minimum value to generate.
         *
         * @default 0
         */
        min?: number;
        /**
         * The maximum value to generate.
         *
         * @default 1
         */
        max?: number;
        /**
         * The bias of the distribution. Must be greater than 0.
         *
         * The lower the bias, the more likely the number will be closer to the min (0-1@0.1 -> avg ~0.025).
         * A bias of 1 will generate the default exponential distribution (0-1@1 -> avg ~0.202).
         * The higher the bias, the more likely the number will be closer to the max (0-1@10 -> avg ~0.691).
         *
         * @default 1
         */
        bias?: number;
      }
) {
  if (typeof options === 'number') {
    options = { max: options };
  }

  const { min = 0, max = 1, bias = 1 } = options;

  if (bias <= 0) {
    throw new FakerError('Bias must be greater than 0');
  }

  if (max === min) {
    return min;
  }

  if (max < min) {
    throw new FakerError(`Max ${max} should be greater than min ${min}.`);
  }

  const random = faker.number.float(); // [0,1)
  const exponent = random ** (1 / bias); // [0,1)
  const range = max - min + 1; // +1 to account for x ** 0 = 1
  return min + range ** exponent - 1; // -1 to account for x ** 0 = 1
}

Generating 100kk values between 0-100:

[image]
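The effect of `bias` can also be checked without sampling by integrating the mapping `min + (max - min + 1) ** (u ** (1 / bias)) - 1` over `u` in `[0, 1)`. The sketch below does this with a midpoint rule; `biasedValue` and `approximateMean` are hypothetical names that only restate the formula from the function above. For a bias of 1 the mean has the closed form `min + (max - min) / ln(max - min + 1) - 1`, which is about 20.7 for the range 0-100.

```typescript
/** Applies the mapping from the function above to a given u in [0, 1). */
function biasedValue(u: number, min = 0, max = 1, bias = 1): number {
  const exponent = u ** (1 / bias);
  const range = max - min + 1; // +1 to account for x ** 0 = 1
  return min + range ** exponent - 1;
}

/** Approximates the mean of the mapping with a midpoint rule over u. */
function approximateMean(min = 0, max = 1, bias = 1, steps = 100000): number {
  let sum = 0;
  for (let i = 0; i < steps; i++) {
    sum += biasedValue((i + 0.5) / steps, min, max, bias);
  }
  return sum / steps;
}

// A lower bias pulls the mean toward min, a higher bias toward max.
for (const bias of [0.1, 1, 10]) {
  console.log(`bias ${bias}: mean ~ ${approximateMean(0, 100, bias).toFixed(2)}`);
}
```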