bentonam / fakeit

Generates JSON documents based on models defined in YAML and adds them to a Couchbase Bucket
MIT License

FakeIt Data Generator

Utility that generates fake data in json, yaml, yml, cson, or csv formats based on models which are defined in yaml. Data can be generated using any combination of FakerJS, ChanceJS, or Custom Functions.

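Generated data can be output in the following formats and destinations: json, csv, yaml, yml, or cson, written to the console, a directory (optionally as a zip archive), a Couchbase server, or a Couchbase Sync Gateway server.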

Install

npm install fakeit --save-dev
# or
npm install fakeit --global

CLI Usage

  Usage: fakeit [command] [<file|directory|glob> ...]

  Commands:

    console [options]                                          outputs the result to the console
    couchbase [options]                                        This will output to couchbase
    sync-gateway [options]                                     no idea
    directory|folder [options] [<dir|file.zip>] [<models...>]  Output the file(s) into a directory
    help

  Options:

    -h, --help           output usage information
    -V, --version        output the version number
    --root <directory>   Defines the root directory from which paths are resolve from (process.cwd())
    --babel <glob>       The location to the babel config (+(.babelrc|package.json))
    -c, --count <n>      Overrides the number of documents to generate specified by the model. Defaults to model defined count
    -v, --verbose        Enables verbose logging mode (false)
    -S, --no-spinners    Disables progress spinners
    -L, --no-log         Disables all logging except for errors
    -T, --no-timestamp   Disables timestamps from logging output
    -f, --format <type>  this determines the output format to use. Supported formats: json, csv, yaml, yml, cson. (json)
    -n, --spacing <n>    the number of spaces to use for indention (2)
    -l, --limit <n>      limit how many files are output at a time (100)
    -x, --seed <seed>    The global seed to use for repeatable data

Models

All data is generated from one or more YAML files. Models are defined similarly to how models are defined in Swagger, with the addition of a few more properties that are used for data generation:

The following keys are used at the root of a model. Any key that is not marked as required is optional.

name (required)

The name of the model

type

The data type of the model to be generated. This needs to be set at the top level, as well as on a per property/items basis. It determines the starting data type, and how the result of the build loop will be converted once complete.

Note: If type isn't set it defaults to 'null'.

Available types

| Types | Starting value | Description |
| --- | --- | --- |
| number, long, integer | 0 | Converts the result to a number using parseInt |
| double, float | 0 | Converts the result to a number using parseFloat |
| string | '' | Converts the result to a string using result.toString() |
| boolean, bool | false | Converts the result to a boolean if it's not already one; if the result is a string and is 'false', '0', 'undefined', or 'null' it will return false |
| array | [] | Returns the result from the build loop |
| object, structure | {} | Returns the result from the build loop |
| null, undefined, * (anything else) | null | Returns the result from the build loop |

Places where it can be set

name: Types example
# typically object or array
type: object
key:
  data:
    build: faker.random.uuid()
properties:
  foo:
    # can be set on properties of an object
    type: object
    properties:
      bar:
        # can be set on nested properties
        type: string
        data:
          value: FakeIt ftw
  bar:
    type: array
    items:
      # can be set on items
      type: string
      data:
        min: 1
        max: 10
        build: faker.random.word()
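
The conversion rules listed under Available types can be sketched in plain JavaScript. This is only an illustration of the table, not FakeIt's actual implementation; `convertType` and `FALSE_STRINGS` are hypothetical names introduced for this example.

```javascript
// Strings that the table says should convert to `false` for boolean types.
const FALSE_STRINGS = ['false', '0', 'undefined', 'null'];

// Hypothetical helper: applies the conversion described for each type.
function convertType(type, result) {
  switch (type) {
    case 'number':
    case 'long':
    case 'integer':
      return parseInt(result, 10);
    case 'double':
    case 'float':
      return parseFloat(result);
    case 'string':
      return result.toString();
    case 'boolean':
    case 'bool':
      if (typeof result === 'string' && FALSE_STRINGS.includes(result)) {
        return false;
      }
      return Boolean(result);
    default:
      // array, object, structure, null, undefined, and anything else:
      // the result of the build loop is returned untouched.
      return result;
  }
}

console.log(convertType('integer', '42.9')); // 42
console.log(convertType('bool', 'null')); // false
```

Note that `parseInt` truncates rather than rounds, so an integer property whose build function returns `'42.9'` ends up as `42`.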

data

This is the main data object. It uses the same properties in several different situations.

The following keys can only be defined in the top level data object

key (required)

This determines the name of the document that's being generated. It only needs to be defined once per document. This is a reference to a generated property and is used for the filename or Document ID. If the key is an object, it needs the data option described above, and it will only work with value, build, and fake because the key is resolved after the document has been built. If the key is a string, the string is used as a path to look up the value in the document that was just built (using the lodash get method).

Examples of setting a key

In this example, after each document is built it will look for the _id property and return its result (aka user_1, user_2, etc.)

name: Key String Example
type: object
key: _id
data:
  pre_run: |
    globals.user_counter = 0;
properties:
  _id:
    type: string
    description: The document id
    data:
      post_build: "`user_${this.user_id}`"
  user_id:
    type: integer
    description: The users id
    data:
      build: ++globals.user_counter

In this example the key will be 'user_' + the current user_id (aka user_1, user_2, etc.)

name: Key Object Example
type: object
key:
  data:
    build: "`user_${this.user_id}`"
data:
  pre_run: |
    globals.user_counter = 0;
properties:
  user_id:
    type: integer
    description: The users id
    data:
      build: ++globals.user_counter
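
The two ways of defining a key can be sketched as follows. This is a minimal illustration under stated assumptions, not FakeIt's own code: `getPath` is a small hypothetical stand-in for the lodash get method, `resolveKey` is a hypothetical name, and `build` is written as a real JavaScript function here, whereas a model defines it as a string of code.

```javascript
// Hypothetical stand-in for lodash `get`: follows a dotted path through an object.
function getPath(object, path) {
  return path
    .split('.')
    .reduce((value, part) => (value == null ? undefined : value[part]), object);
}

// Hypothetical helper: resolves the document name from a string or object key.
function resolveKey(key, document) {
  if (typeof key === 'string') {
    // String key: look the value up in the document that was just built.
    return getPath(document, key);
  }
  // Object key: only `value` and `build` are handled in this sketch.
  const { value, build } = key.data;
  if (value !== undefined) return value;
  return build.call(document); // `this` is the built document
}

const stringKey = '_id';
const objectKey = {
  data: {
    build() {
      return `user_${this.user_id}`;
    },
  },
};

console.log(resolveKey(stringKey, { _id: 'user_1', user_id: 1 })); // user_1
console.log(resolveKey(objectKey, { user_id: 2 })); // user_2
```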

seed

If a seed is defined, the generated documents will be repeatable. The seed does not affect how many documents are created: if you have a model with a data range of 2-10, a random number of documents between 2 and 10 will be created no matter what the seed is. Let's say that 4 documents are generated the first time you run the model. Each of those documents will be completely different from the next (as expected). Later you come back and generate the data again, and this time it might generate 6 documents. The first 4 documents generated the second time will be exactly the same as the 4 documents from the first run. The seed can be a number or a string.

Note:

This only works if you use faker and chance to generate your random fake data. The same behavior can be achieved with other fake data generation libraries if they support seeds.

faker.date functions will not produce the same fake data each time.
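
The prefix behavior described above (a longer run repeats the documents of a shorter run) follows from how seeded random number generators work. The sketch below illustrates it with mulberry32, a small public-domain generator swapped in for illustration only. FakeIt seeds faker and chance instead, and `generateDocuments` is a hypothetical name. Unlike FakeIt, this sketch only accepts numeric seeds.

```javascript
// mulberry32: a tiny seeded pseudo-random number generator.
// The same seed always yields the same sequence of numbers in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function next() {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical generator: builds `count` documents from a seeded sequence.
function generateDocuments(seed, count) {
  const random = mulberry32(seed);
  return Array.from({ length: count }, () => ({
    score: Math.floor(random() * 1000),
  }));
}

const firstRun = generateDocuments(123, 4);
const secondRun = generateDocuments(123, 6);

// The first 4 documents of the second run match the first run exactly.
const samePrefix =
  JSON.stringify(firstRun) === JSON.stringify(secondRun.slice(0, 4));
console.log(samePrefix); // true
```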

Functions

For any function defined above, be sure to use | for multi-line functions and NOT >. For an in-depth explanation, see this issue.

Each of these functions is passed the following variables that can be used at the time of its execution:

For pre_run and post_run, the this context refers to the current model. For pre_build, build, and post_build, the this context refers to the object currently being generated. If a nested object is being created (inside an array, for example), this will refer to the closest object, not the outer object/array.
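
To make the this context more concrete, here is a rough sketch of how a string of code from a model could be turned into a function and called with a chosen this. It is an assumption-laden illustration rather than FakeIt's implementation: `compile` is a hypothetical helper, only a `globals` variable is passed in, and the rule that a single-line body is returned automatically is modeled on the "returned just like arrow functions" comment in the properties example.

```javascript
// Hypothetical helper: turns a code string into a function that receives `globals`.
function compile(source) {
  const body = source.trim();
  const isSingleLine = !body.includes('\n') && !/^return\b/.test(body);
  // Single-line bodies are wrapped so their value is returned;
  // a trailing semicolon is dropped to keep the wrapped expression valid.
  const code = isSingleLine ? `return (${body.replace(/;\s*$/, '')});` : body;
  return new Function('globals', code);
}

const globals = {};
const model = { name: 'Users', data: {} };
const doc = {};

// pre_run: `this` is the model.
compile('globals.user_counter = 0;').call(model, globals);

// build: `this` is the object currently being generated.
doc.user_id = compile('++globals.user_counter').call(doc, globals);

// post_build: can read values that were built earlier on the same object.
doc._id = compile('`user_${this.user_id}`').call(doc, globals);

console.log(doc); // { user_id: 1, _id: 'user_1' }
```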

Example users.yaml Model

name: Users
type: object
key:
  data:
    build: "`user_${this.user_id}`"
data:
  min: 200
  max: 500
  pre_run: |
    globals.user_counter = 0;
properties:
  user_id:
    description: The users id
    data:
      build: faker.random.uuid()
  name:
    description: The users first name
    data:
      fake: '{{name.firstName}}'
  last_name:
    description: The users last name
    data:
      fake: '{{name.lastName}}'
  username:
    description: The users username
    data:
      fake: '{{internet.userName}}'
  password:
    description: The users password
    data:
      fake: '{{internet.password}}'
  email:
    description: The users email address
    data:
      fake: '{{internet.email}}'
  phone:
    description: The users mobile phone
    data:
      fake: '{{phone.phoneNumber}}'
      post_build: this.phone.replace(/x[0-9]+$/, '')

Results in the following

{
  "user_id": "4d9ec95c-f45d-42f4-9d32-4ac81d83f95b",
  "name": "Sandy",
  "last_name": "Turner",
  "username": "Zella61",
  "password": "gi7NVXsUoARHhyU",
  "email": "Buck_Cormier@hotmail.com",
  "phone": "715.612.8609"
}
{
  "user_id": "7f513d5b-f944-4a80-b52a-4876627368b7",
  "name": "Duane",
  "last_name": "VonRueden",
  "username": "Mafalda92",
  "password": "3uXo4hFZJTdf1hp",
  "email": "Rickie_Braun@hotmail.com",
  "phone": "(356) 009-7477 "
}
...etc

properties

This is used to define the properties of an object.

Each key inside of properties will be a part of the generated object. Each of the keys uses the following properties to build its value.

name: test
key:
  data:
    build: faker.random.uuid()
type: object
properties:
  id:
    data:
      build: faker.random.uuid()
  title:
    type: string
    description: The main title to use
    data:
      # single line is returned just like arrow functions
      build: |
        faker.random.word()
  phone:
    type: object
    # This can be nested under another key
    properties:
      home:
        type: string
        data:
          # this will also be returned
          build: faker.phone.phoneNumber().replace(/x[0-9]+$/, '')
      work:
        type: string
        data:
          # this will also be returned
          build: chance.bool({ likelihood: 35 }) ? faker.phone.phoneNumber().replace(/x[0-9]+$/, '') : null

This will return an object like this

{
  "id": "4ce4da5c-0614-47d3-8fd6-3614c5461830",
  "title": "alliance",
  "phone": {
    "home": "(949) 194-3347",
    "work": "314-939-0541"
  }
}
{
  "id": "a649bbec-d629-4594-8fc8-ae34d97811a2",
  "title": "Unbranded",
  "phone": {
    "home": "012-296-9810",
    "work": null
  }
}

etc...

items

This is used to define how each item in an array is built. It uses the same structure as properties does, but it will return an array of values.

name: Array example
key:
  data:
    build: faker.random.uuid()
type: object
properties:
  keywords:
    type: array
    description: An array of keywords
    items:
      type: string
      data:
        min: 3
        max: 10
        build: faker.random.word()
  # You can also create a array of objects
  phones:
    type: array
    description: An array of phone numbers
    items:
      type: object
      data:
        min: 1
        max: 3
      properties:
        cell:
          type: string
          data:
            build: faker.phone.phoneNumber().replace(/x[0-9]+$/, '')
        home:
          type: string
          data:
            build: chance.bool({ likelihood: 45 }) ? faker.phone.phoneNumber().replace(/x[0-9]+$/, '') : null
        work:
          type: string
          data:
            build: chance.bool({ likelihood: 10 }) ? faker.phone.phoneNumber().replace(/x[0-9]+$/, '') : null

This will return an object like this

{
  "keywords": [ "GB", "Sports", "redundant", "Plastic" ],
  "phones": [
    {
      "cell": "(555) 555 - 5555",
      "home": "(666) 666 - 6666",
      "work": null
    },
    {
      "cell": "(777) 777 - 7777",
      "home": null,
      "work": "(888) 888 - 8888"
    }
  ]
}
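
The role of min and max on items can be sketched as follows: a count is picked between the two bounds, and the item's build step runs once per entry. This is a simplified illustration, not FakeIt's build loop. `randomInt` and `buildItems` are hypothetical names, `Math.random` is used instead of a seeded generator, and the bounds are assumed to be inclusive.

```javascript
// Returns a random integer between min and max (both inclusive).
function randomInt(min, max) {
  return Math.floor(Math.random() * (max - min + 1)) + min;
}

// Hypothetical helper: builds an array with between `min` and `max` items.
function buildItems({ min, max }, build) {
  const count = randomInt(min, max);
  return Array.from({ length: count }, (_, index) => build(index));
}

// A stand-in for `build: faker.random.word()` that needs no dependencies.
const keywords = buildItems({ min: 3, max: 10 }, (index) => `word_${index}`);

// Arrays of objects work the same way: each item is built as a whole.
const phones = buildItems({ min: 1, max: 3 }, () => ({
  cell: '(555) 555 - 5555',
  home: Math.random() < 0.45 ? '(666) 666 - 6666' : null,
}));

console.log(keywords.length, phones.length);
```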

Model References

It can be beneficial to define definitions that can be referenced one or more times throughout a model. This can be accomplished by using the $ref: property. Consider the following example:

contacts.yaml

name: Contacts
type: object
key: contact_id
data:
  min: 1
  max: 4
properties:
  contact_id:
    data:
      build: "chance.guid()"
  details:
    schema:
      $ref: '#/definitions/Details'
  phones:
    type: array
    items:
      $ref: '#/definitions/Phone'
      data:
        min: 1
        max: 4
  emails:
    type: array
    items:
      $ref: '#/definitions/Email'
      data:
        min: 0
        max: 3
  addresses:
    type: array
    items:
      $ref: '#/definitions/Address'
      data:
        min: 0
        max: 3
definitions:
  Email:
    data:
      build: "faker.internet.email()"
  Phone:
    type: object
    properties:
      phone_type:
        data:
          build: "faker.random.arrayElement([ 'Home', 'Work', 'Mobile', 'Main', 'Other' ])"
      phone_number:
        data:
          build: "faker.phone.phoneNumber().replace(/x[0-9]+$/, '')"
      extension:
        data:
          build: "chance.bool({ likelihood: 20 }) ? chance.integer({min: 1000, max: 9999}).toString() : ''"
  Address:
    type: object
    properties:
      address_type:
        data:
          build: "faker.random.arrayElement([ 'Home', 'Work', 'Other' ]);"
      address_1:
        data:
          # This uses es6 and only works if your project already has it installed or you're on node 6+
          build: "`${faker.address.streetAddress()} ${faker.address.streetSuffix()}`"
      address_2:
        data:
          build: "chance.bool({ likelihood: 35 }) ? faker.address.secondaryAddress() : ''"
      city:
        data:
          build: "faker.address.city()"
      state:
        data:
          build: "faker.address.stateAbbr()"
      postal_code:
        data:
          build: "faker.address.zipCode()"
      country:
        data:
          build: "faker.address.countryCode()"
  Details:
    type: object
    properties:
      first_name:
        data:
          fake: "{{name.firstName}}"
      last_name:
        data:
          build: "return chance.bool({ likelihood: 70 }) ? faker.name.lastName() : ''"
      company:
        type: string
        description: The contacts company
        data:
          build: "return chance.bool({ likelihood: 30 }) ? faker.company.companyName() : ''"
      job_title:
        type: string
        description: The contacts job_title
        data:
          build: "return chance.bool({ likelihood: 30 }) ? faker.name.jobTitle() : ''"

For this model we used 4 references: #/definitions/Details, #/definitions/Phone, #/definitions/Email, and #/definitions/Address.

These could have been defined inline, but that would make the model definition harder to read, and each of these definitions can be reused. References are processed and included before a model is run and its documents are generated.
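
Reference resolution of this kind can be sketched in a few lines. The code below is only an illustration of the idea, not how FakeIt processes references. `resolveRefs`, `deepMerge`, and `isObject` are hypothetical helpers, only local `#/...` references are handled, circular references are not detected, and letting inline keys (such as data.min and data.max) win over the referenced definition is an assumption of this sketch.

```javascript
const isObject = (value) =>
  value !== null && typeof value === 'object' && !Array.isArray(value);

// Recursively merges `extra` into `base` without mutating either one.
function deepMerge(base, extra) {
  const merged = { ...base };
  for (const [key, value] of Object.entries(extra)) {
    merged[key] =
      isObject(value) && isObject(merged[key])
        ? deepMerge(merged[key], value)
        : value;
  }
  return merged;
}

// Replaces every `$ref` with the definition it points to in `root`.
function resolveRefs(node, root) {
  if (Array.isArray(node)) return node.map((item) => resolveRefs(item, root));
  if (!isObject(node)) return node;

  const { $ref, ...rest } = node;
  const resolved = {};
  for (const [key, value] of Object.entries(rest)) {
    resolved[key] = resolveRefs(value, root);
  }
  if (typeof $ref !== 'string') return resolved;

  // '#/definitions/Email' -> root.definitions.Email
  const target = $ref
    .replace(/^#\//, '')
    .split('/')
    .reduce((current, part) => current[part], root);
  return deepMerge(resolveRefs(target, root), resolved);
}

const model = {
  properties: {
    emails: {
      type: 'array',
      items: { $ref: '#/definitions/Email', data: { min: 0, max: 3 } },
    },
  },
  definitions: {
    Email: { data: { build: 'faker.internet.email()' } },
  },
};

const processed = resolveRefs(model, model);
console.log(processed.properties.emails.items);
// { data: { build: 'faker.internet.email()', min: 0, max: 3 } }
```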

Overriding Model Defaults

The model defaults can be overwritten at run time in the pre_run function. The this keyword in both the pre_run and post_run functions is the processed model. Below is an example of changing the number of documents the model should generate before the generation process starts.

name: Users
type: object
key: _id
data:
  pre_run: |
    this.data.count = 100
# etc...

This becomes beneficial if you are providing input data and want to generate a fixed number of documents. Take the following command for example:

Here we want to generate a countries model but we might not necessarily know the exact amount of data being provided by the input. We can reference the input data in our model's pre_run function and set the number to generate based on the input array.

name: Countries
type: object
key: _id
data:
  inputs: '../inputs/countries.csv'
  pre_run: |
    this.data.count = inputs.countries.length;
# etc...
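
The effect of that pre_run function can be shown with plain objects. In this sketch the `inputs` object is a hypothetical stand-in for the parsed contents of countries.csv (the rows are made up for illustration), and `preRun` is the model's pre_run body written as a normal function.

```javascript
// Hypothetical parsed input: one entry per row of the CSV file.
const inputs = {
  countries: [
    { code: 'CA', name: 'Canada' },
    { code: 'MX', name: 'Mexico' },
    { code: 'US', name: 'United States' },
  ],
};

// A simplified model object; `this` inside pre_run refers to it.
const model = { name: 'Countries', type: 'object', key: '_id', data: {} };

// The body of pre_run from the model above, as a regular function.
function preRun(inputs) {
  this.data.count = inputs.countries.length;
}

preRun.call(model, inputs);
console.log(model.data.count); // 3
```

Because the count is derived from the input at run time, the model does not need to change when rows are added to or removed from the input file.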

JS API

If you don't want to use the CLI version of this app, you can always use the JS API.

import Fakeit from 'fakeit'
const fakeit = new Fakeit()

fakeit.generate('glob/to/models/**/*.yaml')
  .then((data) => {
    console.log(data)
  })

Fakeit Options

Below are the default options that are used unless overwritten.

import Fakeit from 'fakeit'
const fakeit = new Fakeit({
  root: process.cwd(), // The root directory to operate from
  babel_config: '+(.babelrc|package.json)', // glob to search for the babel config. This search starts from the closest instance of `node_modules`
  seed: 0, // the seed to use. If it's 0 then a random seed is used each time. A string or a number can be passed in as a seed
  log: true, // if true then logging to the console is enabled
  verbose: false, // if true then verbose logging is enabled
  timestamp: true, // if true the logging output to console has timestamps
})

// models can be a comma-delimited string of globs, or an array of globs
// any models that are passed will be output/returned.
const models = 'glob/to/models/**/*.yaml'

fakeit.generate(models, {
  // this is the format to output it in
  // available formats `json`, `csv`, `yaml`, `yml`, `cson`
  format: 'json',

  // the character(s) to use for spacing
  spacing: 2,

  // The type of output to use. Below are the available types
  // `return`: This will return the data in an array
  // `console`: This will output the data to the console
  // `couchbase`: This will output the data to a Couchbase server.
  // `sync-gateway`: This will output the data to a Couchbase Sync Gateway server
  // `directory`: The directory path to output the files (aka `path/to/the/destination`)
  output: 'return',

  // limit how many files are output at a time, this is useful
  // to not overload a server or lock up your computer
  limit: 100,

  // this is used in the console output and if true it will
  // format and colorize the output
  highlight: true,

  // the file name of the zip file. Currently this can only be used if you're
  // outputting the data to a directory. It can't be used to output a zip file
  // to a server, the console, or returned. (aka `archive.zip`)
  archive: '',

  // These options are used if the `output` option is `sync-gateway`,
  // or `couchbase`. Otherwise they're ignored.
  server: '127.0.0.1', // the server address to use for the server
  bucket: 'default', // the bucket name
  username: '', // the username to use if applicable
  password: '', // the password for the account if applicable
  timeout: 5000, // timeout for the servers
})
  .then((data) => {
    // the data returned will always be a string in the format that was set
    data = JSON.parse(data)
    // do something with data array of arrays
  })

Examples

To see more examples of some of the things you can do, take a look at the test cases that are in this repo.

Changelog