finos / datahelix

The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation
https://finos.github.io/datahelix/
Apache License 2.0

Cardinality on column values #1682

Open gitsathish opened 4 years ago

gitsathish commented 4 years ago

Feature request

I'm wondering whether there is a way to impose a cardinality on column values. For example: generate 10000 rows with an integer column whose minimum and maximum are 1 and 25000, but with only 100 unique values of that integer across the 10000 rows.

Similar functionality for String columns would be useful as well.

tjohnson-scottlogic commented 4 years ago

Hi, thanks for reaching out. This can be achieved with the inSet constraint (see the User Guide or example for further info), like this:

[tim@sn1 bin]$ cat profiles/cardinality.json
{
  "fields": [
    {
      "name": "an_integer",
      "type": "integer",
      "nullable": false
    }
  ],
  "constraints": [
    {
      "field": "an_integer",
      "inSet": "integer_set.csv"
    }
  ]
}
[tim@sn1 bin]$ cat profiles/integer_set.csv
1
25000
10
1000
[tim@sn1 bin]$ ./datahelix --profile-file=profiles/cardinality.json --max-rows 3 --quiet
an_integer
25000
1000
1

Would this approach work for you? The same technique works for any data type, including strings.
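
For the original request (100 unique values between 1 and 25000), the set file would need 100 distinct entries rather than the four shown above. Here is a minimal Python sketch for producing such a file. The function name `write_value_set` and its parameters are hypothetical helpers, not part of DataHelix, and the sketch assumes the one-value-per-line CSV layout shown in integer_set.csv above.

```python
import csv
import random


def write_value_set(path, count=100, low=1, high=25000, seed=None):
    """Write `count` distinct integers from [low, high] to `path`, one per line.

    Hypothetical helper for illustration; not part of DataHelix.
    """
    rng = random.Random(seed)
    # sample() draws without replacement, so every value is distinct.
    values = rng.sample(range(low, high + 1), count)
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for value in values:
            writer.writerow([value])
    return values
```

Calling `write_value_set("profiles/integer_set.csv")` (with the `profiles` directory already in place) and then running the profile with `--max-rows 10000` should yield at most 100 distinct values in the column. Whether all 100 actually appear depends on how the generator picks from the set.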