finos / datahelix

The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation
https://finos.github.io/datahelix/
Apache License 2.0

Using an inSetConstraint with weights and an if constraint causes weights to be ignored #1705

Closed Tom-hayden closed 3 years ago

Tom-hayden commented 4 years ago

Bug report

Using an inSet constraint with weights alongside an if constraint appears not to work correctly: the weights are ignored. Related to https://github.com/finos/datahelix/issues/1704

Steps to reproduce:

Using the profile

{
    "fields": [
        {
            "name": "percent10",
            "type": "integer"
        },
        {
            "name": "name",
            "type": "firstname",
            "nullable": true
        }
    ],
    "constraints": [
        {
            "field": "percent10",
            "inSet": "percent10.csv"
        },
        {
            "if": {
                "field": "percent10",
                "equalTo": 1
            },
            "then": {
                "field": "name",
                "isNull": false
            },
            "else": {
                "field": "name",
                "isNull": true
            }
        }
    ]
}

With the CSV file percent10.csv (each line is a value followed by its weight)

1,10
0,90
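
To make the expectation concrete, here is a minimal Python sketch that normalises the weights from the file above into probabilities. It assumes each line is `value,weight`, as shown in this issue; the function name `expected_probabilities` is made up for illustration and is not part of DataHelix.

```python
import csv
import io

# Contents of percent10.csv as given in this issue: value,weight per line.
WEIGHTED_SET = "1,10\n0,90\n"


def expected_probabilities(text):
    """Return {value: probability} by normalising the weight column."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    weights = {value: float(weight) for value, weight in rows}
    total = sum(weights.values())
    return {value: weight / total for value, weight in weights.items()}


probabilities = expected_probabilities(WEIGHTED_SET)
print(probabilities)
```

Under that reading, percent10 should be 1 in roughly 10% of rows and 0 in roughly 90%, which is where the expected share of null names comes from.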

Run datahelix in random mode. Here are the command line arguments I used to run the profile:

generate
--max-rows=1000
--replace
--profile-file=D:\path\to\helix\profile.json
--output-path=D:\path\to\helix\out.csv
--output-format=csv
--generation-type=RANDOM
--set-from-file-directory=D:\path\to\helix

Expected result:

Around 90% of names are null.

Actual result:

percent10,name
0,
1,Freya
1,Olivia
0,
1,Anna
1,Logan
1,Rory
0,
0,
0,
0,
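
The share of null names in an output like this can be measured with a short script. The sketch below is plain Python run against the 11 sample rows pasted above, and it assumes a null name is written as an empty CSV field, as it appears in that sample. The helper `null_name_fraction` is a hypothetical name.

```python
import csv
import io

# The sample rows pasted in this issue (not the full 1000-row output).
SAMPLE = """percent10,name
0,
1,Freya
1,Olivia
0,
1,Anna
1,Logan
1,Rory
0,
0,
0,
0,
"""


def null_name_fraction(text):
    """Fraction of rows whose name column is empty (treated as null)."""
    rows = list(csv.DictReader(io.StringIO(text)))
    nulls = sum(1 for row in rows if row["name"] == "")
    return nulls / len(rows)


fraction = null_name_fraction(SAMPLE)
print(f"{fraction:.0%} of names are null")
```

In this sample 6 of the 11 names are null, far closer to an even split than to the expected 90%.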

Additional context:

This appears to be an issue with the order in which the constraints are evaluated. See line 68 of RowSpecTreeSolver for a potential starting point.

sl-slaing commented 3 years ago

This sounds like it is due to the way the random value generator selects values from the set. I suspect this is where we need to look.

It gives the appearance that it is randomly selecting 0 or 1 where it should be randomly selecting from (in effect) 1,0,0,0,0,0,0,0,0,0 (or using the weightings to select the value).

With random, each row is produced independently, so one row does not affect the weighting of the values available for a field in the next row. This is probably where the change is supposed to be, but it could be patched with a pre-projection of weighted values like the one above (1,0,0,0,0,0,0,0,0,0).
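
The difference between the three selection strategies discussed here can be sketched in Python. This is an illustration only, not DataHelix's Java implementation; all function names are hypothetical. It uses the weights from percent10.csv (1 at weight 10, 0 at weight 90), so the pre-projected list contains one 1 and nine 0s.

```python
import math
import random
from collections import Counter
from functools import reduce

VALUES = ["1", "0"]
WEIGHTS = [10, 90]  # from percent10.csv in this issue


def pick_uniform(rng):
    """Ignores the weights: each value is equally likely (the observed behaviour)."""
    return rng.choice(VALUES)


def pick_weighted(rng):
    """Uses the weights directly when selecting a value."""
    return rng.choices(VALUES, weights=WEIGHTS, k=1)[0]


def project(values, weights):
    """Repeat each value in proportion to its weight, reduced by the common divisor."""
    divisor = reduce(math.gcd, weights)
    return [v for v, w in zip(values, weights) for _ in range(w // divisor)]


def pick_projected(rng, projection):
    """Uniform choice over the pre-projected list, equivalent to a weighted choice."""
    return rng.choice(projection)


rng = random.Random(1705)  # fixed seed so the run is repeatable
projection = project(VALUES, WEIGHTS)
n = 10_000
uniform = Counter(pick_uniform(rng) for _ in range(n))
weighted = Counter(pick_weighted(rng) for _ in range(n))
projected = Counter(pick_projected(rng, projection) for _ in range(n))
for label, counts in [("uniform", uniform), ("weighted", weighted), ("projected", projected)]:
    print(label, {value: counts[value] / n for value in VALUES})
```

The projected list is only practical when the reduced integer weights are small. A direct weighted choice avoids building the list and also handles non-integer weights.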

sl-slaing commented 3 years ago

This bug seems to be a duplicate of #1704. I suggest closing it and using #1704 to track and resolve the issue.