finos / datahelix

The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation
https://finos.github.io/datahelix/
Apache License 2.0

Generate value for a nullable column with a percentage #1704

Open semisft opened 4 years ago

semisft commented 4 years ago

Some columns must be populated for only a given percentage of rows; for example, one field must be 10% filled and another 30% filled in the same profile. For 10%, I tried a field backed by a weighted inSet file and used it in an if statement, but the results seem to give 50%. How can I configure this?

percent10.csv

1,10
0,90

profile.json

{
    "fields": [
        {
            "name": "percent10",
            "type": "integer"
        },
        {
            "name": "name",
            "type": "firstname",
            "nullable": true
        }
    ],
    "constraints": [
        {
            "field": "percent10",
            "inSet": "percent10.csv"
        },
        {
            "if": {
                "field": "percent10",
                "equalTo": 1
            },
            "then": {
                "field": "name",
                "isNull": false
            },
            "else": {
                "field": "name",
                "isNull": true
            }
        }
    ]
}

Tom-hayden commented 4 years ago

Hi @semisft, this appears to be a bug in DataHelix. I have raised an issue for it here: https://github.com/finos/datahelix/issues/1705

sl-slaing commented 3 years ago

I've run the above profile against the latest version of the code to check whether the issue still exists. An example of the output (30 rows) is below:

percent10,name
1,Rory
1,Lily
1,Finn
0,
0,
0,
0,
0,
1,Amelia
1,Thea
1,Zara
1,Christina
1,Jake
0,
1,Maya
1,Liam
0,
1,Zac
1,Hamish
0,
0,
0,
0,
1,Lila
0,
0,
0,
1,Frank
0,
1,Phoebe

This shows a 50/50 split between the two values of percent10 (15 rows each), where the weights in percent10.csv call for 10% (3 rows) with 1 and 90% (27 rows) with 0. The issue is confirmed as still valid; will investigate further.

sl-slaing commented 3 years ago

Investigation: In RandomRowSpecDecisionTreeWalker, a list of rowSpecs is generated that represents the rows that can be produced. They are generated as:

  1. name=not null & in (names) and percent10=not null & in (1)
  2. name=null and percent10=not null & in (0)

The generator then randomly selects between the two items above to generate rows. However, the items carry no weighting (which could have been inherited from the weights on the values for percent10), so the generator picks each spec with equal probability and produces an even spread of rows across the two.
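The effect described above can be reproduced outside DataHelix with a small simulation. This is only an illustrative sketch in Python, not DataHelix code: the two strings are hypothetical stand-ins for the row specs listed above, and the uniform pick is an assumption about how the selection behaves, based on the description in this comment.

```python
import random
from collections import Counter

# Hypothetical stand-ins for the two row specs listed above.
ROW_SPECS = [
    "percent10=1, name not null",
    "percent10=0, name null",
]

SAMPLE_SIZE = 10_000


def sample_uniform(n, seed=0):
    """Pick a row spec uniformly at random n times, ignoring any weights."""
    rng = random.Random(seed)
    return Counter(rng.choice(ROW_SPECS) for _ in range(n))


counts = sample_uniform(SAMPLE_SIZE)
uniform_share = counts[ROW_SPECS[0]] / SAMPLE_SIZE
print(f"share of rows with percent10=1: {uniform_share:.2f}")
```

With a uniform pick the share of percent10=1 rows settles near one half, regardless of the 10/90 weights in percent10.csv.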

Either of the below (or something more elegant) would be required:

  1. The items above need to carry their weighting, i.e. item 1 = 10% and item 2 = 90%, and the getRandomRowSpec() method needs to use it
  2. The items above are duplicated as many times as needed to create a representative spread, i.e. 9 copies of item 2 for every 1 copy of item 1, giving a pool of row specs that can be selected from uniformly at random
  3. something else
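The first two options above can be sketched as follows. This is a minimal Python illustration rather than a patch for DataHelix: the spec strings, `pick_weighted` and `expand_by_weight` are hypothetical names introduced here, and the weights 10 and 90 are taken from percent10.csv.

```python
import random
from collections import Counter
from functools import reduce
from math import gcd

# Hypothetical (row spec, weight) pairs; weights come from percent10.csv.
WEIGHTED_ROW_SPECS = [
    ("percent10=1, name not null", 10),
    ("percent10=0, name null", 90),
]


def pick_weighted(rng, weighted_specs):
    """Option 1: select a spec with probability proportional to its weight."""
    specs = [spec for spec, _ in weighted_specs]
    weights = [weight for _, weight in weighted_specs]
    return rng.choices(specs, weights=weights, k=1)[0]


def expand_by_weight(weighted_specs):
    """Option 2: duplicate each spec in proportion to its weight.

    Weights are divided by their greatest common divisor so the pool
    stays small (10:90 becomes 1:9). A uniform pick from the pool then
    reproduces the weighted distribution.
    """
    divisor = reduce(gcd, (weight for _, weight in weighted_specs))
    return [
        spec
        for spec, weight in weighted_specs
        for _ in range(weight // divisor)
    ]


rng = random.Random(0)
SAMPLE_SIZE = 10_000

weighted_counts = Counter(
    pick_weighted(rng, WEIGHTED_ROW_SPECS) for _ in range(SAMPLE_SIZE)
)
weighted_share = weighted_counts["percent10=1, name not null"] / SAMPLE_SIZE

pool = expand_by_weight(WEIGHTED_ROW_SPECS)
pool_counts = Counter(rng.choice(pool) for _ in range(SAMPLE_SIZE))
pool_share = pool_counts["percent10=1, name not null"] / SAMPLE_SIZE

print(f"option 1 share of percent10=1 rows: {weighted_share:.2f}")
print(f"option 2 share of percent10=1 rows: {pool_share:.2f}")
```

Option 1 keeps the list of specs small but changes the selection step; option 2 leaves a uniform selection untouched at the cost of a larger list, which could grow quickly if the weights share no common divisor.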