galatea-associates / fuse-test-data-gen

Repository for the Galatea internal data generator tool, used for generating domain data for POCs

Weight distribution of dependencies #216

Open WilfGala opened 4 years ago

WilfGala commented 4 years ago

Issue Description

Currently, every local database record of a domain object has an equal chance of being selected as a dependency, provided it meets any additional criteria. For example, every Account record with an "account_type" of "depot" in the database accounts table has an equal chance of being randomly chosen to have its account_id attribute referenced in the depot_id attribute of a Depot Position domain object record.

It would be desirable to let users weight the distribution of dependencies so that, in the example above, some accounts are referenced by significantly more depot positions than others. For example, one depot-type account may be referenced by 500 depot positions and another by none at all.

Design

The user-facing config.json should be amended to allow users to specify the distribution of dependencies.

A simple way of doing this might be to have a rule that applies to all dependencies across the board of the format:

A specific X% of dependent objects will be used Y% of the time, and the other (100-X)% will be used the other (100-Y)% of the time, where X and Y are both integers between 0 and 100.
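As an illustration of how such a rule might be read from config.json, here is a minimal sketch. The key names "dependency_weighting", "pool_percent" (X) and "usage_percent" (Y) are hypothetical and not part of the existing config format; the function only checks that both values are integers between 0 and 100.

```python
import json


def parse_dependency_weighting(config_text):
    """Read the two percentages of the weighting rule from a JSON config string.

    The "dependency_weighting", "pool_percent" and "usage_percent" keys are
    hypothetical names chosen for this sketch.
    """
    rule = json.loads(config_text)["dependency_weighting"]
    pool_percent = rule["pool_percent"]    # X: share of candidate records
    usage_percent = rule["usage_percent"]  # Y: share of selections
    for name, value in (("pool_percent", pool_percent),
                        ("usage_percent", usage_percent)):
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int) or not 0 <= value <= 100:
            raise ValueError(f"{name} must be an integer between 0 and 100, got {value!r}")
    return pool_percent, usage_percent


# Hypothetical config fragment expressing "20% of objects, 70% of the time"
example_config = '{"dependency_weighting": {"pool_percent": 20, "usage_percent": 70}}'
pool_percent, usage_percent = parse_dependency_weighting(example_config)
```

A single global rule keeps the config small; per-dependency overrides could be layered on later if needed.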

Some pseudo code to demonstrate a rough implementation:

Rule: 20% of dependent objects will be used 70% of the time, and the other 80% will be used the other 30% of the time

if random.randint(1, 100) <= 70:
    # 70% of the time: randomly select from the first 20% of suitable domain objects
    # this can be done by constructing an appropriate query
    # for example, a subquery could be nested in the main query to calculate 20% of the total number of rows in the table of suitable records, so that only that number of rows is retrieved to select from
else:
    # the other 30% of the time: randomly select from the remaining 80% of suitable domain objects
    # construct an appropriate query as described above

Note that the above is just one way of quantifying and implementing this idea, and any other reasonable solution would be ok.
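The rule can also be sketched in memory, without a database query, which may be useful for prototyping. This is only a sketch under assumptions: the function name pick_weighted_dependency and its parameters are hypothetical, the candidates are assumed to be already filtered and in a stable order, and the favoured pool is taken to be the first X% of that list, rounded down.

```python
import random


def pick_weighted_dependency(candidates, pool_percent=20, usage_percent=70, rng=random):
    """Pick one candidate so that the first pool_percent% of candidates
    are chosen roughly usage_percent% of the time."""
    if not candidates:
        raise ValueError("no suitable dependency records to choose from")
    # Size of the favoured pool: the first X% of candidates, rounded down
    split = len(candidates) * pool_percent // 100
    favoured, rest = candidates[:split], candidates[split:]
    # If either pool is empty (e.g. very few candidates), fall back to a uniform choice
    if not favoured or not rest:
        return rng.choice(candidates)
    # Y% of the time draw from the favoured pool, otherwise from the rest
    if rng.randint(1, 100) <= usage_percent:
        return rng.choice(favoured)
    return rng.choice(rest)


# Usage: 100 hypothetical account ids, 10,000 simulated depot positions
rng = random.Random(0)
account_ids = list(range(100))
picks = [pick_weighted_dependency(account_ids, 20, 70, rng) for _ in range(10_000)]
favoured_share = sum(1 for p in picks if p < 20) / len(picks)
```

Here favoured_share should land close to 0.7. Passing a seeded random.Random instance keeps runs reproducible, which helps when testing the distribution.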

Testing

The nature of the testing will depend on the form of implementation, but a test should be added for each dependent attribute of a domain object to check that the distribution is as expected. Rather than asserting a single True/False condition, it might make more sense to count how often each dependency is referenced and check that the counts match the expected distribution, within a tolerance that allows for randomness.
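One way to do this counting is sketched below. It assumes the test can collect the list of referenced ids (for example, every depot_id written to generated Depot Positions) and knows which ids belong to the favoured group; the helper names favoured_share and distribution_matches and the 0.05 default tolerance are hypothetical choices for illustration.

```python
from collections import Counter


def favoured_share(referenced_ids, favoured_ids):
    """Return the fraction of references that point at the favoured ids."""
    counts = Counter(referenced_ids)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no references to analyse")
    # Counter returns 0 for ids that were never referenced
    return sum(counts[i] for i in favoured_ids) / total


def distribution_matches(referenced_ids, favoured_ids, expected_share, tolerance=0.05):
    """Check the observed share against the expected share, within a tolerance."""
    observed = favoured_share(referenced_ids, favoured_ids)
    return abs(observed - expected_share) <= tolerance


# Usage with hand-made data: account "a" is the favoured one
references = ["a"] * 7 + ["b"] * 2 + ["c"]
ok = distribution_matches(references, {"a"}, expected_share=0.7)
```

The tolerance needs to be wide enough for the sample size used in the test; a small sample with a tight tolerance will produce flaky failures.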

Documentation Changes

Method docstrings and the project readme should be updated to explain the new functionality and guide users in setting the config parameters.

Test Evidence

Testing methodology should be implemented and should indicate that distribution weighting works as expected. All existing tests should still pass as expected.

Validation in Develop

Output from running python src/app.py should be as expected