DataHub and DataHelix to create a central area of requirements for Synthetic Data Generation

mcleo-d commented 4 years ago

cc @andrewcarrblue and @grovesy

Description

The DataHub and DataHelix projects would like to create a central area of requirements for Synthetic Data Generation. This issue has been created to define the requirements for a central workspace.

Success Criteria

[ ] DataHub and DataHelix to add initial project requirements to this issue using the following table format for each individual project - https://github.com/finos/datahub/issues/30#issuecomment-647460426
[ ] DataHub and DataHelix to define how the teams should collaborate and how a central workspace will be maintained.
[ ] DataHub and DataHelix to define the requirements of the central workspace and whether FINOS should provide infrastructure.
[ ] DataHub and DataHelix to define how requirements will be groomed, prioritised and allocated to team members to resolve.

mcleo-d commented 4 years ago

@andrewcarrblue and @grovesy,

The following code snippet creates a table that can be used by DataHub and DataHelix to create initial, individual sets of synthetic data generation requirements.

Please use as required for DataHelix and DataHub, within the comments of this issue, before a central project area is created for joint requirements gathering.

### DataHelix _or_ DataHub Requirements (remove project name where required)
The following is a table of DataHelix or DataHub prioritised project requirements, including where the issues have been raised on project backlogs.

| Requirement | Description |  Project | Priority | GitHub Issue Created |
|---------- |------------- |----------|----------|----------|
| First Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | High | <ul><li>[ ] #30</li></ul> |
| Second Requirement | Example requirement description | [DataHelix](https://github.com/finos/datahelix) | Medium | <ul><li>[x] #30</li></ul> |
| Third Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | Low | <ul><li>[ ] #30</li></ul> |

Renders as below ...

DataHelix or DataHub Requirements (remove project name where required)

The following is a table of DataHelix or DataHub prioritised project requirements, including where the issues have been raised on project backlogs.

| Requirement | Description | Project | Priority | GitHub Issue Created |
|---------- |------------- |----------|----------|----------|
| First Requirement | Example requirement description | DataHub | High | [ ] #30 |
| Second Requirement | Example requirement description | DataHelix | Medium | [x] #30 |
| Third Requirement | Example requirement description | DataHub | Low | [ ] #30 |

andrewcarrblue commented 4 years ago

Use Cases

High volume, reasonably realistic data – typically used for soak tests and volume tests
Lower volume, hyper realistic data – typically used to check system functionality

Requirements for a Synthetic Data Generator

Requirements around generating “similar” reasonably realistic data

| Requirement | Description | Priority | GitHub Issue |
|---------- |------------- |----------|----------|
|  | Ability to look at some given data (maybe production data) and observe the profile, shape and characteristics of that data, which can then be used to generate “lifelike” synthetic data |  |  |
|  | Ability to review the profile, shape and characteristics of the data observed, so that sensitive enumerations, shapes, profiles, names etc. can be removed, modified or changed |  |  |
|  | Characteristics of data that would be good to be observable (Basic): <ul><li>Type of column – integer, float, string</li><li>Whether the column is an enum (enumeration) – a set of values from a given set</li><li>Whether the column is freeform text, and if so, what other characteristics it has: <ul><li>Average length, min length, max length</li><li>Valid characters</li><li>Control characters (including end of line/return)</li></ul></li></ul> |  |  |
|  | <ul><li>Whether there are any relationships between the columns (combinations of valid enumerations, values related to enumerations)</li><li>Any more advanced distributions in value columns: Poisson, normal and other distributions</li><li>Advanced recognition of data types: recognising financial services specific values such as RICs, ISINs, CUSIPs</li><li>Ability to generate custom field values based on code</li></ul> |  |  |
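
To make the basic observable characteristics above more concrete, here is a minimal Python sketch of a column profiler. It is illustrative only and is not taken from DataHub or DataHelix: the names `profile_column`, `infer_type`, `is_isin` and the `enum_max_distinct` threshold are assumptions, and the enum heuristic (few distinct values, at least one repeated) is just one possible rule. The ISIN check uses the standard shape (two letters, nine alphanumerics, one check digit) and its Luhn check digit.

```python
import re
import unicodedata
from collections import Counter

ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")


def is_isin(value):
    """Return True if value has the ISIN shape and a valid Luhn check digit."""
    if not ISIN_RE.match(value):
        return False
    # Letters are expanded to two digits each: A=10 ... Z=35.
    digits = "".join(str(int(ch, 36)) for ch in value)
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def infer_type(values):
    """Return 'integer', 'float' or 'string' for a list of raw strings."""
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def profile_column(values, enum_max_distinct=10):
    """Observe basic characteristics of one column (a non-empty list of strings)."""
    if not values:
        raise ValueError("cannot profile an empty column")
    counts = Counter(values)
    lengths = [len(v) for v in values]
    chars = set("".join(values))
    # Assumed heuristic: an enum has few distinct values and repeats some of them.
    is_enum = len(counts) <= enum_max_distinct and len(counts) < len(values)
    return {
        "type": infer_type(values),
        "is_enum": is_enum,
        "enum_values": sorted(counts) if is_enum else None,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": sum(lengths) / len(lengths),
        "characters": sorted(chars),
        "control_characters": sorted(
            c for c in chars if unicodedata.category(c) == "Cc"
        ),
        "all_isin": all(is_isin(v) for v in values),
    }
```

Because the profile is a plain dictionary, it can be reviewed and edited (for example, dropping sensitive `enum_values`) before anything is generated from it, which is the review step described in the second row.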

Requirements around generating “hyper” realistic data

| Requirement | Description | Priority | GitHub Issue |
|---------- |------------- |----------|----------|
|  | Components with the ability to add code to generate data |  |  |
|  | The code to be able to access business logic to generate data which matches given rules (maybe even share business logic with code from the system?) |  |  |
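
As a rough illustration of the second row, the sketch below generates trade records in Python where some fields are derived by a business rule rather than drawn at random. Everything here is hypothetical: the trade fields, the T+2 settlement rule and the function names `add_business_days` and `generate_trade` are assumptions chosen for the example, and public holidays are ignored. The point is only that the generator calls the same rule function the system itself could use.

```python
import random
from datetime import date, timedelta


def add_business_days(start, days):
    """Hypothetical shared business rule: step forward, skipping weekends."""
    current = start
    while days > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days -= 1
    return current


def generate_trade(rng):
    """Generate one trade whose derived fields obey the rules above."""
    # Pick a trade date that is itself a business day.
    trade_date = add_business_days(date(2020, 6, 30), rng.randrange(1, 21))
    quantity = rng.randrange(1, 1000)
    price = round(rng.uniform(1, 500), 2)
    return {
        "trade_date": trade_date,
        "settle_date": add_business_days(trade_date, 2),  # assumed T+2 rule
        "quantity": quantity,
        "price": price,
        "notional": round(quantity * price, 2),
    }


rng = random.Random(42)  # seeded so runs are repeatable
sample = [generate_trade(rng) for _ in range(3)]
```

Passing in a seeded `random.Random` keeps the output reproducible, which tends to matter for functional tests that compare against known data.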

mcleo-d commented 4 years ago

@grovesy and @andrewcarrblue,

Thanks for meeting earlier and thanks for the contribution above @andrewcarrblue 👍

I've taken your original bullets and produced the MD table above, using each bullet as the Description and adding the other columns for you and @grovesy to work through and prioritise.

The Requirement column is a short title for each item, for quick reference. Could you add these if you're able to?

James.

mcleo-d commented 4 years ago

@grovesy, @andrewcarrblue, @kimmyyoo and @BenFielding

On Monday 27th July we have the next DataHub and DataHelix standup. Do you have any requirements or issues you'd like to discuss so they can be added to the agenda?

Keep in mind, our initial focus is to prioritise industrial requirements so we can move into continuous development between both projects.

Many thanks 🤜 💥 🚀

James.

andrewcarrblue commented 4 years ago

To help us have a common language, it would be great for us to agree on terminology and use cases.

Types of data

Reference data (slow to change): (1) publicly available, (2) private reference data
Event data (transaction logs, web logs, trade events, etc.)
System data (configuration data for the system itself)

Type of event data

Independent events (rows which have no dependency on each other) / Dependent events
Rules based columns (columns which have dependencies/relationships on each other)
Linked events - rows which have a connection, relationship to data external to this table

Use Cases

Volume/throughput/stress testing - high volume, reasonably realistic data
Functional/feature - low volume, super accurate/realistic data
Machine learning training - volume, realistic data, with realistic statistical shapes

Attributes

Statistical profiles
Rules

BenFielding commented 4 years ago

Requirements

Additional project requirements from Gensyn perspective (highly representative synthetic data).

| Requirement | Description | GitHub Issue Created |
|---------- |------------- |----------|
| Rules to pick generative models | Discover statistical properties that determine which generative models can re-create the source data most effectively. Particularly data volume. Key properties: Distribution, Sparsity, Discrete vs continuous, and Volume. | [ ] |
| Similarity-analysis | How closely does the generated data resemble the real data, does this present a privacy risk, and can this be quantified? A common approach is Distance to Closest Record using Euclidean Distance. Here. | [ ] |
| Missing value specification | Are missing values in a column MCAR, MAR, or MNAR? Important when using generative models and listwise deletion. | [ ] |

This also comes alongside an additional use case that we'd add to the above from @andrewcarrblue, which is:

Machine learning feature engineering - medium volume, realistic data, with realistic statistical shapes

This would be used by a machine learning researcher/developer/engineer to perform exploratory data analysis (EDA) and to build a prototype model, before deploying over a federated learning infrastructure (e.g. as provided by Gensyn).
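
For reference, the Distance to Closest Record approach named in the Similarity-analysis row can be sketched in a few lines of Python. This is a minimal illustration, not Gensyn's implementation: the function name `distance_to_closest_record` is an assumption, rows are assumed to be numeric tuples of equal length, and features are assumed to be already scaled so that no single column dominates the Euclidean distance.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


real_rows = [(0.0, 0.0), (3.0, 4.0)]
synthetic_rows = [(0.0, 0.0), (3.0, 0.0)]
dcr = distance_to_closest_record(synthetic_rows, real_rows)
print(dcr)  # [0.0, 3.0]
```

A distance of 0 means a synthetic row is an exact copy of a real row, which is the obvious privacy concern. This brute-force version compares every pair of rows, so larger datasets would likely need a nearest-neighbour index instead.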

mcleo-d commented 4 years ago

Thanks for adding the requirements @BenFielding.

I'm now adding these to a markdown document and will link the PR here.

James.

mcleo-d commented 4 years ago

@andrewcarrblue and @BenFielding,

I've taken the comments above and created #39 to help groom and prioritise the requirements before taking them into development.

Please give feedback on the PR and documents directly, to help with the scaling of this issue.
