finos / datahub

DataHub - Synthetic data library
https://datahub.finos.org
Apache License 2.0

DataHub and DataHelix to create a central area of requirements for Synthetic Data Generation #30

Closed mcleo-d closed 4 years ago

mcleo-d commented 4 years ago

cc @andrewcarrblue and @grovesy

Description

The DataHub and DataHelix projects would like to create a central area of requirements for Synthetic Data Generation. This issue has been created to define the requirements for a central workspace.

Success Criteria

mcleo-d commented 4 years ago

@andrewcarrblue and @grovesy,

The following code snippet creates a table that can be used by DataHub and DataHelix to create initial, individual sets of synthetic data generation requirements.

Please use as required for DataHelix and DataHub, within the comments of this issue, before a central project area is created for joint requirements gathering.

### DataHelix _or_ DataHub Requirements (remove project name where required)
The following is a table of DataHelix or DataHub prioritised project requirements, including where the issues have been raised on project backlogs.

| Requirement | Description |  Project | Priority | GitHub Issue Created |
|---------- |------------- |----------|----------|----------|
| First Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | High | <ul><li>[ ] #30</li></ul> |
| Second Requirement | Example requirement description | [DataHelix](https://github.com/finos/datahelix) | Medium | <ul><li>[x] #30</li></ul> |
| Third Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | Low | <ul><li>[ ] #30</li></ul> |

Renders as below ...

DataHelix or DataHub Requirements (remove project name where required)

The following is a table of DataHelix or DataHub prioritised project requirements, including where the issues have been raised on project backlogs.

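| Requirement | Description | Project | Priority | GitHub Issue Created |
|---------- |------------- |----------|----------|----------|
| First Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | High | <ul><li>[ ] #30</li></ul> |
| Second Requirement | Example requirement description | [DataHelix](https://github.com/finos/datahelix) | Medium | <ul><li>[x] #30</li></ul> |
| Third Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | Low | <ul><li>[ ] #30</li></ul> |
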
andrewcarrblue commented 4 years ago

Use Cases

Requirements for a Synthetic Data Generator

Requirements around generating “similar” reasonably realistic data

| Requirement | Description | Priority | GitHub Issue |
|---------- |------------- |----------|----------|
|  | Ability to look at some given data, perhaps production data, and observe the profile, shape and characteristics of that data, which can then be used to generate “life-like” synthetic data |  |  |
|  | Ability to review the profile, shape and characteristics of the data observed, so that sensitive enumerations, shapes, profiles, names etc. can be removed, modified or changed |  |  |
|  | Characteristics of data that would be good to be observable <ul><li>(Basic) Type of column – integer, float, string</li><li>Whether the column is an enum (enumeration) – a set of values from a given set</li><li>Whether the column is freeform text and, if so, what other characteristics it has</li><li>Average length, min length, max length</li><li>Valid characters</li><li>Control characters (including end of line/return)</li></ul> |  |  |
|  | Whether there are any relationships between the columns (combinations of valid enumerations, values related to enumerations) <ul><li>Any more advanced distributions in value columns: Poisson, normal and other distributions</li><li>Advanced recognition of data types</li><li>Recognising financial services specific values such as RICs, ISINs, CUSIPs</li><li>Ability to generate custom field values based on code</li></ul> |  |  |
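
As a rough illustration of the basic observable characteristics listed above, the following Python sketch profiles a single column of string values. The function names `infer_basic_type` and `profile_column` and the `enum_max_distinct` threshold are hypothetical and are not part of DataHub or DataHelix. The enum rule used here (a small number of distinct values that repeat) is an assumption, and the column is assumed to be non-empty.

```python
from collections import Counter
from statistics import mean


def infer_basic_type(values):
    """Return 'integer', 'float' or 'string' for a list of raw string values."""
    def all_parse(cast):
        # True only if every value can be converted with the given cast.
        try:
            for value in values:
                cast(value)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "integer"
    if all_parse(float):
        return "float"
    return "string"


def profile_column(values, enum_max_distinct=10):
    """Summarise one non-empty column of strings.

    enum_max_distinct is an arbitrary, hypothetical cut-off for treating
    a string column as an enumeration.
    """
    distinct = Counter(values)
    lengths = [len(value) for value in values]
    basic_type = infer_basic_type(values)
    # Assumed enum rule: a string column with few distinct, repeating values.
    is_enum = (
        basic_type == "string"
        and len(distinct) <= enum_max_distinct
        and len(distinct) < len(values)
    )
    profile = {
        "type": basic_type,
        "is_enum": is_enum,
        "min_length": min(lengths),
        "max_length": max(lengths),
        "avg_length": mean(lengths),
        # Characters actually observed in the column.
        "characters": sorted(set("".join(values))),
        # ASCII control characters, including line feed and carriage return.
        "has_control_chars": any(ord(c) < 32 for v in values for c in v),
    }
    if is_enum:
        profile["enum_values"] = sorted(distinct)
    return profile


side_profile = profile_column(["BUY", "SELL", "BUY", "HOLD"])
```

A reviewer could inspect a profile such as `side_profile` and remove or alter sensitive entries (for example `enum_values`) before it is used to drive generation. Relationships between columns and distribution fitting are not covered by this sketch.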

Requirements around generating “hyper” realistic data

| Requirement | Description | Priority | GitHub Issue |
|---------- |------------- |----------|----------|
|  | Components with the ability to add code to generate data |  |  |
|  | Ability for that code to access business logic, so the generated data matches given rules (maybe even sharing business logic with code from the system?) |  |  |

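
One way to picture "components with the ability to add code" is a small registry of per-field generator functions, where a later field can apply a business rule to earlier fields. This is only a sketch: the `GENERATORS` registry, the `field` decorator, the field names and the sign rule are all hypothetical, and nothing here reflects how DataHub or DataHelix implement generation.

```python
import random
from typing import Callable, Dict

# Hypothetical registry: field name -> function(rng, partial_row) -> value.
GENERATORS: Dict[str, Callable[[random.Random, dict], object]] = {}


def field(name):
    """Decorator that registers a generator function for one field."""
    def register(fn):
        GENERATORS[name] = fn
        return fn
    return register


@field("side")
def gen_side(rng, row):
    return rng.choice(["BUY", "SELL"])


@field("quantity")
def gen_quantity(rng, row):
    return rng.randint(1, 1000)


@field("signed_quantity")
def gen_signed_quantity(rng, row):
    # Illustrative business rule (an assumption): sells are negative.
    # In a real system this rule could be imported from shared code.
    return row["quantity"] if row["side"] == "BUY" else -row["quantity"]


def generate_rows(count, seed=0):
    """Generate rows by running each registered generator in order."""
    rng = random.Random(seed)
    rows = []
    for _ in range(count):
        row = {}
        # Dicts keep insertion order, so fields run in registration order
        # and later generators can read values produced by earlier ones.
        for name, generator in GENERATORS.items():
            row[name] = generator(rng, row)
        rows.append(row)
    return rows


rows = generate_rows(50)
```

Passing a seeded `random.Random` into every generator keeps runs reproducible, which tends to matter when synthetic data feeds automated tests.
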
mcleo-d commented 4 years ago

@grovesy and @andrewcarrblue,

Thanks for meeting earlier and thanks for the contribution above @andrewcarrblue 👍

I've taken the original bullets and have produced the MD table above using your bullets as the Description and adding the additional columns for you and @grovesy to work through and prioritise.

The Requirement column is the title of the item for quick reference, if you're able to add?

James.

mcleo-d commented 4 years ago

@grovesy, @andrewcarrblue, @kimmyyoo and @BenFielding

On Monday 27th July we have the next DataHub and DataHelix standup. Do you have any requirements or issues you'd like to discuss so they can be added to the agenda?

Keep in mind, our initial focus is to prioritise industrial requirements so we can move into continuous development between both projects.

Many thanks 🤜 💥 🚀

James.

andrewcarrblue commented 4 years ago

To help us have a common language, it would be great for us to agree on terminology and use cases

Types of data

Type of event data

Use Cases

Attributes

BenFielding commented 4 years ago

Requirements

Additional project requirements from Gensyn perspective (highly representative synthetic data).

| Requirement | Description | GitHub Issue Created |
|---------- |------------- |----------|
| Rules to pick generative models | Discover statistical properties that determine which generative models can re-create the source data most effectively. Particularly data volume. Key properties: Distribution, Sparsity, Discrete vs continuous, and Volume. | <ul><li>[ ]</li></ul> |
| Similarity-analysis | How closely does the generated data resemble the real data, does this present a privacy risk and can this be quantified? A common approach is Distance to Closest Record using Euclidean Distance. Here. | <ul><li>[ ]</li></ul> |
| Missing value specification | Are missing values in a column MCAR, MAR, or MNAR? Important when using generative models and listwise deletion. | <ul><li>[ ]</li></ul> |
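
To make the similarity-analysis requirement concrete, here is a minimal sketch of Distance to Closest Record (DCR) with Euclidean distance, using only the Python standard library. The function name `distance_to_closest_record` and the toy data are hypothetical. The sketch assumes every record is a tuple of numeric features already scaled to comparable ranges, and it uses a brute-force search that would need replacing for large datasets.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic record, return the Euclidean distance
    to its nearest record in the real dataset."""
    return [
        min(math.dist(synth_row, real_row) for real_row in real)
        for synth_row in synthetic
    ]


# Toy data (made up for illustration only).
real = [(0.0, 0.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0), (6.0, 8.0)]

dcr = distance_to_closest_record(synthetic, real)

# A distance of zero means a synthetic record exactly copies a real one,
# which is the clearest privacy warning sign this measure can give.
exact_copies = sum(1 for distance in dcr if distance == 0.0)
```

How small a distance is "too close" is not defined in this thread; any threshold would be a project decision, typically set by comparing against distances between real records themselves.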

This also comes alongside an additional use case that we'd add to the above from @andrewcarrblue, which is:

This would be used by a machine learning researcher/developer/engineer to perform exploratory data analysis (EDA) and to build a prototype model, before deploying over a federated learning infrastructure (e.g. as provided by Gensyn).

mcleo-d commented 4 years ago

Thanks for adding the requirements @BenFielding.

I'm now adding to a markdown document and will link the PR here.

James.

mcleo-d commented 4 years ago

@andrewcarrblue and @BenFielding ,

I've taken the comments above and have created #39 to help groom and prioritise the requirements before taking them into development.

Please give feedback on the PR and documents directly to help with the scaling of this issue.