mcleo-d closed this issue 4 years ago
@andrewcarrblue and @grovesy,
The following code snippet creates a table that can be used by DataHub and DataHelix to create initial, individual sets of synthetic data generation requirements.
Please use as required for DataHelix and DataHub, within the comments of this issue, before a central project area is created for joint requirements gathering.
### DataHelix _or_ DataHub Requirements (remove project name where required)
The following is a table of DataHelix or DataHub prioritised project requirements, including where the issues have been raised on project backlogs.
| Requirement | Description | Project | Priority | GitHub Issue Created |
|---------- |------------- |----------|----------|----------|
| First Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | High | <ul><li>[ ] #30</li></ul> |
| Second Requirement | Example requirement description | [DataHelix](https://github.com/finos/datahelix) | Medium | <ul><li>[x] #30</li></ul> |
| Third Requirement | Example requirement description | [DataHub](https://github.com/finos/datahub) | Low | <ul><li>[ ] #30</li></ul> |
Renders as below ...
The following is a table of DataHelix or DataHub prioritised project requirements, including where the issues have been raised on project backlogs.

| Requirement | Description | Project | Priority | GitHub Issue Created |
|---|---|---|---|---|
| First Requirement | Example requirement description | DataHub | High | [ ] #30 |
| Second Requirement | Example requirement description | DataHelix | Medium | [x] #30 |
| Third Requirement | Example requirement description | DataHub | Low | [ ] #30 |

| Requirement | Description | Priority | GitHub Issue |
|---|---|---|---|
| | Ability to look at some given data, maybe production data, and observe the profile, shape and characteristics of that data, which can then be used to generate “lifelike” synthetic data | | |
| | Ability to review the observed profile, shape and characteristics of the data, so that sensitive enumerations, shapes, profiles, names, etc. can be removed or modified | | |
| | Characteristics of the data that would be good to observe | | |
| | Whether there are any relationships between the columns (combinations of valid enumerations, values related to enumerations) | | |
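
As a rough illustration of the observation requirements above, here is a minimal Python sketch that summarises one column and counts which value pairs of two columns actually occur together. The function names, the `max_categories` threshold and the sample rows are all hypothetical, not part of DataHub or DataHelix, and the discrete/continuous split is only an assumed heuristic.

```python
from collections import Counter


def profile_column(values, max_categories=20):
    """Summarise one column; None marks a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    distinct = set(present)
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    # Assumed heuristic: a numeric column with many distinct values is
    # treated as continuous, anything else as discrete.
    kind = "continuous" if numeric and len(distinct) > max_categories else "discrete"
    return {
        "volume": len(values),
        "missing_ratio": (len(values) - len(present)) / len(values) if values else 0.0,
        "distinct": len(distinct),
        "kind": kind,
    }


def observed_combinations(rows, col_a, col_b):
    """Count the (col_a, col_b) value pairs that occur in the observed rows."""
    return Counter((row[col_a], row[col_b]) for row in rows)


# Made-up sample rows, purely for illustration.
rows = [
    {"currency": "USD", "country": "US"},
    {"currency": "GBP", "country": "UK"},
    {"currency": "USD", "country": "US"},
]
combos = observed_combinations(rows, "currency", "country")
amount_profile = profile_column([1.0, None, 2.5, 2.5])
```

A profile like this is plain data, so it could be reviewed and edited (for example, removing sensitive enumerations) before anything is generated from it.
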

| Requirement | Description | Priority | GitHub Issue |
|---|---|---|---|
| | Components with the ability to add code to generate data | | |
| | Ability for that code to access business logic, so it can generate data which matches given rules (maybe even sharing business logic with code from the system?) | | |

@grovesy and @andrewcarrblue,
Thanks for meeting earlier and thanks for the contribution above @andrewcarrblue 👍
I've taken the original bullets and produced the MD table above, using your bullets as the Description and adding the additional columns for you and @grovesy to work through and prioritise.
The Requirement column is a short title for each item, for quick reference, if you're able to add one?
James.
@grovesy, @andrewcarrblue, @kimmyyoo and @BenFielding
On Monday 27th July we have the next DataHub and DataHelix standup. Do you have any requirements or issues you'd like to discuss so they can be added to the agenda?
Keep in mind, our initial focus is to prioritise industrial requirements so we can move into continuous development between both projects.
Many thanks 🤜 💥 🚀
James.
To help us have a common language, it would be great for us to agree on terminology and use cases:

- Types of data
- Type of event data
- Use Cases
- Attributes

Additional project requirements from Gensyn perspective (highly representative synthetic data).

| Requirement | Description | GitHub Issue Created |
|---|---|---|
| Rules to pick generative models | Discover the statistical properties, particularly data volume, that determine which generative models can re-create the source data most effectively. Key properties: distribution, sparsity, discrete vs continuous, and volume. | |
| Similarity analysis | How closely does the generated data resemble the real data, does this present a privacy risk, and can this be quantified? A common approach is Distance to Closest Record using Euclidean distance. Here. | |
| Missing value specification | Are missing values in a column missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR)? Important when using generative models and listwise deletion. | |

This also comes alongside an additional use case that we'd add to the above from @andrewcarrblue, which is:
This would be used by a machine learning researcher/developer/engineer to perform exploratory data analysis (EDA) and to build a prototype model, before deploying over a federated learning infrastructure (e.g. as provided by Gensyn).
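
To make the Similarity analysis requirement above a little more concrete, here is a minimal sketch of Distance to Closest Record (DCR) with Euclidean distance, using only the Python standard library. It assumes every record is a tuple of numeric features on comparable scales; the function name and the sample points are made up for illustration.

```python
import math


def distance_to_closest_record(synthetic, real):
    """For each synthetic row, return the Euclidean distance to its nearest real row."""
    return [min(math.dist(s, r) for r in real) for s in synthetic]


# Made-up numeric records, purely for illustration.
real = [(0.0, 0.0), (3.0, 4.0)]
synthetic = [(0.0, 0.0), (3.0, 0.0)]

dcr = distance_to_closest_record(synthetic, real)
```

A distance of zero means a synthetic row exactly duplicates a real record, which is the kind of case a privacy review would want to flag. This brute-force version compares every pair of rows, so larger datasets would likely need a nearest-neighbour index instead.
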
Thanks for adding the requirements @BenFielding.
I'm now adding to a markdown document and will link the PR here.
James.
@andrewcarrblue and @BenFielding,
I've taken the following comments and have created #39 to help groom and prioritise the requirements before taking them into development.
Please give feedback on the PR and documents directly, to help keep this issue manageable as it grows.
cc @andrewcarrblue and @grovesy
Description
The DataHub and DataHelix projects would like to create a central area of requirements for Synthetic Data Generation. This issue has been created to define the requirements for a central workspace.
Success Criteria