The DataHelix generator allows you to quickly create data for testing and validation, based on a JSON profile that defines fields and the relationships between them.
A naive solution would parse the input data flexibly and then discard the original format, but that undermines the faithfulness of the generated data if, e.g., input dates are expressed as 23/11/2013 but our sample data has 2013-11-23.
Dates are relatively simple to solve, since it is usually easy to deduce the format unambiguously from a string. Numbers might be more awkward, since you might need to examine multiple values to derive the full formatting rules (e.g. if the fractional component is not of fixed length). Some possibilities would be especially painful (e.g. if the input has a fixed number of significant figures).
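As a rough illustration of why numbers need several samples, the sketch below infers the range of fractional digits from a set of input strings and formats a new value to match. The function names `infer_fraction_digits` and `format_like_samples` are hypothetical and not part of DataHelix, and the trailing-zero rule is an assumption about how variable-length input should be imitated. It does not attempt the significant-figures case.

```python
from decimal import Decimal


def infer_fraction_digits(samples):
    """Return (min_digits, max_digits) of fractional digits seen across samples."""
    counts = []
    for text in samples:
        # Decimal keeps trailing zeros, so "1.50" reports two fractional digits.
        exponent = Decimal(text).as_tuple().exponent
        counts.append(max(0, -exponent))
    return min(counts), max(counts)


def format_like_samples(value, samples):
    """Format value using the fractional precision observed in samples."""
    lo, hi = infer_fraction_digits(samples)
    text = f"{value:.{hi}f}"
    if lo < hi and "." in text:
        # Assumption: variable-length input means trailing zeros are dropped,
        # but never below the shortest fractional width that was observed.
        whole, frac = text.split(".")
        frac = frac.rstrip("0").ljust(lo, "0")
        text = whole + ("." + frac if frac else "")
    return text
```

A single sample such as "1.50" cannot distinguish a fixed two-digit rule from a variable-length one, which is why the sketch looks at the whole set before deciding.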
What if formats vary within the input? Should we output them in a comparable distribution, or just choose the most common or most recent format?
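The two options in that question could be sketched as follows, assuming each input value has already been tagged with the format it was written in. The helper names are hypothetical, and the format strings are only example labels.

```python
import random
from collections import Counter


def most_common_format(observed_formats):
    """Option 1: always emit the format seen most often in the input."""
    return Counter(observed_formats).most_common(1)[0][0]


def sample_format(observed_formats, rng):
    """Option 2: pick a format with probability proportional to its frequency."""
    counts = Counter(observed_formats)
    formats = list(counts)
    weights = [counts[fmt] for fmt in formats]
    return rng.choices(formats, weights=weights, k=1)[0]


# Example labels: three values used dd/MM/yyyy, one used yyyy-MM-dd.
observed = ["%d/%m/%Y", "%d/%m/%Y", "%Y-%m-%d", "%d/%m/%Y"]
chosen = sample_format(observed, random.Random(0))
```

Choosing the most recent format instead would need the input order (or a timestamp) to be preserved, which this sketch does not model.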
Priority: Medium-High - Temporal fields are likely to be prevalent in the data sources used by people considering the Profiler at MVP stage.
Complexity: Unknown - How does this differ from #59?
Conceptually there will be a complex element around the different date/time formats that can be emitted, e.g. dd/MM/yyyy vs. MM/dd/yyyy where every dd <= 12. For example, is 10/11/2001 the 10th Nov 2001 or the 11th Oct 2001?
What happens when dates are described in different locales? Different ordering, delimiters and month names can be used (e.g. 1 Janvier 2019).
We need to consider the locale of the computer on which the generator is running, but also have a configurable element to vary the known locale for the Profiler given the provided data.
What should happen for shortened dates, e.g. 1/1/19? (It could represent 1 Jan 2019, or be something else.)
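One possible way to handle the dd/MM/yyyy versus MM/dd/yyyy ambiguity is to test each candidate format against every value in a field and keep only the formats that parse all of them. The sketch below uses Python's `datetime.strptime`. The candidate list and the function name `formats_matching_all` are assumptions made for illustration, and locale-specific month names and two-digit years are not covered.

```python
from datetime import datetime

# Hypothetical candidate list; a real profiler would need many more formats.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]


def formats_matching_all(samples, candidates=CANDIDATE_FORMATS):
    """Return the candidate formats that successfully parse every sample."""
    matching = []
    for fmt in candidates:
        try:
            for text in samples:
                datetime.strptime(text, fmt)
        except ValueError:
            # At least one sample does not fit this format, so rule it out.
            continue
        matching.append(fmt)
    return matching


# A single value with day <= 12 stays ambiguous.
ambiguous = formats_matching_all(["10/11/2001"])

# A second value with day > 12 settles the question for the whole field.
resolved = formats_matching_all(["10/11/2001", "23/11/2013"])
```

If more than one format survives, the field is still ambiguous and something else (a configured locale, or a user hint) has to break the tie. Once a single format is known, generated dates can be written back with `strftime` using that same format string.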