finos / datahelix

The DataHelix generator allows you to quickly create data, based on a JSON profile that defines fields and the relationships between them, for the purpose of testing and validation
https://finos.github.io/datahelix/
Apache License 2.0

Derive numeric/datetime formats during profiling #65

Closed NeilMiles closed 5 years ago

NeilMiles commented 6 years ago

A naive solution would parse the input data flexibly and then discard the original format, but that undermines the faithfulness of the generated data if, e.g., input dates are expressed as 23/11/2013 but our sample data has 2013-11-23.

Dates are relatively simple to solve, since it's usually easy to deduce the format unambiguously from a string. Numbers might be more awkward, since you might need to examine multiple cases to derive the full formatting rules (e.g., if there are non-fixed-length fractional components). Some possibilities would be especially painful (e.g., if the input has a fixed number of significant figures).
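To make the idea concrete, here is a minimal Python sketch, not the profiler's actual implementation. It assumes a small hypothetical list of candidate date formats (`CANDIDATE_DATE_FORMATS`) and keeps only those that parse every sample; for numbers it assumes plain decimal strings and reports the range of fractional digits seen, which is why several samples are needed. The function names are illustrative only.

```python
from datetime import datetime

# Hypothetical candidate list; a real profiler would need a broader set.
CANDIDATE_DATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d"]


def infer_date_formats(samples):
    """Return the candidate formats that successfully parse every sample."""
    matches = []
    for fmt in CANDIDATE_DATE_FORMATS:
        try:
            for sample in samples:
                datetime.strptime(sample, fmt)
        except ValueError:
            # At least one sample does not fit this format; rule it out.
            continue
        matches.append(fmt)
    return matches


def infer_decimal_places(samples):
    """Return (min, max) number of fractional digits across non-empty samples.

    Assumes plain decimal strings such as "3" or "2.25" (no exponents,
    no thousands separators).
    """
    counts = [len(sample.partition(".")[2]) for sample in samples]
    return min(counts), max(counts)
```

A single value such as 23/11/2013 leaves only one candidate, whereas 01/02/2013 matches both day-first and month-first formats, so more samples are needed to disambiguate. Likewise, differing minimum and maximum decimal places indicate a non-fixed-length fractional component.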

What if formats vary? Should we output formats in a comparable distribution, or just choose the most common or most recent one?
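The two options can be sketched as follows, assuming the profiler has already recorded one detected format string per input value. Both function names are hypothetical, and this is only an illustration of the trade-off, not a proposed design.

```python
import random
from collections import Counter


def most_common_format(observed_formats):
    """Pick the single format seen most often in the input."""
    return Counter(observed_formats).most_common(1)[0][0]


def sample_format(observed_formats, rng=random):
    """Pick a format at random, weighted by how often it was observed."""
    counts = Counter(observed_formats)
    formats = list(counts)
    weights = [counts[fmt] for fmt in formats]
    return rng.choices(formats, weights=weights, k=1)[0]
```

Calling `sample_format` once per generated value would reproduce roughly the same mix of formats as the input, while `most_common_format` gives uniform output at the cost of dropping minority formats.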

ghost commented 5 years ago

As at 19/12/2018:

ghost commented 5 years ago

Superseded by https://github.com/ScottLogic/Data-Engineering-Profiler/issues/3