Derive numeric/datetime formats during profiling

NeilMiles commented 6 years ago

A naive solution would parse the input data flexibly and then discard the original format, but that undermines the faithfulness of the data if, e.g., input dates are expressed as 23/11/2013 but our sample data has 2013-11-23.

Dates are relatively simple to solve, since it's usually easy to deduce the format unambiguously from a single string. Numbers might be more awkward, since you might need to examine multiple values to derive the full formatting rules (e.g., if the fractional component is not of fixed length). Some possibilities would be especially painful (e.g., if the input has a fixed number of significant figures).
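To illustrate why numbers need more than one sample, here is a minimal Python sketch that derives a fractional-length rule from several values. The function names (`decimal_places`, `derive_fraction_rule`, `format_like`) are hypothetical and not part of the Profiler; it assumes plain decimal strings with `.` as the separator, and does not attempt thousands separators, exponents or significant figures.

```python
def decimal_places(text: str) -> int:
    """Count the digits after the decimal point in a plain decimal string."""
    _, _, fraction = text.strip().partition(".")
    return len(fraction)


def derive_fraction_rule(samples: list[str]) -> tuple[int, int]:
    """Return the (minimum, maximum) number of decimal places seen in the samples."""
    places = [decimal_places(s) for s in samples]
    return min(places), max(places)


def format_like(value: float, rule: tuple[int, int]) -> str:
    """Format value with at most rule[1] decimals, keeping at least rule[0]."""
    lo, hi = rule
    text = f"{value:.{hi}f}"
    if "." in text:
        whole, fraction = text.split(".")
        # Strip trailing zeros, but never below the minimum seen in the input.
        fraction = fraction.rstrip("0").ljust(lo, "0")
        text = f"{whole}.{fraction}" if fraction else whole
    return text


# A single sample such as "1.50" cannot tell fixed-length from variable-length;
# the rule only becomes clear once several values have been examined.
fixed_rule = derive_fraction_rule(["1.50", "2.25", "10.00"])
variable_rule = derive_fraction_rule(["1.5", "2.25", "3"])
print(fixed_rule, format_like(3.0, fixed_rule))
print(variable_rule, format_like(3.0, variable_rule))
```

With a fixed-length rule the trailing zeros are preserved on output; with a variable-length rule they are dropped, which is one plausible reading of "non-fixed-length fractional components".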

What if formats vary within a field? Should we output formats in a comparable distribution, or just choose the most common or most recent one?
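The two options can be sketched as follows in Python, assuming the profiler has already recorded one format string per input value. The helper names (`format_distribution`, `most_common_format`, `sample_format`) are made up for illustration; "most recent" is not covered, since it would need ordering information that this sketch does not model.

```python
import random
from collections import Counter


def format_distribution(observed_formats):
    """Map each observed format to its share of the input values."""
    counts = Counter(observed_formats)
    total = sum(counts.values())
    return {fmt: n / total for fmt, n in counts.items()}


def most_common_format(observed_formats):
    """Option 1: always emit the single most frequent format."""
    return Counter(observed_formats).most_common(1)[0][0]


def sample_format(distribution, rng):
    """Option 2: pick a format per output value, weighted by its input share."""
    formats = list(distribution)
    weights = [distribution[fmt] for fmt in formats]
    return rng.choices(formats, weights=weights, k=1)[0]


# Hypothetical profile: three values in dd/MM/yyyy style, one in ISO style.
observed = ["%d/%m/%Y", "%d/%m/%Y", "%d/%m/%Y", "%Y-%m-%d"]
distribution = format_distribution(observed)
print(distribution)
print(most_common_format(observed))
print(sample_format(distribution, random.Random(0)))
```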

ghost commented 5 years ago

As at 19/12/2018:

Potential duplicate of #59
Priority: Medium-High - Temporal fields are likely to be prevalent in the data sources of people considering the Profiler at MVP stage.
Complexity: Unknown - How does this differ from #59?
1. Conceptually, there is a complex element around the different date/time formats that can be emitted, e.g. dd/MM/yyyy vs. MM/dd/yyyy where every dd value is <= 12. Is 10/11/2001 the 10th Nov 2001 or the 11th Oct 2001?
2. What happens when dates are described in different locales? Different ordering, delimiters and month names can be used (e.g. 1 Janvier 2019).
3. We need to consider the locale of the computer on which the generator is running, but also have a configurable element to vary the known locale for the Profiler given the provided data.
4. What should happen for shortened dates, e.g. 1/1/19? (It could represent 1 Jan 2019, or be something else.)
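
A rough Python sketch of the ambiguity in points 1 and 4: try a fixed list of candidate formats against every value in a field and keep only those that parse all of them. The candidate list and the `consistent_formats` name are assumptions for illustration, and locale-specific month names (point 2) are not handled here.

```python
from datetime import datetime

# Hypothetical candidate list; a real profiler would need a much larger one.
CANDIDATE_FORMATS = ["%d/%m/%Y", "%m/%d/%Y", "%Y-%m-%d", "%d/%m/%y", "%m/%d/%y"]


def consistent_formats(values, candidates=CANDIDATE_FORMATS):
    """Return the candidate formats that successfully parse every value."""
    remaining = []
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            # At least one value does not fit this format, so rule it out.
            continue
        remaining.append(fmt)
    return remaining


# All day values <= 12: both orderings survive, so the field is ambiguous.
print(consistent_formats(["10/11/2001", "05/06/2002"]))
# A single value with a day > 12 is enough to settle the ordering.
print(consistent_formats(["10/11/2001", "23/11/2013"]))
# Shortened dates remain ambiguous in the same way.
print(consistent_formats(["1/1/19"]))
```

When more than one format survives, the choice would have to come from somewhere else, such as the configurable locale mentioned in point 3.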

ghost commented 5 years ago

Superseded by https://github.com/ScottLogic/Data-Engineering-Profiler/issues/3

finos / datahelix

Derive numeric/datetime formats during profiling #65