clarity20 / tql

Terse Query Language

Implement comprehensive column-name inference rules #30

Open clarity20 opened 5 years ago

clarity20 commented 5 years ago

The most general question we could try to address is this: when and how can we reasonably guess the column name for a given datum or list of data?

Right now, column names are inferred using two configurable settings, defaultAlnumCols and defaultNumCols: the former guesses the column name for alphabetic values and the latter for integer values. Each can be customized with upper and lower bounds on the string length of the values. This feature could be extended with similar configurability for floating-point values, date values, and so on.
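To make the current behavior concrete, here is a rough Python sketch of length-bounded default columns. It is not the actual tql implementation: the `DefaultColumn` class, the `infer_column` function, and the column names `id`, `code`, and `name` are all hypothetical, and it assumes each setting holds an ordered list of candidates.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DefaultColumn:
    """A candidate column name with bounds on the value's string length."""
    name: str
    min_len: int = 0
    max_len: Optional[int] = None  # None means no upper bound

    def accepts(self, value: str) -> bool:
        if len(value) < self.min_len:
            return False
        return self.max_len is None or len(value) <= self.max_len


# Hypothetical stand-ins for the defaultNumCols and defaultAlnumCols settings.
DEFAULT_NUM_COLS = [DefaultColumn("id", 1, 9)]
DEFAULT_ALNUM_COLS = [DefaultColumn("code", 1, 3), DefaultColumn("name", 4)]


def infer_column(value: str) -> Optional[str]:
    """Return the first candidate whose length bounds admit the value."""
    # All-digit values are treated as integers; everything else as alnum.
    candidates = DEFAULT_NUM_COLS if value.isdigit() else DEFAULT_ALNUM_COLS
    for col in candidates:
        if col.accepts(value):
            return col.name
    return None
```

Under these assumed settings, `infer_column("42")` gives `"id"`, `infer_column("NY")` gives `"code"`, and `infer_column("Smith")` gives `"name"`. Supporting floats or dates this way would mean adding one more setting and one more type test per kind of value.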

An even more comprehensive improvement would be to allow a prioritized list of inference rules. Each rule would consist of a regular expression to match the value (or value list) against, plus one or more column names to use if the match succeeds. The existing default numeric and alpha column names are actually special cases of this. The scheme is challenging to design, and it gets trickier still when we try to impose length bounds at the same time.
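A minimal sketch of what such a rule list might look like, written in Python purely for illustration. Everything here is an assumption rather than a design decision: the `InferenceRule` class, the `infer_columns` function, the patterns, and the column names are hypothetical, and "first rule that matches every value wins" is only one possible reading of "prioritized".

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InferenceRule:
    """A regex plus the column names to use when every value matches it."""
    pattern: re.Pattern
    columns: List[str]


# Hypothetical rules, highest priority first. The last two correspond to the
# existing numeric and alpha defaults expressed as ordinary rules.
RULES = [
    InferenceRule(re.compile(r"\d{4}-\d{2}-\d{2}"), ["created_date"]),
    InferenceRule(re.compile(r"[+-]?\d+\.\d+"), ["amount"]),
    InferenceRule(re.compile(r"\d+"), ["id"]),
    InferenceRule(re.compile(r"[A-Za-z][A-Za-z0-9]*"), ["name"]),
]


def infer_columns(values: List[str],
                  rules: List[InferenceRule] = RULES) -> Optional[List[str]]:
    """Return the columns of the first rule that fully matches all values."""
    if not values:
        return None
    for rule in rules:
        # fullmatch requires the whole value to match, not just a prefix.
        if all(rule.pattern.fullmatch(v) for v in values):
            return rule.columns
    return None
```

With these assumed rules, `infer_columns(["7", "42"])` returns `["id"]`, while a mixed list such as `["7", "Smith"]` matches no single rule and returns `None`. One way to handle length bounds in this scheme is to fold them into the pattern itself with a bounded quantifier such as `\d{1,9}`, though that makes the rules harder to read and configure.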

This is an open problem overall. Most importantly, the feature already works well for the most common use cases, namely data lookup by primary-key id or by name. As for extending the inference engine, I do not yet have a plan of attack that I'm convinced would yield the greatest range of correct outcomes without costing too much effort. So this will be filed as a nice-to-have.