With `qsv stats`, we collect descriptive statistics when we infer each column's data type during the Analysis phase of a DP+ job.

For example, using the benchmark data from qsv, based on a 1M-row, 512 MB, 41-column sample of NYC's 311 data, the command:
yields the file below in 0.27 seconds:
Adding the `--everything` and `--infer-dates` options... yields the file below in 103.89 seconds. That is more than two orders of magnitude slower (roughly 385x)!

while the command:
yields the file below in only 3.60 seconds. The only difference is that we didn't use the `--infer-dates` option, so date fields and their min/max values are treated as `String`s.

Clearly, `--infer-dates` is a very expensive operation, and understandably so, since qsv's date parser engine has to parse and recognize 15 different date formats, with each format having several permutations.

Currently, DP+ uses the `--infer-dates` option during its analysis phase, which is something I'd still like to keep, as it's very useful when it does infer that a column is a date field.

Perhaps we should only attempt to infer dates when a quick initial scan of the CSV headers suggests the presence of a date field (i.e. search for the presence of "date", "time", "timestamp", "datetime" anywhere in a column name)?
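
To make the header-scan idea concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions, not how DP+ currently works: the helper names `headers_suggest_dates` and `build_stats_args` are hypothetical, the keyword list is the one proposed above, and the argument list is simply assembled, not executed.

```python
import csv
import io

# Keywords from the proposal above; matching is case-insensitive.
DATE_HINTS = ("date", "time", "timestamp", "datetime")


def headers_suggest_dates(header_row, hints=DATE_HINTS):
    """Return the column names that contain any of the date-like hints."""
    return [
        name for name in header_row
        if any(hint in name.lower() for hint in hints)
    ]


def build_stats_args(csv_text, input_path):
    """Assemble a `qsv stats` argument list (hypothetical helper).

    `--infer-dates` is only added when the header scan finds a
    date-like column name; otherwise the cheaper invocation is used.
    """
    # Only the first row (the header) is read; the data rows are untouched.
    header_row = next(csv.reader(io.StringIO(csv_text)), [])
    args = ["qsv", "stats", "--everything"]
    if headers_suggest_dates(header_row):
        args.append("--infer-dates")
    args.append(input_path)
    return args
```

Because only the header row is read, the scan costs almost nothing compared with the stats run itself. The obvious trade-off is that a date column with an uninformative name (say, `col7`) would be missed and treated as a `String`.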