Closed by jqnatividad 2 years ago
Row count isn't known ahead of time, so this either would require an index to exist or two scans of the data.
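To illustrate the two-scan variant, here is a minimal Python sketch (not xsv's actual code, which is written in Rust). The function name `sample_fraction_two_pass` and the `open_csv` callable are hypothetical, and the rounding of the sample size is an assumption. The first pass counts the rows and the second pass picks them, which is why the input has to be re-readable:

```python
import csv
import io
import random


def sample_fraction_two_pass(open_csv, fraction, seed=None):
    """Sample `fraction` of the data rows using two scans of the input.

    `open_csv` is a hypothetical zero-argument callable that returns a
    fresh text stream on every call, so the input must be re-readable.
    """
    # Pass 1: count the data rows (header excluded).
    with open_csv() as f:
        reader = csv.reader(f)
        header = next(reader)
        row_count = sum(1 for _ in reader)

    # Assumption: round the fractional size to the nearest whole row.
    k = round(fraction * row_count)
    rng = random.Random(seed)
    chosen = set(rng.sample(range(row_count), k))

    # Pass 2: keep only the rows whose position was chosen.
    with open_csv() as f:
        reader = csv.reader(f)
        next(reader)  # skip the header
        rows = [row for i, row in enumerate(reader) if i in chosen]
    return header, rows


data = "id\n" + "".join(f"{i}\n" for i in range(10))
header, rows = sample_fraction_two_pass(lambda: io.StringIO(data), 0.2, seed=1)
print(header, len(rows))
```

A one-shot stream such as stdin cannot be opened twice, so this approach would have to buffer it first. With an index, the row count is available without the first scan.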
Thanks for the very quick response! And yes, this parameter will be ignored if an index doesn't exist.
As sample already does a <10% check when an index exists, I thought it'd be convenient for the user to specify a percentage for the sample size.
> As sample already does a <10% check when an index exists
Yes, but that's strictly an optimization. What you're asking for here has UX consequences.
> And yes, this parameter will be ignored if an index doesn't exist.
I absolutely would never do that. It's really bad UX to allow the user to supply a parameter and have the tool silently ignore it. It's a silent failure mode. I think these are our choices:
Of those options, I think only (3) is one I wouldn't want to do. There is no precedent for it elsewhere in the tool. So I'd rather preserve the property that the index is only created when the user requests it.
(2) is the simplest to implement.
(4) is better from a "just do what I tell you to do" perspective, but will likely silently be slower than what the user might expect. (Which probably only matters for large data.) (4) would also fail if the input is a stream (or else it would have to buffer the stream in memory).
My vote goes to (2). Not only is it simplest to implement, it sticks with best practice as xsv has become an important part of data-wrangling toolchains.
I did an initial implementation, as my project requires it.
Right now, I always create an index before using any other xsv operation, so I'm not doing an index check.
Will submit a proper PR when time allows.
Closing as qsv now does this using the 2nd approach.
If sample-size is between 0 and 1 (exclusive), it is treated as a fraction of the total row count of the CSV.
For example, to make the sample 20% of the CSV:
xsv sample 0.20 file.csv -o output.csv
Otherwise, if sample-size > 1, it will be treated as a rowcount, like before.
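The rule above could be sketched roughly like this in Python (an illustration only, not the xsv or qsv source). The name `resolve_sample_size` is hypothetical. The rounding, the handling of a value of exactly 1, and the choice to raise an error when the row count is unavailable instead of ignoring the argument are all assumptions:

```python
from typing import Optional


def resolve_sample_size(sample_size: float, row_count: Optional[int]) -> int:
    """Turn the sample-size argument into a number of rows to sample.

    `row_count` is None when the total is not known up front
    (for example, when no index is available).
    """
    if 0 < sample_size < 1:
        if row_count is None:
            # Assumption: fail loudly rather than silently ignore the value.
            raise ValueError("a fractional sample size needs a known row count")
        # Assumption: round to the nearest whole row.
        return round(sample_size * row_count)
    if sample_size >= 1:
        # Treated as a plain row count, as before.
        # Assumption: exactly 1 means "one row".
        return int(sample_size)
    raise ValueError("sample size must be positive")


print(resolve_sample_size(0.20, 1000))  # 20% of 1000 rows
print(resolve_sample_size(50, 1000))    # plain row count
```

Raising on a missing row count keeps the failure visible, in line with the concern raised earlier in the thread about silently ignored parameters.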