BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense

Feature Request: be able to specify sample-size as a percentage #257

Closed jqnatividad closed 2 years ago

jqnatividad commented 3 years ago

If sample-size is between 0 and 1 exclusive, it will be treated as a percentage of the total rowcount of the CSV.

So if I want my sample to be 20% of the CSV: xsv sample 0.20 file.csv -o output.csv

Otherwise, if sample-size >= 1, it will be treated as a rowcount, like before.
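To make the proposed rule concrete, here is a minimal sketch of how the argument could be interpreted. It is not code from xsv: the names `SampleSize`, `parse_sample_size`, and `resolve_rows` are hypothetical, and treating exactly 1 as a row count, rounding to the nearest row, and clamping to the total are all assumptions.

```rust
/// Hypothetical representation of the sample-size argument.
#[derive(Debug, PartialEq)]
enum SampleSize {
    /// A value strictly between 0 and 1: a fraction of the total row count.
    Fraction(f64),
    /// A whole number >= 1: an absolute row count, as before.
    Rows(u64),
}

/// Interpret the raw argument according to the proposed rule.
fn parse_sample_size(arg: &str) -> Result<SampleSize, String> {
    let value: f64 = arg
        .trim()
        .parse()
        .map_err(|_| format!("invalid sample size: {}", arg))?;
    if value > 0.0 && value < 1.0 {
        Ok(SampleSize::Fraction(value))
    } else if value >= 1.0 && value.fract() == 0.0 {
        // Assumption: exactly 1 means "one row", not "100%".
        Ok(SampleSize::Rows(value as u64))
    } else {
        // Rejects 0, negatives, NaN, infinity, and values like 2.5.
        Err(format!(
            "sample size must be a fraction in (0, 1) or a whole number >= 1: {}",
            arg
        ))
    }
}

/// Turn a sample size into a concrete number of rows.
/// The total row count must already be known for the fraction case.
fn resolve_rows(size: &SampleSize, total_rows: u64) -> u64 {
    match size {
        // Assumption: round to the nearest whole row.
        SampleSize::Fraction(p) => (*p * total_rows as f64).round() as u64,
        // Assumption: never ask for more rows than exist.
        SampleSize::Rows(n) => (*n).min(total_rows),
    }
}
```

Under these assumptions, `parse_sample_size("0.20")` yields `SampleSize::Fraction(0.2)`, which `resolve_rows` turns into 200 rows for a 1000-row file.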

BurntSushi commented 3 years ago

Row count isn't known ahead of time, so this either would require an index to exist or two scans of the data.
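For context on why a fixed row count needs no total while a percentage does: a fixed-size sample can be drawn in a single pass with reservoir sampling (Algorithm R), which never needs to know how many rows are coming. The sketch below is only an illustration of that standard technique, not xsv's actual code. `XorShift64` is a toy generator included so the example has no dependencies, and it must be seeded with a nonzero value.

```rust
/// Toy xorshift64 generator (assumption: good enough for a demo only).
/// The seed must be nonzero, otherwise it only ever produces zero.
struct XorShift64(u64);

impl XorShift64 {
    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Roughly uniform value in 0..bound (modulo bias ignored for brevity).
    fn below(&mut self, bound: u64) -> u64 {
        self.next_u64() % bound
    }
}

/// Algorithm R: keep `k` rows from a stream of unknown length in one pass.
fn reservoir_sample<I, T>(rows: I, k: usize, rng: &mut XorShift64) -> Vec<T>
where
    I: Iterator<Item = T>,
{
    let mut reservoir: Vec<T> = Vec::with_capacity(k);
    for (i, row) in rows.enumerate() {
        if i < k {
            // The first k rows fill the reservoir directly.
            reservoir.push(row);
        } else {
            // Row i (0-based) replaces a random slot with probability k / (i + 1).
            let j = rng.below(i as u64 + 1) as usize;
            if j < k {
                reservoir[j] = row;
            }
        }
    }
    reservoir
}
```

Here `k` has to be fixed before the first row is read. With a percentage, `k` depends on the total row count, which is only available from an index or from a first full scan.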

jqnatividad commented 3 years ago

Thanks for the very quick response! And yes, this parameter will be ignored if an index doesn't exist.

As sample already does a <10% check when an index exists, I thought it'd be convenient for the user to specify a percentage for the sample size.

BurntSushi commented 3 years ago

> As sample already does a <10% check when an index exists

Yes, but that's strictly an optimization. What you're asking for here has UX consequences.

> And yes, this parameter will be ignored if an index doesn't exist.

I absolutely would never do that. It's really bad UX to allow the user to supply a parameter and have the tool silently ignore it. It's a silent failure mode. I think these are our choices:

  1. Don't add the flag.
  2. Return an error if the flag is used and no index is present.
  3. Automatically create the index if the flag is used and no index is present.
  4. Do two passes over the data if the flag is provided and no index is present.

Of those options, I think only (3) is one I wouldn't want to do. There is no precedent for it elsewhere in the tool. So I'd rather preserve the property that the index is only created when the user requests it.

(2) is the simplest to implement.

(4) is better from a "just do what I tell you to do" perspective, but will likely silently be slower than what the user might expect. (Which probably only matters for large data.) (4) would also fail if the input is a stream (or else it would have to buffer the stream in memory).
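The difference between options (2) and (4) can be sketched as follows. This is a hypothetical illustration, not xsv's implementation: `NoIndexPolicy`, `target_rows`, and the `count_by_scanning` callback are invented names, and the callback returning `None` stands in for input that cannot be read twice, such as a stream.

```rust
/// Hypothetical choice of behavior when a fraction is given without an index.
#[derive(Debug, PartialEq)]
enum NoIndexPolicy {
    /// Option (2): refuse to run.
    Error,
    /// Option (4): count rows with an extra pass over the data.
    TwoPass,
}

/// Compute how many rows to sample for a fractional sample size.
///
/// `indexed_count` is the row count from an index, if one exists.
/// `count_by_scanning` performs a full scan and returns the row count,
/// or `None` if the input cannot be scanned twice (e.g. stdin).
fn target_rows<F>(
    fraction: f64,
    indexed_count: Option<u64>,
    policy: NoIndexPolicy,
    count_by_scanning: F,
) -> Result<u64, String>
where
    F: FnOnce() -> Option<u64>,
{
    let total = match (indexed_count, policy) {
        // An index gives the row count cheaply; no policy decision needed.
        (Some(n), _) => n,
        // Option (2): fail loudly instead of silently ignoring the argument.
        (None, NoIndexPolicy::Error) => {
            return Err("a fractional sample size requires an index".to_string());
        }
        // Option (4): pay for a first pass, which is impossible on a stream.
        (None, NoIndexPolicy::TwoPass) => count_by_scanning()
            .ok_or_else(|| "cannot scan the input twice".to_string())?,
    };
    // Assumption: round to the nearest whole row.
    Ok((fraction * total as f64).round() as u64)
}
```

Neither branch silently drops the user's argument: the `Error` policy surfaces the missing index immediately, while `TwoPass` trades speed for convenience and still has to fail on non-seekable input.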

jqnatividad commented 3 years ago

My vote goes to (2). Not only is it simplest to implement, it sticks with best practice as xsv has become an important part of data-wrangling toolchains.

jqnatividad commented 3 years ago

I did an initial implementation, as my project requires it:

https://github.com/jqnatividad/xsv/commit/0e075a6ee14a1c883e769d28af31653b81d3f34d#diff-7e31ffcd9a51a550eee0bc9e6cb43821e91847b1476be465b563f5adea6418e6

Right now, I always create an index before using any other xsv operation, so I'm not doing an index check.

Will submit a proper PR when time allows.

jqnatividad commented 2 years ago

Closing as qsv now does this using the 2nd approach.