BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense

a seeded sample not working properly when an index is present #255

Closed jqnatividad closed 2 years ago

jqnatividad commented 3 years ago

I have a somewhat sparse TSV file with 45M rows and 24 columns, about 4.3 GB.

When I run xsv sample --seed 42 1000 file.tsv -o output.csv without an index, it takes about 15 seconds and produces a reproducible sample.

However, when I create an index (xsv index file.tsv, which takes about 12 seconds and produces a 350 MB IDX file) and run a sample using the same seed, it is fast (2 seconds) but produces a different sample on each run, as if I hadn't specified a seed.

jqnatividad commented 3 years ago

After looking at the code, this partially explains the issue: https://github.com/BurntSushi/xsv/blob/3de6c04269a7d315f7e9864b9013451cd9580a08/src/cmd/sample.rs#L17-L19

Basically, the seed parameter is being short-circuited.

I ran a bigger sample (more than 10% of the rows) using the same TSV file (xsv sample --seed 42 5000000 file.tsv -o output2.csv), and I can confirm it's now reproducible, with and without an index.

However, why is the seed parameter ignored only when an index is present and the sample size is less than 10% of the rows? Shouldn't the seed always take precedence over the <10% sample size check?
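
To make the question concrete, here is a minimal sketch of a dispatch in which the seed reaches both sampling paths. It is not xsv's code. Every name in it (XorShift64, use_random_access, sample_reservoir, sample_random_access, sample, entropy_seed) is hypothetical, the small PRNG only stands in for a real seedable generator so the example needs no external crates, and the "at most one tenth of the rows" rule is an assumption taken from the behaviour described above. It samples row numbers rather than CSV records.

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Tiny deterministic PRNG (xorshift64), a stand-in for a real seedable RNG.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start from an all-zero state.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }

    /// Value in 0..bound (slightly biased, which is fine for an illustration).
    fn below(&mut self, bound: usize) -> usize {
        (self.next_u64() % bound as u64) as usize
    }
}

/// Assumed size check: random access only when the sample is at most
/// one tenth of the total row count.
fn use_random_access(sample_size: usize, total: usize) -> bool {
    sample_size <= total / 10
}

/// Reservoir sampling (Algorithm R) over row numbers 0..total.
/// Visits every row once, so no index is needed.
fn sample_reservoir(total: usize, k: usize, rng: &mut XorShift64) -> Vec<usize> {
    let mut reservoir: Vec<usize> = (0..k.min(total)).collect();
    for row in k..total {
        let j = rng.below(row + 1);
        if j < k {
            reservoir[j] = row;
        }
    }
    reservoir
}

/// Random access: choose k distinct row numbers with a partial
/// Fisher-Yates shuffle. With an index, only these rows would be read.
fn sample_random_access(total: usize, k: usize, rng: &mut XorShift64) -> Vec<usize> {
    let k = k.min(total);
    let mut rows: Vec<usize> = (0..total).collect();
    for i in 0..k {
        let j = i + rng.below(total - i);
        rows.swap(i, j);
    }
    rows.truncate(k);
    rows
}

/// Fallback seed when the user gives none: different on each run.
fn entropy_seed() -> u64 {
    SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.as_nanos() as u64)
        .unwrap_or(1)
}

/// Hypothetical dispatch: the generator is built from the seed before the
/// size check, so both branches are reproducible when a seed is given.
fn sample(total: usize, k: usize, has_index: bool, seed: Option<u64>) -> Vec<usize> {
    let mut rng = XorShift64::new(seed.unwrap_or_else(entropy_seed));
    if has_index && use_random_access(k, total) {
        sample_random_access(total, k, &mut rng)
    } else {
        sample_reservoir(total, k, &mut rng)
    }
}
```

Under these assumptions, with roughly 45M rows the threshold is 4.5M rows, so a sample of 1000 takes the random-access branch while a sample of 5000000 falls back to the reservoir branch, which matches the two runs reported above. Calling sample(1_000, 50, true, Some(42)) twice returns the same rows. The indexed and non-indexed results for the same seed can still differ from each other, because the two algorithms consume the random stream differently.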