Closed jqnatividad closed 2 years ago
After looking at the code, this partially explains the issue: https://github.com/BurntSushi/xsv/blob/3de6c04269a7d315f7e9864b9013451cd9580a08/src/cmd/sample.rs#L17-L19
Basically, it short-circuits the seed parameter.
I ran a bigger sample (more than 10% of the rows) using the same tsv file (xsv sample --seed 42 5000000 file.tsv -o output2.csv), and I can confirm it's now reproducible, with and without an index.
However, why does it ignore the seed parameter only when an index is present and the sample size is less than 10%? Shouldn't the seed always take precedence over the <10% sample size check?
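To illustrate what "seed takes precedence" could look like, here is a minimal Python sketch, not xsv's actual Rust code. It assumes the behavior described above: a random-access path when an index exists and the sample is under 10% of the rows, and a one-pass reservoir path otherwise. The function names (reservoir_sample, indexed_sample, sample) are hypothetical. The point is that both paths draw from the same seeded generator, so the choice of strategy affects speed but not reproducibility.

```python
import random


def reservoir_sample(rows, k, rng):
    """Pick k rows from a stream in a single pass (classic reservoir sampling)."""
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


def indexed_sample(row_count, get_row, k, rng):
    """Pick k distinct row positions, then fetch only those rows by random access."""
    positions = sorted(rng.sample(range(row_count), k))
    return [get_row(p) for p in positions]


def sample(rows, k, seed=None, indexed=False):
    """Choose a strategy, but always draw randomness from the same seeded generator."""
    rng = random.Random(seed)  # seed=None falls back to system entropy
    # Assumed threshold: random access only pays off for small samples (< 10%).
    if indexed and k < len(rows) / 10:
        return indexed_sample(len(rows), rows.__getitem__, k, rng)
    return reservoir_sample(iter(rows), k, rng)
```

With this structure, sample(rows, 1000, seed=42, indexed=True) returns the same rows on every run. The two strategies still produce different samples from each other for the same seed, so a sample taken with an index would not necessarily match one taken without it.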
I have a somewhat sparse tsv file with 45M rows and 24 columns, about 4.3 GB.
When I run
xsv sample --seed 42 1000 file.tsv -o output.csv
without an index, it takes about 15 seconds and produces a reproducible sample. However, when I create an index with
xsv index file.tsv
(takes about 12 seconds, producing a 350 MB IDX file) and run a sample using the same seed, it is fast (2 seconds) but produces a different sample on each run, as if I hadn't specified a seed.
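For anyone who wants to check this on their own file, here is a small Python sketch that runs a command twice and compares hashes of its output. The helper names (output_digest, is_reproducible) are made up for this example, and it assumes the command writes the sample to stdout, i.e. that the -o flag is left off.

```python
import hashlib
import subprocess


def output_digest(cmd):
    """Run cmd (a list of arguments) and return the SHA-256 of its stdout."""
    result = subprocess.run(cmd, capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def is_reproducible(cmd, runs=2):
    """Return True if every run of cmd produces byte-identical output."""
    digests = {output_digest(cmd) for _ in range(runs)}
    return len(digests) == 1


# Example call (assumes xsv is on PATH and file.tsv exists):
# is_reproducible(["xsv", "sample", "--seed", "42", "1000", "file.tsv"])
```

Running the example call once before and once after xsv index file.tsv should show whether the seed is honored in both cases.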