BurntSushi / xsv

A fast CSV command line toolkit written in Rust.
The Unlicense
10.35k stars 322 forks source link

Wrong results with frequency #49

Open patricebellan opened 7 years ago

patricebellan commented 7 years ago

Hi,

I've been playing with frequency a bit, but got odd results, and here is what I found.

Take this simple csv file

alpha,num
"a",1
"b",2
"b",1
"a",2

run xsv frequency -s alpha,num

result: field,value,count alpha,b,2 alpha,a,2 num,2,2 num,1,2

then run xsv frequency -s num,alpha

result: field,value,count num,a,2 num,b,2 alpha,1,2 alpha,2,2

I get the correct result only if I put the fields in the same order as the file header

hmalphettes commented 7 years ago

I am also getting some odd results on a small csv file with a single column and forcing the number of jobs to 1.

Header
""
a
b
c
d
e
f
g
h
i
j
k

Running a few times xsv frequency -s Header t3.csv -j 1 | sort gives me different results with 1 record missing. Not always the same record. That happens both with 0.12.2 as installed by Homebrew and with a debug build of the current tip of the source code with rustc-1.19.0

I am attaching a screenshot of my terminal with those ad-hoc tests. First run we are missing the value (NULL). Second run we are missing the value e.

frequency-results

hmalphettes commented 7 years ago

When I debug the Frequencies I do see the proper cardinality of values and content.

// https://github.com/BurntSushi/xsv/blob/master/src/cmd/frequency.rs#L89
for (i, (header, ftab)) in head_ftables.enumerate() {
  println!("{:?}: {:?}", &ftab, &ftab.cardinality());
  let mut header = header.to_vec();
  ...

does print the expected hash-map and proper cardinality:

{[107]: 1, [102]: 1, [106]: 1, [97]: 1, [103]: 1, [105]: 1, []: 1, [98]: 1, [104]: 1, [100]: 1, [101]: 1, [99]: 1}: 12

It seems counts.into_iter() (https://github.com/BurntSushi/xsv/blob/master/src/cmd/frequency.rs#L124) is where we loose some items.

I am a real newbie at rust and did not spend enough time to grasp what is going on there. I hope this helps.

sonicdoe commented 3 years ago

I’ve also been surprised about some missing items when using frequency until I found out about its --limit option. By default, this option is set to 10 so you’ll never see more than 10 items.

Of course, this doesn’t explain all of the issues mentioned above but is hopefully helpful for some.

jawj commented 10 months ago

@sonicdoe Thanks — just got tripped up by this myself.

@BurntSushi IMHO if --limit is going to eliminate some categories by default, it absolutely should not do so silently. For example, it could print + 12 equally-or-less frequent values with total count 12345678 after the table. If we then need another switch to suppress that message, so be it.