bitly / data_hacks

Command line utilities for data analysis
http://github.com/bitly/data_hacks

histogram.py switch for logarithmic buckets #24

Closed (elazarl closed this issue 9 years ago)

elazarl commented 9 years ago

When I have many outliers, I often get histograms like this:

$ time (./a.out 100000|histogram.py -b 10)
# NumSamples = 100000; Min = 237.00; Max = 37599.00
# Mean = 321.560610; Variance = 64719.622326; SD = 254.400516; Median 303.000000
# each ∎ represents a count of 1333
  237.0000 -  3973.2000 [ 99993]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
 3973.2000 -  7709.4000 [     0]: 
 7709.4000 - 11445.6000 [     1]: 
11445.6000 - 15181.8000 [     0]: 
15181.8000 - 18918.0000 [     0]: 
18918.0000 - 22654.2000 [     0]: 
22654.2000 - 26390.4000 [     3]: 
26390.4000 - 30126.6000 [     1]: 
30126.6000 - 33862.8000 [     0]: 
33862.8000 - 37599.0000 [     2]: 

Not helpful. I can see that I have outliers, but what does the distribution look like inside the first bucket? It is the most important one, and I want to understand what's there.

What I want is a logarithmic histogram, like dtrace shows: the bucket width doubles with every bucket.
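For illustration, a minimal sketch of how doubling-width bucket boundaries could be computed. The log_buckets name is mine and the exact scheme (splitting the range into 2**count - 1 equal units) is an assumption, though the boundaries it produces line up with the -l output shown further down:

# Sketch: bucket boundaries where each bucket spans twice the width of
# the previous one, covering [min_v, max_v] with `count` buckets.
def log_buckets(min_v, max_v, count):
    # with doubling widths the total span splits into 2**count - 1 equal units;
    # bucket i (0-based) ends after 2**(i + 1) - 1 of those units
    unit = (max_v - min_v) / float(2 ** count - 1)
    return [min_v + unit * (2 ** (i + 1) - 1) for i in range(count)]

print(log_buckets(250.0, 949.0, 10))  # matches the -l boundaries below (250.68, 252.05, ...)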

Can I send a PR?

elazarl commented 9 years ago

Now that I think of it, aligning the nth bucket with the 100/n percentile might be a better idea.
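One way to read that idea is equal-population buckets. A rough sketch of my interpretation (nothing like this exists in histogram.py): place the upper boundary of bucket i at the i/count quantile of the data, so every bucket holds roughly the same number of samples.

# Sketch (assumption): equal-population buckets, with the boundary of the
# i-th bucket at the i/count quantile of the sorted data.
def percentile_buckets(values, count):
    data = sorted(values)
    n = len(data)
    return [data[min(n - 1, n * i // count)] for i in range(1, count + 1)]

print(percentile_buckets(range(1000), 10))  # [100, 200, ..., 900, 999]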

jehiah commented 9 years ago

Interesting thoughts. A logarithmic scale definitely seems like a useful addition. Percentiles wouldn't be useful for bucketing, but they do seem like a useful way to trim a dataset (i.e. --max=95% to cut out the top 5%, or --min=95% if you wanted to see only that 5% of outliers).
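A minimal sketch of that kind of trimming. The --max=95% flag does not exist, and cut_above is a hypothetical helper shown only to illustrate the idea:

# Sketch: drop everything above the 95th percentile before histogramming.
# Hypothetical pre-processing step, not an existing histogram.py option.
def cut_above(values, pct=95.0):
    data = sorted(values)
    cutoff = data[min(len(data) - 1, int(len(data) * pct / 100.0))]
    return [v for v in values if v <= cutoff]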

Typically in a situation like yours I drill in by adjusting min/max/buckets, so for your dataset something like --min=200 --max=1000 --buckets=20:

$ for x in  {1..900}; do echo $(($(($RANDOM % 700))+250)) ; done | histogram.py -m 200 -x 1000 -b 20
# NumSamples = 900; Min = 200.00; Max = 1000.00
# Mean = 602.067778; Variance = 39393.514295; SD = 198.477994; Median 601.500000
# each ∎ represents a count of 1
  200.0000 -   240.0000 [     0]: 
  240.0000 -   280.0000 [    28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  280.0000 -   320.0000 [    49]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  320.0000 -   360.0000 [    59]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  360.0000 -   400.0000 [    44]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  400.0000 -   440.0000 [    60]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  440.0000 -   480.0000 [    49]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  480.0000 -   520.0000 [    51]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  520.0000 -   560.0000 [    53]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  560.0000 -   600.0000 [    55]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  600.0000 -   640.0000 [    59]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  640.0000 -   680.0000 [    55]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  680.0000 -   720.0000 [    42]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  720.0000 -   760.0000 [    57]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  760.0000 -   800.0000 [    48]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  800.0000 -   840.0000 [    53]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  840.0000 -   880.0000 [    48]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  880.0000 -   920.0000 [    52]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  920.0000 -   960.0000 [    38]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  960.0000 -  1000.0000 [     0]: 
elazarl commented 9 years ago

Just FTR, here is your uniform distribution with a logarithmic scale:

→ for x in  {1..900}; do echo $(($(($RANDOM % 700))+250)) ; done | ~/dev/data_hacks/data_hacks/histogram.py -l
# NumSamples = 900; Min = 250.00; Max = 949.00
# Mean = 607.512222; Variance = 39168.714295; SD = 197.910875; Median 608.500000
# each ∎ represents a count of 6
  250.0000 -   250.6833 [     3]: 
  250.6833 -   252.0499 [     1]: 
  252.0499 -   254.7830 [     2]: 
  254.7830 -   260.2493 [     3]: 
  260.2493 -   271.1818 [     8]: ∎
  271.1818 -   293.0469 [    25]: ∎∎∎∎
  293.0469 -   336.7771 [    46]: ∎∎∎∎∎∎∎
  336.7771 -   424.2375 [   117]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  424.2375 -   599.1584 [   232]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  599.1584 -   949.0000 [   463]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎

And here are my outliers with a logarithmic scale:

$ ~/a.out 10000|~/histogram.py -l -p
# NumSamples = 10000; Min = 6608.00; Max = 834100.00
# Mean = 7718.910500; Variance = 257240392.836090; SD = 16038.715436; Median 6675.000000
# each ∎ represents a count of 128
 6608.0000 -  7416.8876 [  9635]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (96.35%)
 7416.8876 -  9034.6628 [   133]: ∎ (1.33%)
 9034.6628 - 12270.2131 [    13]:  (0.13%)
12270.2131 - 18741.3138 [    43]:  (0.43%)
18741.3138 - 31683.5152 [   133]: ∎ (1.33%)
31683.5152 - 57567.9179 [     0]:  (0.00%)
57567.9179 - 109336.7234 [    29]:  (0.29%)
109336.7234 - 212874.3343 [     4]:  (0.04%)
212874.3343 - 419949.5562 [     6]:  (0.06%)
419949.5562 - 834100.0000 [     4]:  (0.04%)

Note that >95% of the results now fall in a much narrower bucket.

jehiah commented 9 years ago

awesome!