Closed elazarl closed 9 years ago
Now that I think of it, getting the nth bucket to the 100/n percentile would be a better idea.
Interesting thoughts. A logarithmic scale definitely seems like a useful addition. Percentiles wouldn't be useful for bucketing, but they do seem like a useful way to trim a dataset (i.e. --max=95% to cut out the top 5%, or --min=95% if you only wanted to see that 5% of outliers).
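To make the trimming idea concrete, here is a minimal sketch in Python. It assumes a nearest-rank percentile and inclusive cut-offs; the function names `percentile_value` and `trim_by_percentile` are hypothetical, and the `--max=95%` syntax is only the suggestion from this comment, not something the thread shows histogram.py supporting.

```python
import math


def percentile_value(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    if not sorted_values:
        raise ValueError("no samples")
    # Rank of the smallest value with at least pct% of samples at or below it.
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def trim_by_percentile(values, min_pct=0, max_pct=100):
    """Keep values between the min_pct and max_pct percentiles (inclusive).

    Hypothetical helper: mirrors the proposed --min=95% / --max=95% idea.
    """
    ordered = sorted(values)
    low = percentile_value(ordered, min_pct) if min_pct > 0 else ordered[0]
    high = percentile_value(ordered, max_pct)
    return [v for v in values if low <= v <= high]


samples = list(range(1, 101))                    # 1..100
body = trim_by_percentile(samples, max_pct=95)   # cuts out the top 5%
tail = trim_by_percentile(samples, min_pct=95)   # keeps only the upper tail
```

Because both cut-offs are inclusive, the value sitting exactly on the 95th percentile appears in both `body` and `tail`; a real option would need to pick one side.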
Typically in your situation I drill in by adjusting min/max/buckets, so something like --min=200 --max=1000 --buckets=20 for your dataset:
$ for x in {1..900}; do echo $(($(($RANDOM % 700))+250)) ; done | histogram.py -m 200 -x 1000 -b 20
# NumSamples = 900; Min = 200.00; Max = 1000.00
# Mean = 602.067778; Variance = 39393.514295; SD = 198.477994; Median 601.500000
# each ∎ represents a count of 1
200.0000 - 240.0000 [ 0]:
240.0000 - 280.0000 [ 28]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
280.0000 - 320.0000 [ 49]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
320.0000 - 360.0000 [ 59]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
360.0000 - 400.0000 [ 44]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
400.0000 - 440.0000 [ 60]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
440.0000 - 480.0000 [ 49]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
480.0000 - 520.0000 [ 51]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
520.0000 - 560.0000 [ 53]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
560.0000 - 600.0000 [ 55]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
600.0000 - 640.0000 [ 59]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
640.0000 - 680.0000 [ 55]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
680.0000 - 720.0000 [ 42]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
720.0000 - 760.0000 [ 57]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
760.0000 - 800.0000 [ 48]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
800.0000 - 840.0000 [ 53]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
840.0000 - 880.0000 [ 48]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
880.0000 - 920.0000 [ 52]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
920.0000 - 960.0000 [ 38]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
960.0000 - 1000.0000 [ 0]:
Just FTR, here is your uniform distribution with logarithmic scale:
→ for x in {1..900}; do echo $(($(($RANDOM % 700))+250)) ; done | ~/dev/data_hacks/data_hacks/histogram.py -l
# NumSamples = 900; Min = 250.00; Max = 949.00
# Mean = 607.512222; Variance = 39168.714295; SD = 197.910875; Median 608.500000
# each ∎ represents a count of 6
250.0000 - 250.6833 [ 3]:
250.6833 - 252.0499 [ 1]:
252.0499 - 254.7830 [ 2]:
254.7830 - 260.2493 [ 3]:
260.2493 - 271.1818 [ 8]: ∎
271.1818 - 293.0469 [ 25]: ∎∎∎∎
293.0469 - 336.7771 [ 46]: ∎∎∎∎∎∎∎
336.7771 - 424.2375 [ 117]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
424.2375 - 599.1584 [ 232]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
599.1584 - 949.0000 [ 463]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
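The bucket edges in this output are consistent with bucket widths that double at every step. The formula below is inferred from the printed numbers, not taken from the histogram.py source; the symbols $m$ (Min), $M$ (Max), $n$ (number of buckets), $w_1$ (width of the first bucket) and $e_k$ (the k-th edge) are introduced here only for illustration.

```latex
\begin{align*}
w_1 &= \frac{M - m}{2^{n} - 1} = \frac{949 - 250}{2^{10} - 1} = \frac{699}{1023} \approx 0.6833 \\
e_k &= m + w_1\,(2^{k} - 1), \qquad k = 0, 1, \dots, n \\
e_1 &= 250 + 0.6833 \cdot 1 \approx 250.6833 \\
e_2 &= 250 + 0.6833 \cdot 3 \approx 252.0499 \\
e_9 &= 250 + 0.6833 \cdot 511 \approx 599.1584 \\
e_{10} &= 250 + 0.6833 \cdot 1023 = 949
\end{align*}
```

This is why a uniform sample piles up in the last buckets: the final bucket alone spans about half of the whole range.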
And here are my outliers with logarithmic scale:
$ ~/a.out 10000|~/histogram.py -l -p
# NumSamples = 10000; Min = 6608.00; Max = 834100.00
# Mean = 7718.910500; Variance = 257240392.836090; SD = 16038.715436; Median 6675.000000
# each ∎ represents a count of 128
6608.0000 - 7416.8876 [ 9635]: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎ (96.35%)
7416.8876 - 9034.6628 [ 133]: ∎ (1.33%)
9034.6628 - 12270.2131 [ 13]: (0.13%)
12270.2131 - 18741.3138 [ 43]: (0.43%)
18741.3138 - 31683.5152 [ 133]: ∎ (1.33%)
31683.5152 - 57567.9179 [ 0]: (0.00%)
57567.9179 - 109336.7234 [ 29]: (0.29%)
109336.7234 - 212874.3343 [ 4]: (0.04%)
212874.3343 - 419949.5562 [ 6]: (0.06%)
419949.5562 - 834100.0000 [ 4]: (0.04%)
Note that >95% of the results are now in a much narrower bucket.
awesome!
When I have many outliers, I often get histograms like:
Not helpful. I can see that I have outliers, but what does the distribution inside the first bucket look like? It is the most important one, and I want to understand what's there.
What I want is a logarithmic histogram, like the one dtrace shows: double the bucket width at every bucket.
Can I send a PR?
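A rough sketch of what such a doubling-width histogram could look like in Python. This is not the code from any PR or from histogram.py; `doubling_edges` and `log_histogram` are hypothetical names, and treating each bucket as including its upper edge is an assumption.

```python
from bisect import bisect_left


def doubling_edges(lo, hi, buckets):
    """Return buckets + 1 edges from lo to hi, each bucket twice as wide as the last."""
    if buckets < 1 or hi <= lo:
        raise ValueError("need at least one bucket and hi > lo")
    # Widths w, 2w, 4w, ... sum to w * (2**buckets - 1), which must equal hi - lo.
    first_width = (hi - lo) / (2 ** buckets - 1)
    edges = [lo + first_width * (2 ** k - 1) for k in range(buckets + 1)]
    edges[-1] = hi  # pin the last edge so rounding cannot exclude the maximum
    return edges


def log_histogram(values, buckets=10):
    """Count values into doubling-width buckets spanning min(values)..max(values)."""
    edges = doubling_edges(min(values), max(values), buckets)
    counts = [0] * buckets
    for v in values:
        # Assumption: bucket i covers (edges[i], edges[i + 1]]; the minimum
        # itself is placed in bucket 0.
        counts[bisect_left(edges, v, 1) - 1] += 1
    return edges, counts


edges, counts = log_histogram([1, 2, 3, 4], buckets=2)
# edges are [1.0, 2.0, 4]: the second bucket is twice as wide as the first
```

The narrow early buckets give the detail near the minimum, where most samples sit when the data has a long tail of outliers, while a few wide buckets still cover the tail.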