brayniac / histogram

rust histogram and percentile stats
Apache License 2.0
46 stars 4 forks source link

Added decrement functionality, with tests. #14

Closed wictory closed 8 years ago

wictory commented 8 years ago

I added fn decrement(), and fn derecord() function to undo increments and records to the histogram. I also added som tests to show that the functions works as intended. Unfortunately, the tests needs to pull in rand for more senseful tests.

I need this functionality to keep track of window distributions of time series.

brayniac commented 8 years ago

I have objections to rand being pulled in, especially for tests. What are you trying to accomplish in these tests? Would just testing specific cases be sufficient?

record() and derecord() are now nearly copy-pastes of the same code with some slight changes. Can we reduce the duplication here without changing the API?

I'd also like to know a bit more about your use case. Typically I've seen histograms being latched and computed over time intervals. Eg, histogram is created, data is inserted until the duration is up, the metrics are calculated, and the histogram is cleared again for the next window. Does a similar approach work for your use? Or do you need a sliding window?

wictory commented 8 years ago

Hi Brian!

I understand your concern. I guess the tests could be run on some more specific cases, I have to see exactly what does would be. The good thing with using a prng is that there is a very small probability that there is a pattern that happens to work.

One way to get rid of fn derecord() would be to let fn record() take an i64 and based on the sign decide if fn saturation_add() or fn saturation_sub() should be performed. Another way would be to let fn record() take a fourth argument telling the incrementation or decrementation should be performed. What do you think?

My use case is that I want to keep track of a distribution of a window of a time series. This have be done online and if the histogram is cleared after each window there is a unwanted cold period. This means that I need a sliding window.

brayniac commented 8 years ago

I'd rather we test with fixed cases - not random data. The good news is that it seems we just need to verify the behavior of decreasing the count of a bucket with some count or no count. And with values both inside and outside of the storable range for the histogram. Let me know if you believe I've overlooked something here.

Have you looked at https://crates.io/crates/heatmap ? It's a time-series of Histograms which has similar API to Histogram, but takes a timestamp. If you're dealing with streaming data, perhaps Heatmap can be extended to use a wheel of Histograms rather than just a linear vector? It seems like the decision would depend on if you are looking for fixed time dimension or fixed count of samples. Heatmap takes the fixed time dimension approach.

wictory commented 8 years ago

Yes, you are right. Maybe this much testing is not even needed. I wrote the tests to convince myself of that there would not be some weird things would start to happen outside of the linear region, since it is possible to increment on one value x in the non-linear region and decrement on another value y in its neighborhood and get the same result as decrementing on the same value x. This is a property which should be reflected at least in the documentation, but maybe also in the API.

Heatmap is not the same thing. In my case the time is implicit, and the "current" histogram is moving along in time. Heatmap would be interesting for storing the evolution of the histograms, but it is not feasible space-wise in this use case.

brayniac commented 8 years ago

See my inline notes on the code itself. +1 on making the behavior in the non-linear region clear in the docs.

I'm curious to know more about what you're doing with this. I've been pondering writing some other metrics libraries. Knowing what other people are doing with this stuff is always a help.

wictory commented 8 years ago

So in one case I'm looking on I have a sequence or process of execution times, which are self similar. People often use linear predictors for predicting stochastic processes similar to these, but when the question is about execution times, I'm interested in the probability that the next execution time is above (or below) some value. So, maybe my estimate for the next execution time is a value which I "know" that is below the next execution time with probability 99.9%, for this histograms are perfect. Now if I have a long running process, and the long term behaviour of the execution time is slowly changing with time I don't want to keep track of the execution time from the beginning of the process, so therefore I need a window.

brayniac commented 8 years ago

@wictory - whenever you're ready for final review, just squash the commits into 1 meaningful one. I'll be happy to merge if everything is looking good.

wictory commented 8 years ago

It's ready for final review now. I added a comment on the first page of the documentation about the precision of the values.

brayniac commented 8 years ago

@wictory - new version is published to crates.io now. Thanks for contributing!