FINRAOS / DataGenerator

DataGenerator is a Java library for systematically producing large volumes of data. DataGenerator frames data production as a modeling problem, with a user providing a model of dependencies among variables and the library traversing the model to produce relevant data sets.
http://finraos.github.io/DataGenerator
Apache License 2.0

Adding statistic feature #174

Open yukaReal opened 9 years ago

yukaReal commented 9 years ago

During the hackathon I worked on a statistics feature. It can record the min / max / ... value of a given property.

What do you think, do we need it in DG?

I'm thinking of something like:

<assign name="var_1" expr="#{Yuka}" statistic="max" />
<assign name="var_2" expr="#{Buka}" statistic="min,average" />

With this feature, DG could not only generate data but also solve some problems!

What do you think?

If you like this, what features / syntax / modularity would you propose?

P.S. During the hackathon I only implemented the max calculation for one specific property, so I can't just push it as is. It still needs custom tags, min / ... support, unit tests, docs and examples.

mibrahim commented 9 years ago

Good idea. I would recommend implementing it as an incremental statistics plugin.

All you need to keep track of is sum of X, sum of X^2 and the count of points.

From those you can compute, at any time, the mean = (sum of X) / count and the variance = (sum of X^2) / count - mean^2. You can also merge the statistics from different mappers by adding up the three accumulators, in case the job was executed on multiple nodes. Note that this variance is the biased one, i.e. computed by dividing by n instead of n - 1.
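A minimal Java sketch of that accumulator, to make the idea concrete. The class name `RunningStats` and its methods are hypothetical and not part of DataGenerator. The min and max fields are an extra assumption to cover the statistics from the original proposal, since they cannot be derived from the three sums.

```java
/** Hypothetical accumulator for illustration; not part of DataGenerator's API. */
final class RunningStats {
    private long count;
    private double sum;            // sum of X
    private double sumOfSquares;   // sum of X^2
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;

    /** Records one generated value. */
    void add(double x) {
        count++;
        sum += x;
        sumOfSquares += x * x;
        min = Math.min(min, x);
        max = Math.max(max, x);
    }

    /** Folds in statistics collected elsewhere, e.g. by another mapper. */
    void merge(RunningStats other) {
        count += other.count;
        sum += other.sum;
        sumOfSquares += other.sumOfSquares;
        min = Math.min(min, other.min);
        max = Math.max(max, other.max);
    }

    long count() { return count; }
    double min() { return min; }
    double max() { return max; }

    /** Mean = (sum of X) / count; NaN when nothing was recorded. */
    double mean() { return sum / count; }

    /** Biased variance = (sum of X^2) / count - mean^2 (divides by n, not n - 1). */
    double variance() {
        double m = mean();
        return sumOfSquares / count - m * m;
    }
}
```

Each mapper would keep its own `RunningStats`, and the reducer would call `merge` on them. One caveat: the sum-of-squares formula can lose precision when the mean is large compared to the spread, so a production version might prefer Welford's update, which can also be merged.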

mibrahim commented 9 years ago

BTW, Dropwizard Metrics (previously Yammer Metrics) implements lots of stream-like statistics: https://dropwizard.github.io/metrics/3.1.0/ We might not want to take it as a dependency, but it has streaming implementations of the median, p99, p999, etc. that we can most likely reuse for this purpose. Look at its histogram ( https://dropwizard.github.io/metrics/3.1.0/getting-started/#histograms ).

A sample histogram result:

    "MyHistogram": {
      "type": "histogram",
      "count": 2041364,
      "min": 0,
      "max": 65202,
      "mean": 132.21681287609658,
      "std_dev": 444.84831733762644,
      "median": 44,
      "p75": 118,
      "p95": 457.39999999999964,
      "p98": 922.8599999999986,
      "p99": 1145.4000000000015,
      "p999": 1715.594
    }
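If the dependency is not wanted, approximate streaming percentiles can be sketched with a fixed-size uniform sample (reservoir sampling, Algorithm R). This is only an illustration of the general technique, not code taken from Dropwizard Metrics, and the class name `ReservoirSample` and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Hypothetical fixed-size uniform sample of a stream; not Dropwizard's implementation. */
final class ReservoirSample {
    private final double[] values;
    private final Random random;
    private long seen;   // total number of values offered so far

    ReservoirSample(int capacity, long seed) {
        this.values = new double[capacity];
        this.random = new Random(seed);
    }

    /** Offers one value; memory use stays bounded by the capacity. */
    void add(double x) {
        seen++;
        if (seen <= values.length) {
            values[(int) (seen - 1)] = x;          // still filling the reservoir
        } else {
            // Algorithm R: the new value replaces a random slot
            // with probability capacity / seen.
            long slot = (long) (random.nextDouble() * seen);
            if (slot < values.length) {
                values[(int) slot] = x;
            }
        }
    }

    /** Nearest-rank quantile of the sampled values, for q in [0, 1]. */
    double quantile(double q) {
        int size = (int) Math.min(seen, values.length);
        if (size == 0) {
            throw new IllegalStateException("no values recorded");
        }
        double[] sorted = Arrays.copyOf(values, size);
        Arrays.sort(sorted);
        int index = (int) Math.ceil(q * size) - 1;
        return sorted[Math.max(0, Math.min(size - 1, index))];
    }
}
```

While fewer values than the capacity have been seen, the quantiles are exact. After that they are estimates whose accuracy depends on the reservoir size. Unlike the sums used for mean and variance, two reservoirs cannot be merged by simple addition, which matters if the statistics are collected on multiple nodes.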

mibrahim commented 9 years ago

Need to discuss further to define the scope.