benkeen / generatedata

A powerful, feature-rich, random test data generator.
https://generatedata.com
2.22k stars 611 forks source link

Normal distribution standard deviation does not seem correct #374

Open joelhoro opened 8 years ago

joelhoro commented 8 years ago

When running 'Normal distribution' with an average of 5000 and a standard deviation of 1000 I get a distribution of data that is far off. I measure a sample stdev of ~ 260 on a sample size of 100. Happy to fix the analytics - couldn't find where it is in the codebase...

conradhagemans commented 8 years ago

Hi Joel, Any help is apreciated. You can find the code: /ROOT FOLDER/plugins/dataTypes/NormalDistribution/

TMSiefers commented 8 years ago

conradhagemans,

I believe the problem is in the script for the file NormalDistribution.class.php.

joelhoro's stdev is approximately 1/4th the requested value. I used your online generator to create 10 sets with a 100 points each and the stddev is approximately 1/4th the requested values for each of those as well (requested mean of 1.5 with stddev of 0.3). Looking in the script for the file listed above, there is a line at the very end that commands:

return $this->gauss() * ($stddev / 4) + $mean;

There are also preceding notes that state

// Adjust our gaussian random to fit the mean and standard deviation // The division by 4 is an arbitrary value to help fit the distribution // within our required range, and gives a best fit for $stddev=1.0

I'm not familiar with this particular language nor am I using the appropriate software to read the script, but I'm pretty sure this "arbitrary" reduction by 1/4th is the problem.

Admittedly, it took an embarassingly long time to consider calculating the standard deviation of the exported data to find out why it wasn't matching the limiting distribution. None-the-less, thanks for creating the data generator. Hopefully you get the bugs worked out quickly.

Kind regards, Tim Siefers

benkeen commented 8 years ago

Hey @TMSiefers, thanks for the info, and sorry for the bug in the code. Not being a mathematician, when I wrote this plugin I relied on someone else's code for the generation part - it would appear that that wasn't correct (or very possibly, I misinterpreted it somehow).

I've just been wading through some wikipedia articles on this. Yikes, my math is rusty.

I'd love to fix this, but I think I need a layman-friendly version of how the generation code should actually work. I tried just removing the / 4 part you mentioned but the generated data with a mean of 0 and a S.D. of 1 is no longer between the range of -1 to 1 (which is incorrect, no?).

joelhoro commented 8 years ago

Well if stdev is 1 that means that roughly 68% of the population lies between mean-stdev and mean+stdev. Stdev is by no means a measure of the maximum by which samples can deviate from the mean, its a measure of how much they deviate from the mean 'on average'.

Clearly the 1/4 should go away. On Apr 16, 2016 14:27, "Benjamin Keen" notifications@github.com wrote:

Hey @TMSiefers https://github.com/TMSiefers, thanks for the info, and sorry for the bug in the code. Not being a mathematician, when I wrote this plugin I relied on someone else's code for the generation part - it would appear that that wasn't correct (or very possibly, I misinterpreted it somehow).

I've just been wading through some wikipedia articles on this. Yikes, my math is rusty.

I'd love to fix this, but I think I need a layman-friendly version of how the generation code should actually work. I tried just removing the / 4 part you mentioned but the generated data with a mean of 0 and a S.D. of 1 is no longer between the range of -1 to 1 (which is incorrect, no?).

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/benkeen/generatedata/issues/374#issuecomment-210868739

joelhoro commented 8 years ago

Btw you can try for yourself in excel, just write the formula =normsinv(Rand()), and copy it 100 times. If you then ask for stdev on the entire range it will be roughly 1 (and you'll notice many values are outside of [-1,1] On Apr 16, 2016 15:21, "Joel Horowitz" joel@horowitz.com wrote:

Well if stdev is 1 that means that roughly 68% of the population lies between mean-stdev and mean+stdev. Stdev is by no means a measure of the maximum by which samples can deviate from the mean, its a measure of how much they deviate from the mean 'on average'.

Clearly the 1/4 should go away. On Apr 16, 2016 14:27, "Benjamin Keen" notifications@github.com wrote:

Hey @TMSiefers https://github.com/TMSiefers, thanks for the info, and sorry for the bug in the code. Not being a mathematician, when I wrote this plugin I relied on someone else's code for the generation part - it would appear that that wasn't correct (or very possibly, I misinterpreted it somehow).

I've just been wading through some wikipedia articles on this. Yikes, my math is rusty.

I'd love to fix this, but I think I need a layman-friendly version of how the generation code should actually work. I tried just removing the / 4 part you mentioned but the generated data with a mean of 0 and a S.D. of 1 is no longer between the range of -1 to 1 (which is incorrect, no?).

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/benkeen/generatedata/issues/374#issuecomment-210868739

TMSiefers commented 8 years ago

benkeen,

Could you show how gauss() is generated (if data) or the script (if a function)? If I knew more about the source perhaps I could offer a solution.

benkeen commented 8 years ago

Sorry for the wait - that'd be great! I'll post it here tonight.

benkeen commented 8 years ago

Hey @TMSiefers, here's how the logic goes. Let me know if I'm not clear on anything - it's kind of dense to even describe.

Step 1: compute “gauss” (the note in the code says it’s the “Polar form of the Box-Muller transformation”)

At this point, A will contain a random value <1.0 or 0.0.

Step 2: the rest of the math

^^ "F" is the final value that gets outputted.

TMSiefers commented 8 years ago

@benkeen Your second bullet implies a while loop for which the result upon exit should be both that of A<1.0 AND A not-equal-to 0.0. However, my interpretation seems to be at at odds with your statement starting with "At this point A will..." Am I missing something?

Please advise, Tim S

joelhoro commented 8 years ago

I don't know much about php but there shouldn't be a need to implement your own things. This is all very standard and according to this page you can just use this function

stats_rand_gen_normal http://php.net/manual/en/function.stats-rand-gen-normal.php

2016-04-28 16:58 GMT-04:00 TMSiefers notifications@github.com:

@benkeen https://github.com/benkeen Your second bullet implies a while loop for which the result upon exit should be both that of A<1.0 AND A not-equal-to 0.0. However, my interpretation seems to be at at odds with your statement starting with "At this point A will..." Am I missing something?

Please advise, Tim S

— You are receiving this because you authored the thread. Reply to this email directly or view it on GitHub https://github.com/benkeen/generatedata/issues/374#issuecomment-215560115

benkeen commented 8 years ago

Hey guys -

Tim, sorry about that. You're quite right, I'll update the "At this point" bit.

there shouldn't be a need to implement your own things.

:) That looks very promising, but it's a PECL function. To my knowledge, PECL doesn't come bundled with standard PHP distributions, so I wouldn't want to rely on it. Basically I don't want to add any additional steps for users to have to configure their server environment just to run this script - a little too prohibitive for some folk!

TMSiefers commented 8 years ago

@benkeen So according to the Wikipedia article on the Box-Muller transformation, all you need to do is drop the /4 in your code. The notes indicate the one-quarter reduction was necessary to "fit the distribution within our required range." I don't know why that was necessary, but you won't have data that matches the users' requirements until you get rid of that. I have not been able to connect the mathematical dots between the Gaussian limiting distribution and the transformation you're using, but I can roughly see the matching relationship between the two; so I'll assume the Wikipedia information is correct. Wikipedia article: https://en.wikipedia.org/wiki/Box%E2%80%93Muller_transform

Regards, Tim S

TMSiefers commented 8 years ago

@benkeen Also, you wrote that you take log(A) in your code, that should be for a base of e, so it must be the natural-log. Hopefully that is how the algorithm is written. (I usually assume log(A) to use a base of 10 by default).