New Module Idea: Normal Distribution Simulation

ProfessorAmanda commented 2 years ago

Hi all,

This is just the beginning kernel of an idea, but if Wayne has time, perhaps he can start to mock this up while I scratch my head about all of the features I want to include.

The basic idea is to provide a playground for students to experiment with how the normal distribution describes datasets and determines probabilities. Students especially often struggle with the connection between the value on the axis and the area under the normal curve. For now, let's focus on the normal distribution, but really this could generalize to any continuous probability distribution.

Wayne, you will want to familiarize yourself with the normal distribution and with javascript's functions for drawing normal density functions, for finding the area under the curve given a value on the axis, and for finding a value on the axis given a probability (this would be an inverse normal probability distribution). I believe we use these in various modules already.
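
For orientation, the three calculations mentioned above (density, area to the left of a value, and the inverse) can be sketched in plain JavaScript. This is only an illustrative sketch, not the code the existing modules use: the names `erf`, `normalPdf`, `normalCdf`, and `normalQuantile` are hypothetical, the error function uses a standard polynomial approximation (Abramowitz and Stegun 7.1.26), and the inverse is found by simple bisection rather than a library routine.

```javascript
// Approximate error function (Abramowitz & Stegun 7.1.26, absolute error ~1.5e-7).
function erf(x) {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

// Height of the normal density curve at x.
function normalPdf(x, mean, sd) {
  const z = (x - mean) / sd;
  return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
}

// Area under the curve to the left of x, i.e. P(X <= x).
function normalCdf(x, mean, sd) {
  return 0.5 * (1 + erf((x - mean) / (sd * Math.SQRT2)));
}

// Value on the axis that has area p to its left (inverse of normalCdf),
// found by bisection. Not reliable for p extremely close to 0 or 1,
// because the erf approximation is only accurate to about 1e-7.
function normalQuantile(p, mean, sd) {
  if (!(p > 0 && p < 1)) throw new RangeError("p must be strictly between 0 and 1");
  let lo = mean - 10 * sd;
  let hi = mean + 10 * sd;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid, mean, sd) < p) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}
```

For example, `normalCdf(1, 0, 1)` is about 0.8413, and `1 - normalCdf(1, 0, 1)` gives the area to the right of 1.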

To learn about how the normal distribution works conceptually, check out Chapter 4 of OpenIntroStats: openintro-statistics.pdf

My sketch in progress: Simulation_NormalDistribution.docx

waynew99 commented 2 years ago

I've made a prototype of shading under the curve, and it's up on the NormalDistribution branch. It turned out to be fairly doable and we can surely build more stuff around it!

ProfessorAmanda commented 2 years ago

It's so great!! I already love how you can see the effect that changing the mean and standard deviation has on the area under the curve.

So now we can get serious about this. Fantastic!

Next steps:

[x] I think we only need one picture, i.e. combine the mean, standard deviation, and shading steps together as the first step the user performs.
[x] Put the choice of >= or <= inside this statement: P (x [selection] [value]). We can experiment with how this looks. I'm not totally sure what to do with the "value" part?
[x] The hardest thing will now be to report the area under the curve. Start looking into the normal distribution functions in javascript.

waynew99 commented 2 years ago

Got it! Thanks! I'll work on them and let you know when I make progress.

waynew99 commented 2 years ago

I've made some progress on the module. The changes can be checked out on the NormalDistribution branch:

The two graphs are combined into one.
The choices are put in one line inside the statement. However, it looks pretty rough. I'll keep experimenting and improving the aesthetics.
The area under the curve is reported. One bug to be fixed: the area rounds up to 100.00% when it's larger than 99.995%. I'll work on fixing this one.
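
One possible way to avoid the rounding issue described in the last item is to special-case values that sit just inside the boundaries before formatting. This is a sketch under assumptions, not necessarily how the bug was fixed on the branch; `formatAreaPercent` is a hypothetical helper name.

```javascript
// Format a probability (0 to 1) as a percentage string without letting
// values just below 100% (or just above 0%) round onto the boundary.
function formatAreaPercent(p, digits = 2) {
  const step = Math.pow(10, -digits); // smallest displayable increment, e.g. 0.01
  const pct = p * 100;
  if (pct > 0 && pct < step) {
    return "<" + step.toFixed(digits) + "%";
  }
  if (pct < 100 && pct > 100 - step) {
    return ">" + (100 - step).toFixed(digits) + "%";
  }
  return pct.toFixed(digits) + "%";
}
```

With this approach an area of 0.99996 is shown as ">99.99%" instead of "100.00%", and only an area of exactly 1 is shown as "100.00%".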

ProfessorAmanda commented 2 years ago

I think it looks great! The math seems right, as much as I was able to see from playing around with it. And I actually like how you have written the probability line for now. If you have an idea for a better way to represent it, let me know, but I think it is fine as is for now. Some notes on what I see so far:

[ ] Just given the way I teach this material, I'd rather have the probabilities reported as a decimal than as a percentage.
[ ] Here is some text for the Intro: The Normal Distribution is one of the most important probability distributions, because it describes a wide range of natural phenomena. You can uniquely identify any normal distribution with just its mean and standard deviation. This module demonstrates how the mean and standard deviation determine the probabilities calculated from the normal probability density function. It is also possible to test whether a given dataset follows a normal distribution by using a chi-square goodness-of-fit test.
[ ] Here is some text for the menu: The normal distribution can be described entirely by its mean and standard deviation. Many natural phenomena can be described by this distribution, and it is possible to test whether a given dataset follows a normal distribution.
[ ] It would be nice for the user to be able to enter numbers in the probabilities statement, rather than just scrolling to them. I'm attaching a screenshot below of the error I got when I clicked the backspace key to try to enter my own number.

I am so excited to be able to use this for teaching!

As a next step, I'd like to be able to have the user click a button and "draw" a set of observations from the distribution they have plotted above and show the user a table of those values and a dot plot of those values (similarly to how we plot them in, say, the law of large numbers or central limit theorem modules). The object of this is to further reinforce what is meant by a "distribution." You'll need to look into how javascript can generate a list of random numbers (and crib a little from what previous research assistants have done). This is a nice next step for you to look into, because we will also need this for the test of normality.
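
As a rough illustration of the "draw a set of observations" step, the sketch below uses the Box-Muller transform to turn uniform random numbers into normal ones. It is an assumption-laden example, not the approach used by earlier modules; `standardNormal` and `drawNormalSample` are hypothetical names, and the optional `rng` argument exists only so that a seeded generator can be swapped in.

```javascript
// Draw one standard normal value using the Box-Muller transform.
function standardNormal(rng = Math.random) {
  const u1 = 1 - rng(); // shift to (0, 1] so Math.log never receives 0
  const u2 = rng();
  return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
}

// Draw n observations from a normal distribution with the given
// mean and standard deviation.
function drawNormalSample(n, mean, sd, rng = Math.random) {
  return Array.from({ length: n }, () => mean + sd * standardNormal(rng));
}
```

Calling `drawNormalSample(50, 0, 1)` returns an array of 50 values that could feed both the table and the dot plot.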

Thanks and let me know if you have questions!

waynew99 commented 2 years ago

Got it! I've fixed the issues listed above, and I'm working on the next step of drawing observations.

ProfessorAmanda commented 2 years ago

Fantastic! Keep me posted.

waynew99 commented 2 years ago

The drawing samples part is up on the branch! Please let me know what you think. (Sorry! That took a bit longer than expected due to a hard-to-debug Highcharts usage.)

ProfessorAmanda commented 2 years ago

Hey! Sorry I missed this update. That definitely is working very well. I have a few comments just about formatting, to make it clear to the user that the drawing samples step is distinct.

[ ] Have the user press a button with the text "Experiment with Drawing Samples from This Distribution" to reveal the rest.
[ ] I think the sample size should be entered rather than a slider option...maybe copy how we've done this in other modules?
[ ] The "Draw Samples" button should be "Draw a Sample"

Let me know what you think! I'll get started on the next part.

waynew99 commented 2 years ago

I've made the changes :) Putting "Experiment with Drawing Samples from This Distribution" in a button made the button a bit too wide, so I've put it as a line of text. Is it fine? [Screenshot attached]

ProfessorAmanda commented 2 years ago

I love the way everything in the blue box looks! Those were great aesthetic choices.

Can I make a suggestion about where things are located on the page? Can we center the blue box and then center the data that is drawn below that blue box?

waynew99 commented 2 years ago

Of course! It's up on the branch and you can take a look.

ProfessorAmanda commented 2 years ago

Thanks! Looks great.

Oh, one question (I didn't notice this before): What is the rule that determines when an observation will be highlighted in the sample table? I thought it might be whatever was described in the probability rule, but it seems to be "two-tailed." Like, if the rule is P(x>1), I noticed that observations less than -1 were also highlighted in blue. I hadn't even thought of including the highlighted observations, so you could just remove the highlights, or make the highlights agree with the probability rule specified by the user.

waynew99 commented 2 years ago

Sure. I think the currently highlighted rows are the points that are drawn.

I realized that I was probably not generating the samples the way we wanted: on every "draw a sample" click, a brand new 100-point population is generated, and the desired number of points is selected from it. I think what we want instead is that when the mean and standard deviation change, we have a new population of points, and on every "draw a sample" click, we select samples from that same population. Is that correct? If so, should we keep highlighting the rows of the points that are drawn?

ProfessorAmanda commented 2 years ago

Hi Wayne, yes exactly, you want to drop a new set of points from a normal distribution with the selected mean and standard deviation. You shouldn't be highlighting points, because they all should be plotted. (There is no distinction between sample and population in this case).

ProfessorAmanda commented 2 years ago

Hi Wayne, also, here is the rest of the mockup for the Normal Distribution simulation, including the walkthrough of a goodness-of-fit test. It has many moving parts, so we can get started and then figure out where to go from there. Simulation_NormalDistribution.docx

waynew99 commented 2 years ago

Got it! Thanks a lot! Just to double check, what we want is to generate a set of points (the user defines the size) that has the defined mean and standard deviation, not to draw a set from a larger population. In this case, should all the generated points fall in the range defined by the above P(x>x1)? Or do they fall over the entire spectrum?

ProfessorAmanda commented 2 years ago

Yes exactly -- draw a set of points with that mean and standard deviation. (It's really still a sample from a theoretical "potential" population, if that makes sense). The points should fall from all over the spectrum, not just the range defined by the probability statement.

It might be nice to highlight the points that do fall in that range defined by the user, since the highlighted points should be the fraction of the points shown by the probability.
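
The highlighting idea can be sketched as follows: flag each drawn point according to the user's probability statement and report the highlighted fraction, which should land near the shaded area for a reasonably large sample. The function name `highlightedFraction` and the `">="` / `"<="` direction strings are assumptions made for this example only.

```javascript
// Flag the points that satisfy P(x >= value) or P(x <= value) and
// report what fraction of the sample is highlighted.
function highlightedFraction(sample, direction, value) {
  const inRange = direction === ">="
    ? (x) => x >= value
    : (x) => x <= value;
  const flagged = sample.map((x) => ({ x, highlighted: inRange(x) }));
  const count = flagged.filter((pt) => pt.highlighted).length;
  return { flagged, fraction: count / sample.length };
}
```

The `flagged` array could drive the row and dot colors, while `fraction` is the number to display next to the theoretical probability.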

waynew99 commented 2 years ago

Thanks! That makes sense. I've made the changes and pushed onto the branch. I'll get started on the next part.

waynew99 commented 2 years ago

Hi Amanda, I've pushed a basic prototype including everything up to dividing the histogram into bins. I'll keep working on the rest of it. Please let me know any changes you would like me to make! Thanks!

ProfessorAmanda commented 2 years ago

This looks like a good start!

I have one note on this: before dividing the data into bins, I'd like to show the points as dots, as you did above, so that the user can see the "raw" sample points.

waynew99 commented 2 years ago

I'm still working on the rest of the module. I've pushed some updates on displaying the raw points and letting the user decide the number of bins. I'm still working on the rest of it where we display the table, as the way Highcharts divides its data into bins is a bit tricky to work with.

ProfessorAmanda commented 2 years ago

Very cool! Maybe hold off on displaying the blue bins until the user inputs the number of bins they would like, but otherwise, proceed! I bet the visuals are a little tricky.

waynew99 commented 2 years ago

It took me some time to fix the little bugs, but I think I got it working! The table is now synced with the histogram, and also shows the expected frequency.

ProfessorAmanda commented 2 years ago

Amazing!! I think this is actually going to work. Here are some comments/questions:

[ ] The histogram and data should have the same x-axes, since the histogram bins should correspond to the data values. You only need to show the y-axis for the histogram.
[ ] The next step will be to calculate the chi-square test statistic and then actually run the hypothesis test. Have you ever learned about hypothesis tests before? I can give you a quick primer. For the chi-square test, it's actually pretty simple. Find the javascript function that gives you a p-value for a given value of Chi-square with 3 degrees of freedom. If that p-value is less than alpha, we reject the null hypothesis.
[ ] This is going back up a step, but in the first part of the module, when the user draws a random sample of however many points, could you display the proportion of the points that end up highlighted in green? It should roughly correspond to the area under the curve. That might be a good teaching tool for me to help students understand what the area under the curve represents.

Thanks so much for your amazing work, Wayne!

waynew99 commented 2 years ago

Of course! Thanks for the comments. I will proceed to work on the next steps.

I have a quick question regarding the histogram's x-axis. Right now, the data values are reflected in their y values, and their x values are merely indices. If the histogram were to share the same x-axis with the data values, we would need to swap the current x and y axes for the data values. For normal distributions, this would result in the data values being spread out vertically around their mean on the x-axis. Are we okay with this behavior?

Please let me know if my description makes any sense.

ProfessorAmanda commented 2 years ago

Hi Wayne, yes, that's right. The data values should be on the x axis, and the y-axis should represent frequencies of those data values. For a histogram with vertical bars, the x-axis values show you which values of the data points belong in which bins, and the y-axis shows you how many points fit that criterion (if that makes sense).
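
To make the relationship between data values and bins concrete, here is a small sketch that splits the sample's range into equal-width bins along the x-axis and counts the points in each one. It is independent of Highcharts' own binning, and `binSample` with its `lowerBound`, `upperBound`, and `count` fields is a hypothetical shape chosen for illustration.

```javascript
// Split [min, max] of the sample into equal-width bins and count
// how many points fall into each bin.
function binSample(sample, numberOfBins) {
  const min = Math.min(...sample);
  const max = Math.max(...sample);
  const width = (max - min) / numberOfBins;
  const bins = Array.from({ length: numberOfBins }, (_, i) => ({
    lowerBound: min + i * width,
    upperBound: min + (i + 1) * width,
    count: 0,
  }));
  for (const x of sample) {
    // The maximum value sits on the last edge, so clamp it into the last bin.
    // If every value is identical (width 0), put everything in the first bin.
    const i = width === 0
      ? 0
      : Math.min(Math.floor((x - min) / width), numberOfBins - 1);
    bins[i].count += 1;
  }
  return bins;
}
```

Each bin's `lowerBound` and `upperBound` give the x-axis extent of a bar, and `count` gives its height on the y-axis.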

waynew99 commented 2 years ago

Got it! Thanks, Amanda! I've made the change and pushed. I'm starting the Chi-squared module.

ProfessorAmanda commented 2 years ago

Thanks, Wayne! The axes look right now. I think I now see an error in the generation of the uniform dots. When I plotted a sample with a mean of 0 and standard deviation of 1, I got a range of points from -10 to 10. The formula for the standard deviation of a uniform distribution is sqrt((B-A)^2/12), where A and B are the endpoints. You might want to check which parameters the given javascript function requires.

waynew99 commented 2 years ago

Currently we are using distribution generation functions from this library: https://statisticsblog.com/probability-distributions/#uniform. The uniform distribution generation function takes in three parameters: sampleSize, lowerBound, and upperBound. I'm not exactly sure how we can feed standard deviation to this function. Do you think we should explore functions from other libraries instead?

ProfessorAmanda commented 2 years ago

I think it should be possible to algebraically back out the lower bound and upper bound from the mean and standard deviation. Let me do some algebra for a few minutes (I'm procrastinating on something scary, lol).

ProfessorAmanda commented 2 years ago

Hi Wayne,

Here are expressions for the lower bound and upper bound using the mean and standard deviation:

Lower bound = mean - stddev * sqrt(3)
Upper bound = mean + stddev * sqrt(3)

Give those a shot, and let's see if the picture looks more reasonable.
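
These expressions follow from solving sqrt((B-A)^2/12) = stddev together with (A+B)/2 = mean. A short sketch of the conversion, with a round-trip check, might look like the following; `uniformBounds` and `uniformSd` are hypothetical helper names, and the resulting bounds would then be passed to whatever uniform generator is in use.

```javascript
// Endpoints of the uniform distribution that has the given
// mean and standard deviation.
function uniformBounds(mean, sd) {
  const halfWidth = sd * Math.sqrt(3);
  return { lowerBound: mean - halfWidth, upperBound: mean + halfWidth };
}

// Standard deviation implied by endpoints A and B: sqrt((B - A)^2 / 12).
function uniformSd(lowerBound, upperBound) {
  return Math.sqrt((upperBound - lowerBound) ** 2 / 12);
}
```

For a mean of 0 and standard deviation of 1 this gives bounds of roughly -1.73 and 1.73, much tighter than the -10 to 10 range seen earlier.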

waynew99 commented 2 years ago

Ah, got it! Thanks a lot. This makes a lot of sense. The relationship between the standard deviation and the two bounds didn't click in my head for some reason. I'll give this a shot.

waynew99 commented 2 years ago

Hi Amanda,

It's up on the branch. Looks like the bounds are now changing correctly according to the stddev and mean. Please let me know if it looks good to you!

ProfessorAmanda commented 2 years ago

Awesome! Looks much more reasonable.

[ ] Is it possible for the size of the graph to adjust based on the standard deviation the user chooses?
[ ] Sorry I didn't catch this earlier, but can you check how you are calculating the expected frequencies? It looks like your expected frequencies follow a uniform distribution (same count across all the bins). We'd like expected frequencies based on a normal distribution, which will tend to have more points towards the middle. You will need to figure out the probabilities of being between the endpoints of each bin for a normal distribution, then multiply that by the total sample size, to figure out what number we expect in each bin.

If you tell me what javascript function you have to figure out normal probabilities and what its inputs are, I can help you figure out the probabilities for each bin.

waynew99 commented 2 years ago

Sure!

For the size of the graph, do you mean changing the scale of the x-axis to make sure we don't have an overly "squeezed" chart for a low standard deviation value?
For the expected frequencies:
- For normal distribution, I'm using the function (nd.cdf(bin.upperBound) - nd.cdf(bin.lowerBound)) * sampleSize, in which nd.cdf is the library function I'm using to calculate the cumulative distribution function for the normal distribution.
- For the uniform distribution, I'm simply doing sampleSize / numberOfBins for each bin.

ProfessorAmanda commented 2 years ago

Yes exactly.
Aha! You want to use the normal expected frequencies in both cases, since the idea is to test whether the dataset is drawn from a normal distribution. In the case where the dataset is uniform, you'll have that normal cdf in the background (given the mean and standard deviation the user inputs), but you'll never plot it. Eventually, we will hide from the user the information about whether you draw from the uniform or normal distribution to get the dataset.

waynew99 commented 2 years ago

I see! Thanks! I've pushed the changes.

ProfessorAmanda commented 2 years ago

Lookin' good! Proceed!

waynew99 commented 2 years ago

Hi Amanda,

I spent some time digesting the concepts of hypothesis tests, chi-squared distribution, and p-value. I found this "Chi-square goodness-of-fit test" function from this library: https://stdlib.io/docs/api/latest/@stdlib/stats/chi2gof. I'm not entirely sure that I'm understanding it correctly, but what I have right now is that I'm feeding the goodness-of-fit test with the observed frequencies for each bin, expected frequencies for each bin, and user defined alpha. The function performs the calculation and returns the result, including the pValue and the test statistic. Depending on whether the pvalue is smaller than the alpha, we either accept or reject the null hypothesis.

The changes are on the branch. Please let me know if any of these is incorrect. Thanks!

ProfessorAmanda commented 2 years ago

You've done it!! I really think this is working. I'm going to check the numbers another time, but for now, just a few small things.

[ ] I was really glad to see you implement the option of chi-square as a randomly selected distribution shape. I think its range looks correct given the inputted mean and standard deviation, but I can check that later.
[ ] Okay now for some small text things. There is a typo in the text "Randomly chose [whatever] as distribition shape." (word "distribution").
[ ] When the user tries to use too many bins, the text should say: "Cannot ensure at least 5 sample points per bin." (not "at least 5 samples")
[ ] In the conclusion to the hypothesis test, replace "Accept the null hypothesis" with the text "Fail to reject the null hypothesis." The only possible outcomes are "Reject the null hypothesis" or "Fail to reject the null hypothesis."
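
For reference, the test statistic and the wording of the conclusion can be sketched independently of any library. This is a minimal illustration, not the module's code: `chiSquareStatistic` and `testConclusion` are hypothetical names, and the p-value is assumed to be supplied by a goodness-of-fit routine such as the one discussed above.

```javascript
// Chi-square goodness-of-fit statistic: sum of (observed - expected)^2 / expected.
function chiSquareStatistic(observed, expected) {
  if (observed.length !== expected.length) {
    throw new Error("observed and expected must have the same number of bins");
  }
  return observed.reduce(
    (sum, obs, i) => sum + (obs - expected[i]) ** 2 / expected[i],
    0
  );
}

// Conclusion text for the hypothesis test. The null hypothesis is never
// "accepted"; the only outcomes are to reject it or fail to reject it.
function testConclusion(pValue, alpha) {
  return pValue < alpha
    ? "Reject the null hypothesis"
    : "Fail to reject the null hypothesis";
}
```

For instance, observed counts of 10, 20, and 30 against expected counts of 20 each give a statistic of 100/20 + 0 + 100/20 = 10.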

Thanks again, Wayne! I think this will be ready for deployment and beta testing soon. Amazing!

If we really run out of stuff to do, I have an idea for a crazy simulation demonstrating why this hypothesis test works....but let's table that for now :)

waynew99 commented 2 years ago

Awesome! Thanks for pointing out the typos. I need to turn on spellcheck in my code editor...

Before we come up with a plan for the next module, I can pick up working on some of the long-term enhancement issues.

ProfessorAmanda commented 2 years ago

Sounds great!

ProfessorAmanda commented 2 years ago

Maybe we should try to deploy this and ask for beta testers? I can tweet it out to friends.

waynew99 commented 2 years ago

Sounds good! I will create the PR.

ProfessorAmanda commented 2 years ago

Hi Wayne,

In the latest version of the Master branch, I'm getting this error when I try to run the Normal Distribution simulation. I think this is the only one that gives me this error. I ran "npm install" just in case, and that did not fix it. Can you take a look? [Screenshot attached]

waynew99 commented 2 years ago

Did you try npm install --legacy-peer-deps ?

ProfessorAmanda commented 2 years ago

Ah shoot no in my sleepy hurried state I did not! Let me try again in a bit.

ProfessorAmanda commented 2 years ago

Hi Wayne, can you take a look at the expected frequencies for the case where the underlying data happens to come from a chi-square distribution? I think they might still be calculated incorrectly. Thanks!

waynew99 commented 2 years ago

Sure. Working on it!

ProfessorAmanda / econsimulations

New Module Idea: Normal Distribution Simulation #295