DS4PS / cpp-524-sum-2020

Course shell for CPP 524 Foundations of Program Evaluation II for Summer 2020.
http://ds4ps.org/cpp-524-sum-2020/

Lab 02 Question #6

gitamanda opened 4 years ago

gitamanda commented 4 years ago

This is something I noticed when I was re-doing Lab 02 to be sure I understood how to complete it. When I run the code chunk in RStudio which contains the calculation for a chi-squared p-value, I get an answer. However, if I knit the document to HTML, the chi-squared p-value is almost always slightly different. Why might this be? Is the dataset being updated that frequently? Also, is this a better question for our Tuesday discussion session?

lecy commented 4 years ago

Try this:

set.seed( seed=1234 )
# code here

Does it change now?

In statistics there are two types of estimators. There are closed-form solutions, where a mathematical proof has been derived to show the parameters that provide the best solution to the problem (usually minimizing the residuals, or maximizing the accuracy of the classifier if the goal is prediction). These will be precise numbers, and no matter how many times you run the model (unless you change the data inputs) you will get the exact same solution.
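For instance, here is a closed-form estimator in action (a quick sketch using R's built-in mtcars data):

m1 <- lm(mpg ~ wt, data = mtcars)  # OLS has a closed-form solution
m2 <- lm(mpg ~ wt, data = mtcars)  # re-running changes nothing
identical(coef(m1), coef(m2))      # TRUE -- same data in, same numbers out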

And then there are stochastic estimators. These are necessary when there is no closed-form solution to a problem, so algorithms have been developed to come up with approximate solutions. Many do this through a search process, which will be sensitive to the starting point, or the seed in the random number generator. They will typically iterate through 10,000 steps of some search process that gets you very close to the best solution, but never returns exactly the same number twice.
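Here is the same idea with the simulated chi-square p-value (a sketch using a made-up 2x2 table, not the lab data):

tbl <- matrix(c(12, 5, 7, 7), nrow = 2)  # hypothetical counts
chisq.test(tbl, simulate.p.value = TRUE, B = 10000)$p.value
chisq.test(tbl, simulate.p.value = TRUE, B = 10000)$p.value  # slightly different

set.seed(1234)  # fixing the seed pins down the whole simulation
chisq.test(tbl, simulate.p.value = TRUE, B = 10000)$p.value
set.seed(1234)
chisq.test(tbl, simulate.p.value = TRUE, B = 10000)$p.value  # identical now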

A visual example of one type of stochastic estimator is the "hill climber" algorithm. The goal is to find the highest point on the data plane, which represents the best solution for the model (a slope that minimizes the residual error in a regression, for example, or an approximation of the p-value in the chi-square test).

Each orange X represents a different start point or random seed in the estimator, and they are all searching for the highest hill. Some get stuck on top of lower hills because they don't want to take too many steps down a hill (toward worse solutions). Most of the X's have different start points yet still find the tallest hill, which represents the best model parameters, but you can see they keep bouncing around the top (which is why you get small differences in your p-values when you re-run the chi-square tests).
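If you want to see the mechanics, here is a toy hill climber in R (a sketch of the general idea, not the exact algorithm from the animation): propose a small random step, keep it only if it moves uphill, and repeat.

f <- function(x) dnorm(x, mean = 2) + 0.6 * dnorm(x, mean = -3)  # two "hills"

hill_climb <- function(start, steps = 1000, step_size = 0.1) {
  x <- start
  for (i in 1:steps) {
    candidate <- x + rnorm(1, sd = step_size)  # propose a nearby point
    if (f(candidate) > f(x)) x <- candidate    # only step uphill
  }
  x
}

sapply(c(-5, 0, 5), hill_climb)  # different starts can end on different hills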

Thus if you change the seed, you will change the results slightly (in some special cases the results could change greatly, but that's a longer discussion).

Which raises the question: if you don't set the seed explicitly, what does R use as the seed value?
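The short answer, per R's own documentation (see ?set.seed): when no seed has been set, R creates one from the current time and the process ID the first time a random number is requested. A quick way to peek at the state it stores:

if (exists(".Random.seed")) rm(.Random.seed, envir = globalenv())
runif(1)            # first draw forces R to create a fresh seed
head(.Random.seed)  # the internal RNG state R will use from here on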


Are you sorry you asked?

gitamanda commented 4 years ago

Thank you so much for this explanation. It took a little bit of reading and re-reading, but I think I understand: some of the algorithms have not yet attained an "exactness". So, is the chi-squared p-value a stochastic estimator? And if so, why did we only do 1,000 iterations during the assignment, instead of 10,000? Thanks again for explaining, it really helps to solidify my understanding of these concepts.

gitamanda commented 4 years ago

Never mind, I realized my mistake: we are supposed to iterate through 10,000 repetitions instead of 1,000.

lecy commented 4 years ago

Typically when you are first building your models you set the number of iterations low because even 100 iterations will give you a reasonable solution. But when you want to create your final report for your clients or publish your paper in a peer-reviewed journal you will increase the N to 10,000 since typically that will get you very close to the actual solution that provides the best model fit.
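You can see that precision buy-in directly (a sketch; the 2x2 table below is made up): the spread of simulated p-values shrinks as B grows.

tbl <- matrix(c(20, 30, 25, 25), nrow = 2)  # hypothetical counts
sim_p <- function(B, reps = 20)
  replicate(reps, chisq.test(tbl, simulate.p.value = TRUE, B = B)$p.value)
sd(sim_p(100))    # noisy p-values
sd(sim_p(10000))  # much tighter around the true value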

It completely depends on the data and algorithms, but in some instances it might take a few seconds to run the algorithm (the chi-square test for example). In other cases it might take a few days. So you would want to calibrate your model and feel confident before you do your final run, which could take a week, to present that one result.

Once you leave the regression context and get into the machine learning world, many of the model options are trying to balance accuracy and run time. You can get an answer quickly, but it might be 80% of what the best model fit would offer. Or you can get to 99% with a different model or different argument settings, but it might take a long time.

It's kinda like real life, where you pay for everything in time and in money. You can get something cheaply, but it costs a lot to shop for discounts and to fix it when it breaks. Or you can pay a lot for convenience and quality, which buys you a lot of time. You should be making decisions by thinking about each purchase with both criteria in mind, not just optimizing one or the other. The solutions that balance both will probably provide the most value!