Still to do: Skew, and Capture of next mean

gfmoore / esci-dances

A JavaScript implementation of the ESCI software simulator used in Introduction to The New Statistics (by Geoff Cumming and Robert Calin-Jageman)

https://thenewstatistics.com/esci-js/

Still to do: Skew, and Capture of next mean #11

Closed gdcumming closed 4 years ago

gdcumming commented 4 years ago

Skew: On 17 May I made the following cryptic comments in an email:

just quickly, for skew I used lognormal

Comments refer to the CIjumping sheet in ESCI intro chapters 3-8.

You probably know that you can unprotect the CIjumping sheet (no password), display gridlines and headings, then see all formulas

Cell BZ42: ...NORMDIST(LN((BY42-$BX$13)/$BX$12),0,$BW$8,FALSE)...

See comment at BX8

BW8 is degree of skew

gfmoore commented 4 years ago

Ok, so I looked at your lognormal code in Excel and couldn't figure it out.

I spent the whole day yesterday looking mostly at skew curves in the literature and playing with them on R and eventually found that a skew normal looked reasonably good. I can't say I really like any of the skew curves I've found yet. There's a whole field out there and most of it went over my head, though I was just skimming.

I then found an amazing approximation to the skew normal. It looked pretty frightening to code, but eventually I did it and with a few tweaks it looks pretty good. I say tweaks...always tweaks to get things looking right. Oh and it can do negative (left) skew as well.

Whilst doing this I recalled a comment made elsewhere (?) that the normal should flatten out as the sd is increased. I looked at my code and realised I was autofitting the pdf to fit the height available. So I took this out and, because of the custom pdf routine, decided to scale pdfs myself before they got drawn.

I think they look pretty good? The fill is still a problem and not great, but again I think it can do for a little while.

The title of this issue includes "capture of next mean", but I can't see a comment for that?

gfmoore commented 4 years ago

Approximation to Skew Normal due to Samir K. Ashour, Mahmood A. Abdel-hameed et al https://www.sciencedirect.com/science/article/pii/S209012321000069X

btw I'm ok to go back to lognormal, but you'll have to show me how to use it please.

gdcumming commented 4 years ago

That skew normal paper looks pretty complicated!

It's great to see a skew dist implemented. To my eye that skew normal looks ok, tho' I'm not mad keen on the appearance for large amounts of skew. I think I prefer the look of the lognormal. Tho' not essential.

I think (hope) lognormal should not be too hard. I followed Wikipedia: https://en.wikipedia.org/wiki/Log-normal_distribution

If Y ~ N(0, sigma^2) then X = EXP(Y) is lognormal

Let c = my skew factor, 0.1 to 1, used as sigma. Then mean of lognormal = EXP(0.5*c^2) and variance of lognormal = (EXP(c^2)-1)*EXP(c^2)

In Excel, the ordinate (y) is calc as: NORMDIST(LN(x/(horizontal scale factor)),0,c,FALSE)

I don't know if all that helps, or whether wikipedia makes it simpler.

I see that I don't report the mu and sigma, for lognormal pop, as I suggested may be good for draw-your-own. Tho the values are there, cells BX9 and BX11
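The formulas in the comment above can be sketched in JavaScript as follows. This is only an illustration, not the ESCI or ESCI-JS code: the function names and the `scale` parameter (standing in for the "horizontal scale factor") are assumptions. Note that the Excel-style ordinate is the normal density of ln(x/scale); the textbook lognormal density has an extra 1/x factor, shown separately.

```javascript
// Normal density, equivalent to Excel's NORMDIST(x, mean, sd, FALSE).
function normalPdf(x, mean, sd) {
  const z = (x - mean) / sd;
  return Math.exp(-0.5 * z * z) / (sd * Math.sqrt(2 * Math.PI));
}

// Curve height in the style of NORMDIST(LN(x / scale), 0, c, FALSE).
// `scale` is a hypothetical name for the horizontal scale factor; x <= 0 gives 0.
function excelStyleOrdinate(x, c, scale = 1) {
  if (x <= 0) return 0;
  return normalPdf(Math.log(x / scale), 0, c);
}

// Textbook lognormal density for X = scale * exp(Y), with Y ~ N(0, c^2).
// It differs from the ordinate above by the extra 1/x factor.
function lognormalPdf(x, c, scale = 1) {
  if (x <= 0) return 0;
  return excelStyleOrdinate(x, c, scale) / x;
}

// Mean, variance and sd of X = scale * exp(Y), using the formulas quoted above.
function lognormalMoments(c, scale = 1) {
  const mean = scale * Math.exp(0.5 * c * c);
  const variance = scale * scale * (Math.exp(c * c) - 1) * Math.exp(c * c);
  return { mean, variance, sd: Math.sqrt(variance) };
}

// Example: moments for a mid-range skew factor.
console.log(lognormalMoments(0.5));
```

Because the ordinate is only used to draw a shape, the missing 1/x factor may not matter for display, but it would matter if the curve were treated as a true density.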

gfmoore commented 4 years ago

Okay, so after a lot of thinking, experimenting, I had a play with Geogebra (have you ever used it?) and found that it had lognormal built in already and then I could see what it did, then I found that jStat had lognormal.dist and got it. Really easy but I was confused by how the mean and sigma figured into it!

Anyhow what I did was tweak the curve for each value of skew and made it sort of fit the mean of 50 and sd of 20. The curve has so much variability (not in the statistics sense). Now whether that makes sense statistically I doubt, but since it is just to get a "shape" I'm quite happy with it. As you said it looks better than the skew normal (though not perfect in my eyes?)

Now here's the thing. In order to be able to make a random selection from the distribution and to be able to add fill bubbles to the distribution, I create an array (called popnBubbles :) that I randomly fill with points. (Using your ideas). This is sort of akin to a method I recall of finding pi (didn't it use needles?) Anyway it's a Monte-Carlo method I guess. I use the width of the display times a constant (500). That's about 3000 points. More takes too long and the fill looks poor.

Using these points to give the mean and skew for the skew distribution and the custom curve only gives approximate values for the mean and skew, so we need to use caution on these two distributions - it's a teaching aid?
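The approach described above (random points kept only if they fall under the curve, then used to estimate the mean and sd) is essentially rejection sampling. A minimal sketch follows, assuming the pdf is available as a function and that a bounding box is known. The names `fillUnderCurve` and `meanAndSd` are hypothetical and this is not the actual ESCI-JS code. A small seeded generator is used only so that runs are repeatable; `Math.random` would do as well.

```javascript
// Small seeded PRNG (mulberry32) returning values in [0, 1).
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Keep drawing uniform points in the box [xMin, xMax] x [0, yMax] and
// keep only those that lie on or under the curve, until `count` are stored.
function fillUnderCurve(pdf, xMin, xMax, yMax, count, rand) {
  const points = [];
  while (points.length < count) {
    const x = xMin + rand() * (xMax - xMin);
    const y = rand() * yMax;
    if (y <= pdf(x)) points.push({ x, y });
  }
  return points;
}

// Estimate the population mean and sd from the x values of the stored points.
function meanAndSd(points) {
  const n = points.length;
  const mean = points.reduce((s, p) => s + p.x, 0) / n;
  const variance = points.reduce((s, p) => s + (p.x - mean) ** 2, 0) / n;
  return { mean, sd: Math.sqrt(variance) };
}

// Example: a normal curve with mean 50 and sd 20, as mentioned in the thread.
const normalCurve = (x) =>
  Math.exp(-0.5 * ((x - 50) / 20) ** 2) / (20 * Math.sqrt(2 * Math.PI));
const bubbles = fillUnderCurve(normalCurve, -30, 130, normalCurve(50), 3000, mulberry32(1));
console.log(meanAndSd(bubbles));
```

With a few thousand points the estimates are only approximate, which matches the caution above about the skew and custom curves.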

If you wanted a real simulation we could have a different program, now that I know the technique.

Anyway, I've gone over it all and checked and checked and it looks okay, but as a programmer I'm probably only checking what I see, not what I don't see! (Donald Rumsfeld and unknown unknowns!)

Also I added in a negative skew by reflecting the x values. I don't think you can have a negative lognormal, can you?
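Reflecting the x values can be sketched as below. The pivot point is an assumption (the thread does not say which value the curve is reflected about), and the function names are hypothetical. Reflection leaves the spread unchanged and flips the sign of the skewness.

```javascript
// Reflect a pdf about a vertical line at x = pivot.
function reflectPdf(pdf, pivot) {
  return (x) => pdf(2 * pivot - x);
}

// Reflect a set of x values about the same pivot.
function reflectValues(xs, pivot) {
  return xs.map((x) => 2 * pivot - x);
}

// Sample skewness (third standardised moment, population form).
function skewness(xs) {
  const n = xs.length;
  const mean = xs.reduce((s, x) => s + x, 0) / n;
  const m2 = xs.reduce((s, x) => s + (x - mean) ** 2, 0) / n;
  const m3 = xs.reduce((s, x) => s + (x - mean) ** 3, 0) / n;
  return m3 / Math.pow(m2, 1.5);
}

// Example: a right-skewed set of values becomes left-skewed after reflection.
const sample = [1, 2, 3, 10];
const reflected = reflectValues(sample, 5);
console.log(skewness(sample) > 0, skewness(reflected) < 0); // true true
```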

Hope you like it. I'll sort out the tablet version when I've got all these other issues resolved.

gfmoore commented 4 years ago

I noted some issue with filling the skew curve. Made some fixes. It is better, but it's not so good with steep verticals.

If you remember I create an array of random points that lie within the pdf values found in another array. I use a linear interpolation for the y values between two pdf x values, because there might not be as many points in the pdf. I need the random "bubbles" array to estimate the mean and sd of the distribution for skew and for custom. (Yes I know there's a formula for mean and sigma of a lognormal, but I've done so many tweaks to get the skews to look right that this method seems easier.)
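The linear interpolation step mentioned above might look like the sketch below. It assumes the pdf is stored as an array of `{x, y}` points sorted by ascending x; that layout and the name `interpolateY` are assumptions, not the actual ESCI-JS data structure.

```javascript
// Return the curve height at x by linear interpolation between the two
// tabulated points that bracket x. Outside the tabulated range the height is 0.
function interpolateY(curve, x) {
  const last = curve.length - 1;
  if (x < curve[0].x || x > curve[last].x) return 0;

  // Binary search for the segment [lo, hi] containing x.
  let lo = 0;
  let hi = last;
  while (hi - lo > 1) {
    const mid = (lo + hi) >> 1;
    if (curve[mid].x <= x) lo = mid;
    else hi = mid;
  }

  const a = curve[lo];
  const b = curve[hi];
  if (b.x === a.x) return a.y; // single point or duplicate x
  const t = (x - a.x) / (b.x - a.x);
  return a.y + t * (b.y - a.y);
}

// Example with a coarse three-point curve.
const coarse = [{ x: 0, y: 0 }, { x: 10, y: 2 }, { x: 20, y: 0 }];
console.log(interpolateY(coarse, 5)); // 1
```

On steep sides a coarse table gives a poor straight-line approximation, so a denser table near the steep part is one possible way to reduce the fill errors described in this comment.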

Now I've shown to myself that the random bubble centres fill the curve nicely, no areas not filled, but like the cosmic background radiation it isn't smooth!

Now how do I decide whether a bubble should be drawn? If the edge of the bubble cuts through the curve boundary then it should be drawn. The big question is how do I properly determine this. I've spent hours trying different methods, including some quite tricky ones (for me), but the steep sides keep getting at me, or I get gaps near the top on just one side (which is weird).

Anyway I've gone back to a simpler method that just looks at the bubble height + some small tweak amount and if less than the curve height (which I know) at that point then draw it.
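The simple check described above can be written in a line or two. The sketch below also shows one hypothetical stricter variant that tests the curve at the bubble's left and right edges, which is just one possible way to handle steep sides and not something taken from ESCI-JS. Heights are assumed to be measured upwards from the axis, and `bubble.r` is an assumed radius field.

```javascript
// Simple check: bubble height plus a small tweak must be below the curve
// height at the bubble's x position.
function shouldDrawBubble(bubble, curveHeightAt, tweak = 0) {
  return bubble.y + tweak < curveHeightAt(bubble.x);
}

// Hypothetical stricter variant: the top of the bubble must be below the
// curve at its left edge, centre and right edge.
function fitsUnderCurve(bubble, curveHeightAt) {
  const top = bubble.y + bubble.r;
  return [bubble.x - bubble.r, bubble.x, bubble.x + bubble.r]
    .every((x) => top < curveHeightAt(x));
}

// Illustrative steep-sided curve (a triangle with its peak at x = 0).
const steepCurve = (x) => Math.max(0, 10 - 5 * Math.abs(x));

// A bubble on the steep side passes the simple check but not the stricter one.
console.log(shouldDrawBubble({ x: 1.5, y: 1, r: 0.5 }, steepCurve, 0.5)); // true
console.log(fitsUnderCurve({ x: 1.5, y: 1, r: 0.5 }, steepCurve));        // false
```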

As I said the steep sides can cause this to fail.

Again, another one to look at later.

gdcumming commented 4 years ago

I like those skew curves. The max skew may not be quite so extreme as that in ESCI, but it's enough for a great demo of the central limit theorem--take samples with N = 2, 4, 6, 10, etc, and see the mean heap move quickly in shape from the pop shape towards a symmetric normal. With high skew and very small N, the capture %ages are not accurate, because the statistical model underlying the calc of MoE is wrong--the pop, or more to the point, the sampling dist, is not normal.

I see what you mean about fill and steep ends. But it's not bad at all. I suspect it would be even better if you could bring the pop curve in front of the data point circles. I feel your pain with the infinite fill tweaks!

I just noticed--circles filling under the curve are blue. Could we also have the circles for the data values in the latest sample (just below the pop curve) also the same blue? (I know the dropping means have black outline, but that's ok, especially since they have solid colour fill.)

Have you tried with a smaller number of fill circles? I see ESCI used 400, but smaller display area. Not that your current no. is bad, but maybe fewer would look ok? But I'm happy if you prefer to leave it as is.

You mention about estimating the mu and sigma for the pop. Monte Carlo can be fine. We do need these fairly accurate if capture %ages are to be accurate, in particular with large N, when the mean heap will be very very close to normal, so we expect the capture % to be very close to C. I've done a couple of runs of 5,000 samples and the capture %ages were ok.
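A capture-percentage run like the one described above can be sketched as follows. This assumes a CI of the form mean ± 1.96 × sigma / sqrt(N) with sigma known; the thread does not say exactly how ESCI-JS computes MoE, so treat that as an assumption. All function names are hypothetical.

```javascript
// Small seeded PRNG (mulberry32) so that runs are repeatable.
function mulberry32(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// One standard normal draw using the Box-Muller transform.
function normalDraw(rand) {
  const u = 1 - rand(); // in (0, 1], avoids log(0)
  const v = rand();
  return Math.sqrt(-2 * Math.log(u)) * Math.cos(2 * Math.PI * v);
}

// Percentage of 95% CIs (known sigma, z-based) that capture mu.
function capturePercent(draw, mu, sigma, n, runs, rand) {
  const moe = (1.96 * sigma) / Math.sqrt(n);
  let captured = 0;
  for (let r = 0; r < runs; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += draw(rand);
    if (Math.abs(sum / n - mu) <= moe) captured++;
  }
  return (100 * captured) / runs;
}

// Normal population with mean 50 and sd 20.
const normalPop = (rand) => 50 + 20 * normalDraw(rand);

// Skewed population: lognormal with skew factor c, using its exact mean and sd.
const c = 1;
const lognormalPop = (rand) => Math.exp(c * normalDraw(rand));
const lnMean = Math.exp(0.5 * c * c);
const lnSd = Math.sqrt((Math.exp(c * c) - 1) * Math.exp(c * c));

const rng = mulberry32(1);
console.log(capturePercent(normalPop, 50, 20, 10, 5000, rng).toFixed(1));
console.log(capturePercent(lognormalPop, lnMean, lnSd, 2, 5000, rng).toFixed(1));
```

For the normal population the first figure should land close to 95. The second run, with high skew and N = 2, is the case where the capture percentage is expected to drift away from 95.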

While on fill, did I mention somewhere that the fill under rectangular often has a little bit of space above and below the fill, i.e. just under the top of the curve, and just above the lower axis? (You've probably tweaked this endlessly?)

I'll make capture of next mean a separate issue.

gfmoore commented 4 years ago

Quote: With high skew and very small N, the capture %ages are not accurate, because the statistical model underlying the calc of MoE is wrong--the pop, or more to the point, the sampling dist, is not normal.

Oh. I don't know what to do about this. I estimate the sampling mean and sd using about 50,000 bubbles. That is the width of the display area in pixels x 50. The 50 seems to give the best balance between having a filled look and a mottled fill. It has to be based on the width of the display device and the width of the browser window.

Of course there is an issue with this that affects speed of falling means. With so many bubbles stored in the DOM adding and manipulating dropping means puts a strain on searching through the DOM to move the dropping means. If you turn off the fill you'll see what I mean, but the bubbles are still stored in an array.

I've redone the fill algorithm, again, to cater for the problem with different devices not displaying correctly.

I'll look into drawing on top of the fill, good idea.

gdcumming commented 4 years ago

0.3.30

Hi skew, lo N, and capture %ages: That's an issue with the overly-simple statistical model we are using, and NOT any problem in ESCI-JS! There's nothing to fix! It's a thing we need to point out in the book, after folks have seen it in action. Always consider the limits of the statistical model you choose to work with! Using some other approach to calculating a CI, e.g. bootstrapping, could easily be better in such cases. In practice, very often the best we can do is to get an approx CI. In practice it (very) rarely matters if a nominally 95% CI actually captures on 95.5 or even 96.5 or 93% of occasions.

Fill looks excellent to me. No need to fiddle further, imho.

Bringing curve to front--great if you can do that without too much hassle.

Still on the list: Could we also have the circles for the data values in the latest sample (just below the pop curve) also the same blue as the fill? (I know the dropping means have black outline, but that's ok, because they are means, not data points (and also have colour fill)) Ideally, those data point circles would look as similar as possible to the circles in the fill under the curve.

gfmoore commented 4 years ago

Okay, brought curve and sd lines to front. Changed colour of sd lines to darkgrey from lightgrey - you may want them black - easy to change, just css. (Though I'd then probably have to change the capture mu line - also dark grey.)

Changed stroke of sampled items to blue.

The reason why the stroke of the dropping means was changed to black is that when they were blue the blue and the dark green "dithered" to give a purplish tinge (so my wife said) and it just looked wrong, so changed to black.

Hopefully we can close this now? Or after any colour changes.

gdcumming commented 4 years ago

That all looks terrific! Closing...