gfmoore / esci-correlation

esci correlation
GNU General Public License v3.0

Basic specification #1

Closed gfmoore closed 4 years ago

gfmoore commented 4 years ago

Spec for see r, component of esci web, Geoff, 24 Aug 2020

Name is see r with no hyphen, italic r

Start with see r page in esci web Home, 0.0.1, a nice starting point.

Two tabs, as in distributions: Same r, Dance r

All reported correlation values have no leading zero.

Bottom line in control panel: smaller font, to match other esci web pages.

Same r tab

Text on tab: Line 1: Same r

Lines 2, 3…: See data sets all with the same r

(Possibly that extra text should be not on tab, but in non-coloured area below tab and above Panel 1?)

Panel 1: N set by slider, min 4, max 300

Panel 2: Data set correlation r

Slider for r, from -1 to 1, steps of .01

Report value: r of data set [ ] ---just below chosen r (as in ESCI intro), should always match the value just above.

Panel 3: Button ‘New Data Set’

Text under button: …with correlation r

Panel 4: Click to display

[ ] Value of r in the figure

[ ] Cross through means

[ ] Marginal distributions

Panel 5: Descriptive statistics [ ] ---as in Panels 8, 9 in dances, at first see only this line and checkbox. Click to open, and then see:

2 x 2 table of values, two rows are Mean and SD, two columns are X and Y—as in the yellow area at red 4, in the See r page of ESCI for UTNS

Panel 6: Display lines [ ] ---as in Panels 8, 9 in dances, at first see only this line and checkbox. Click to open, and then see:

[ ] Y against X Slope […]

[ ] X against Y Slope […]

[ ] Correlation line Slope […] <the correlation line also goes through the means of X and Y, and has slope that is the geometric mean of the slopes of the two regression lines, just above>
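The relationship between the three slopes in Panel 6 can be written out as a short derivation. It assumes the X against Y slope is reported as the line appears in the figure (Y on the vertical axis), and it uses s_X and s_Y for the sample SDs, notation that is not in the spec.

```latex
\begin{aligned}
b_{Y\,\text{on}\,X} &= r\,\frac{s_Y}{s_X}, \\[4pt]
b_{X\,\text{on}\,Y}^{\text{(as drawn)}} &= \frac{s_Y}{r\,s_X}, \\[4pt]
b_{\text{corr}} &= \operatorname{sign}(r)\,
  \sqrt{\,r\,\frac{s_Y}{s_X}\cdot\frac{s_Y}{r\,s_X}\,}
  \;=\; \operatorname{sign}(r)\,\frac{s_Y}{s_X}.
\end{aligned}
```

Under that assumption the correlation line has slope s_Y / s_X (with the sign of r), whatever the value of r. For r = 0 the X against Y line is vertical as drawn, so its slope is undefined.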

Generating a sample, Same r tab

Same as for Dance r below, except:

Start by taking a random sample using r = target value of r.

Then adjust the y values using:

Dispy = r × zx × T + SQRT(1 - r²) × zy

…where T (for tweak) is chosen iteratively so that the correlation r in the sample equals the target value. (See the same columns in ESCI See r. Uses Excel ‘Goal seek’)
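One way the iterative choice of T might look in code is sketched below. It uses bisection as a stand-in for Excel's Goal Seek, which is an assumption, not a description of what ESCI does. The function names, the search range for T and the tolerance are all hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def disp_y(r, t, zx, zy):
    """Dispy = r * zx * T + SQRT(1 - r^2) * zy, applied to every point."""
    k = math.sqrt(1.0 - r * r)
    return [r * x * t + k * y for x, y in zip(zx, zy)]


def find_tweak(r, zx, zy, tol=1e-10, t_max=1e6):
    """Bisect on T until the sample correlation equals the target r.

    Assumptions: T is searched in [-t_max, t_max]. For r = 0 the tweak
    has no effect on the sample, and for r = +/-1 the sample r already
    equals the target, so T = 1 is returned unchanged in those cases.
    """
    if r == 0.0 or abs(r) == 1.0:
        return 1.0

    def err(t):
        return pearson_r(zx, disp_y(r, t, zx, zy)) - r

    lo, hi = -t_max, t_max
    # err(T) rises with T when r > 0 and falls with T when r < 0;
    # swap the ends so that err(lo) < 0 < err(hi) in both cases.
    if r < 0:
        lo, hi = hi, lo
    for _ in range(200):
        mid = (lo + hi) / 2.0
        e = err(mid)
        if abs(e) < tol:
            return mid
        if e < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    random.seed(3)
    # random.gauss stands in here for the NORMSINV step of the spec.
    zx = [random.gauss(0, 1) for _ in range(20)]
    zy = [random.gauss(0, 1) for _ in range(20)]
    t = find_tweak(0.5, zx, zy)
    print(round(pearson_r(zx, disp_y(0.5, t, zx, zy)), 6))
```

Note the r = 0 case: the T term vanishes there, so this tweak cannot move the sample correlation to exactly zero, and that case would need separate handling.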

Dance r tab

Text on tab: Line 1: Dance r

Lines 2, 3…: Sample from population with correlation r

(Possibly that extra text should be not on tab, but in non-coloured area below tab and above Panel 1?)

Panel 1: As for Same r tab above

Panel 2: Population correlation r

Slider for r, from -1 to 1, steps of .01

Report value: r of sample [ ] ---just below chosen r (as in ESCI intro).

Just below: 95% CI on r [ , ]

Panel 3: Button ‘New Random Sample’

Text under button: …from population with correlation r

Panel 4: As for Same r tab above

Panel 5: As for Same r tab above

Panel 6: As for Same r tab above
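The spec does not say how the 95% CI on r in Panel 2 is calculated. One common choice is Fisher's r-to-z transformation, sketched below. Whether ESCI uses exactly this method is an assumption, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def ci_on_r(r, n, level=0.95):
    """Approximate CI on a correlation via Fisher's r-to-z transformation.

    Assumes bivariate normal data. Needs n >= 4 so that the standard
    error 1 / sqrt(n - 3) is defined, which matches the slider minimum.
    """
    if n < 4:
        raise ValueError("n must be at least 4")
    if abs(r) >= 1.0:
        # atanh is infinite at +/-1; the interval collapses to a point.
        return (r, r)
    z = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for 95%
    return (math.tanh(z - crit * se), math.tanh(z + crit * se))


if __name__ == "__main__":
    lower, upper = ci_on_r(0.5, 30)
    print(f"[{lower:.2f}, {upper:.2f}]")
```

The interval is asymmetric around r except when r = 0, and it always stays inside (-1, 1).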

Generating a sample, Dance r tab

The population is assumed bivariate normal, with X having a normal distribution, mean 0, SD 1, and Y having the same. In addition, the population correlation between X and Y is r, and the variance of Y is homogeneous for all X, and the variance of X is homogeneous for all Y. See ITNS pp 309-311.

To generate a random sample, I’m reverse engineering from ESCI, also searching online, e.g.

https://www2.stat.duke.edu/courses/Spring12/sta104.1/Lectures/Lec22.pdf

At the See r page of ESCI intro, see columns BA to BF, down from the coloured labels.

The steps, for each of the N points in the sample:

1. Generate Randx and Randy, being independent random numbers, uniform on (0, 1).

2. Convert each into a z score, using zx = NORMSINV(Randx), and same for zy. In other words, Randx is the area under a standard normal to the left of zx. (zx and zy will have standard normal distributions.)

3. Then Dispx and Dispy are the coordinates of a point to be displayed, where: Dispx = zx, and

4. Dispy = r × zx + SQRT(1 - r²) × zy
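The steps above can be sketched in a few lines. This version uses Python's standard library, with NormalDist().inv_cdf playing the role of Excel's NORMSINV. The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist

_STD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def uniform_open01(rng):
    """Uniform random number strictly inside (0, 1)."""
    u = rng.random()
    while u <= 0.0:  # random() can return 0.0, where inv_cdf is undefined
        u = rng.random()
    return u


def dance_r_sample(n, r, rng=random):
    """Return n (Dispx, Dispy) points from a bivariate normal population
    with means 0, SDs 1 and correlation r."""
    k = math.sqrt(1.0 - r * r)
    points = []
    for _ in range(n):
        zx = _STD_NORMAL.inv_cdf(uniform_open01(rng))  # NORMSINV(Randx)
        zy = _STD_NORMAL.inv_cdf(uniform_open01(rng))  # NORMSINV(Randy)
        points.append((zx, r * zx + k * zy))           # (Dispx, Dispy)
    return points


if __name__ == "__main__":
    for x, y in dance_r_sample(5, 0.6, random.Random(2020)):
        print(f"{x:6.2f} {y:6.2f}")
```

A library that samples from a normal distribution directly makes the uniform-then-inverse step unnecessary. The Dispy formula is unchanged either way.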

Display, same for both tabs

As in See r page of ESCI intro, except that the X and Y axes have tick marks at 0.1 intervals, from -3 to +3, with values marked at integers (-3, -2, …, 0, …, 3)

Points with X and/or Y values outside (-3, 3) are shown as red dots on a boundary. (Set N = 300 and take samples to see some examples of red dots and where they are displayed. If marginal distributions is turned on, extreme value points are shown as solid black circles at an end of the marginal distribution.)
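A minimal sketch of the out-of-range rule follows. It assumes that "on a boundary" means each offending coordinate is clamped to the nearest edge of the display area. The function names and the returned structure are hypothetical.

```python
LIMIT = 3.0


def place_point(x, y, limit=LIMIT):
    """Return (plot_x, plot_y, is_red) for one data point.

    A point with X and/or Y beyond +/-limit is moved onto the nearest
    boundary and flagged so that it can be drawn as a red dot.
    """
    px = max(-limit, min(limit, x))
    py = max(-limit, min(limit, y))
    return px, py, (px != x or py != y)


def marginal_extremes(points, limit=LIMIT):
    """Say which ends of the marginal distributions need a solid black dot."""
    return {
        "x_low": any(x < -limit for x, _ in points),
        "x_high": any(x > limit for x, _ in points),
        "y_low": any(y < -limit for _, y in points),
        "y_high": any(y > limit for _, y in points),
    }


if __name__ == "__main__":
    sample = [(0.5, -1.2), (3.4, 0.2), (-3.5, 3.2)]
    print([place_point(x, y) for x, y in sample])
    print(marginal_extremes(sample))
```

Several out-of-range points at the same end share a single black dot in the marginal distribution, which is why marginal_extremes returns a flag for each end, not a count.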

gfmoore commented 4 years ago

First stab.

Note: the jStat library allows me to take random samples from a normal (or other) distribution directly.

gdcumming commented 4 years ago

0.0.3

This looks great and is wonderfully fast. I guess the big question is how to do the (possibly) iterative tweaking to get the r of the data set to match the target. Easy or virtually impossible? I hope it's closer to the former, tho' I suspect not super-easy.

gdcumming commented 4 years ago

0.0.3

Weird. I rec'd an email notice from github, in the usual format, referring to #1 of esci-correlation, but I can't see it here, despite having reloaded the page.

The message said: "If marginal distributions is turned on, extreme value points are shown as solid black circles at an end of the marginal distribution.)

I don't understand this. What page in the book?"

Sorry for lack of clarity. There's no relevant pic in book. Here's a pic from ITNS, with N=300, and marginal distributions turned on. There are 3 red dots along the top, meaning (as you know) 3 dots that happened to land higher than the top boundary. Similarly, 2 red dots at left.

How are those 5 out-of-area points represented in the marginal distributions?

-- the 3 at the top are represented by the single solid black dot at the upper end of (actually just beyond the upper end of) the Y marginal distribution. That black dot is lined up with the red dots (on same horizontal line as the red dots).

-- the 2 at left by single solid black dot just beyond left end of X marginal distribution, the solid black dot being lined up with the two red dots, i.e. on the same vertical line as the red dots.

Your placing of the red dots looks good to me. The marginal distributions could be along the X and Y axes, close to those axes, just inside the display area--something like as below. I suggested considering having the axes run e.g. -3.1 to +3, so there's a bit of extra space for those marginal distributions. But only if that's easy and looks ok to you. Or just squeeze them in, just inside the left and bottom edges of the current display area.

Absolutely no hurry about any of this.

image

gfmoore commented 4 years ago

Yes, I made the comment, realised it was in excel and deleted comment. I thought there would be a notification, now I know better.

So a first stab.

Not sure so much about what you want from display lines.

I calculated linear regression equations for y on x and x on y, and for the "SD line" or correlation line I used the average of the betas, i.e. y = alpha + beta x.

For some strange reason the jStat package does not (yet) include regression. Have a look see if it looks right? https://jstat.github.io/all.html for documentation

I have used the jStat package to calculate the Pearson correlation coefficient from the generated data set.

If you want you could supply a few x, y coordinates and what values you expect and I'll test against what I get.

I'm taking the weekend off, so if you want to post issues for my Monday morning that would be great :)

gfmoore commented 4 years ago

0.0.3

"I guess the big question is how to do the (possibly) iterative tweaking to get the r of the data set to match the target. Easy or virtually impossible? I hope it's closer to the former, tho' I suspect not super-easy."

I don't understand this!

gfmoore commented 4 years ago

Uhmmm I've done the regression of x on y wrong. I'll rework it.

gfmoore commented 4 years ago

I'm currently using the following as a test dataset to start with:

x | y
--- | ---
-1 | -1
-0.5 | -0.5
-0.3 | 0.3
0.3 | -0.3
0.5 | 0.5
1 | 1

I'm getting .87 and 1.16 for the slopes of y on x and x on y, and there should? be a slope of 1.0 for the SD line (as I know it, from Berkeley uni)

I'm getting 1.01 ??
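A quick check of the test dataset, sketched in Python rather than in the project's own code, suggests where the 1.01 may come from. The arithmetic mean of the two slopes gives 1.01, while the geometric mean named in the spec gives 1.00. The function name is hypothetical, and the X on Y slope is taken as drawn in the figure (Syy/Sxy).

```python
import math


def line_slopes(xs, ys):
    """Slopes of the lines, all expressed as drawn (Y vertical, X horizontal)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    y_on_x = sxy / sxx
    x_on_y = syy / sxy  # reciprocal of Sxy/Syy: the slope as it appears in the figure
    return {
        "y_on_x": y_on_x,
        "x_on_y": x_on_y,
        "arithmetic_mean": (y_on_x + x_on_y) / 2.0,
        "geometric_mean": math.copysign(math.sqrt(y_on_x * x_on_y), sxy),
    }


xs = [-1.0, -0.5, -0.3, 0.3, 0.5, 1.0]
ys = [-1.0, -0.5, 0.3, -0.3, 0.5, 1.0]
for name, value in line_slopes(xs, ys).items():
    print(f"{name}: {value:.2f}")
# y_on_x: 0.87
# x_on_y: 1.16
# arithmetic_mean: 1.01
# geometric_mean: 1.00
```

The geometric mean reduces to SD of Y divided by SD of X, which is exactly 1 for this dataset because X and Y contain the same set of values.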

gfmoore commented 4 years ago

I think I found one issue in that the grid isn't square at different sizes. I need to check this out further. Sorry for these emails, but I don't want you "discovering" them :) Early days.

Oh, and I think I now understand what you mean by the iterative tweaking. You want me to figure out how to generate a data set with the specified correlation! Not just a random sample. I have no idea how to go about that at this stage.

gfmoore commented 4 years ago

Okay, sorted out the squariness! ;)

I'm adding a checkbox temporarily so you can see the test data or the random data - it's just under the button. Displays the test data above. The correlation line slope is wrong. Need help with this please. The x on y slope looks okay?

Going to bed now - Tour starts tomorrow :) - not today as I thought :(

gfmoore commented 4 years ago

0.0.6 All sorts of checks, rewrites, fixes.

Question. For x on y slope, is this from a y perspective (as displayed) or from the calculation of Sxy/Syy?

Also you will note the 90% confidence ellipse that I used to obtain the slope and line for the correlation line. (This can be removed, but working out correlation matrices, eigenvalues and vectors and working out semi-major, semi-minor axes and angles was a fun challenge. It is also how Galton originally presented his diagrams and I use it in my teaching: see https://en.wikipedia.org/wiki/Linear_regression, about half way down.)

As I understand it the y on x line should hit the ellipse at the vertical tangents to the ellipse, and the x on y should hit the ellipse at the horizontal tangents to the ellipse. The correlation line should lie along the semi-major axis. And it sort of bisects the other two slopes - is this true?

For low N this doesn't seem to quite work - due to low N I guess.

I may have all this completely wrong. I have never much gone beyond simple regression in my teaching, though of course I know a little more, but not much more. :)
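On whether the correlation line lies along the semi-major axis, the small sketch below compares the two slopes. It assumes the ellipse is the covariance ellipse, with axes along the eigenvectors of the sample covariance matrix. The function names are hypothetical. The two slopes agree when the sample SDs of X and Y are equal (or r = ±1) and differ otherwise, which may be why the picture looks off at low N, where the two sample SDs can differ noticeably.

```python
import math


def major_axis_slope(sxx, syy, sxy):
    """Slope of the major axis of the covariance ellipse.

    The major axis is the eigenvector of [[sxx, sxy], [sxy, syy]] with
    the larger eigenvalue; its angle theta satisfies
    tan(2 * theta) = 2 * sxy / (sxx - syy).
    """
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.tan(theta)


def sd_line_slope(sxx, syy, sxy):
    """Slope of the SD (correlation) line: SD_Y / SD_X with the sign of r."""
    return math.copysign(math.sqrt(syy / sxx), sxy)


if __name__ == "__main__":
    # Equal variances: both slopes are 1 (up to rounding).
    print(major_axis_slope(2.68, 2.68, 2.32), sd_line_slope(2.68, 2.68, 2.32))
    # Unequal variances: the major axis is steeper than the SD line.
    print(major_axis_slope(1.0, 4.0, 1.0), sd_line_slope(1.0, 4.0, 1.0))
```

So if the correlation line is taken from the ellipse's major axis, it will match the geometric-mean slope only in the equal-SD case.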

gfmoore commented 4 years ago

So I've spent some time trying to figure out how to generate a correlated data set that matches r. Unfortunately all avenues lead back to the Cholesky decomposition, which I discovered you are already using. So far I haven't found any other technique to try. Do you have any other ideas?
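One non-iterative alternative, offered only as a sketch and not as what ESCI does: remove from zy its linear dependence on zx, rescale what is left, and recombine. The sample correlation then equals the target by construction. The function name is hypothetical. A side effect is that Y ends up with sample mean exactly 0 and sample SD exactly 1, which differs from the tweak approach in the spec.

```python
import math
import random


def exact_r_dataset(zx, zy, r):
    """Return (xs, ys) whose sample correlation equals r up to rounding.

    xs is zx unchanged. ys is built from standardised zx plus the
    standardised residual of zy, which is uncorrelated with zx in this
    particular sample.
    """
    n = len(zx)
    mx = sum(zx) / n
    my = sum(zy) / n
    cx = [x - mx for x in zx]
    cy = [y - my for y in zy]
    sxx = sum(x * x for x in cx)
    sxy = sum(x * y for x, y in zip(cx, cy))
    # Residual of zy after regressing it on zx.
    resid = [y - (sxy / sxx) * x for x, y in zip(cx, cy)]
    see = sum(e * e for e in resid)
    sd_x = math.sqrt(sxx / (n - 1))
    sd_e = math.sqrt(see / (n - 1))
    k = math.sqrt(1.0 - r * r)
    ys = [r * (x / sd_x) + k * (e / sd_e) for x, e in zip(cx, resid)]
    return list(zx), ys


if __name__ == "__main__":
    random.seed(7)
    zx = [random.gauss(0, 1) for _ in range(10)]
    zy = [random.gauss(0, 1) for _ in range(10)]
    xs, ys = exact_r_dataset(zx, zy, 0.3)
    print(xs[:3], ys[:3])
```

Unlike the T tweak, this also works for a target of r = 0.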

gdcumming commented 4 years ago

0.0.7

I'll make some general comments here, then open a couple of more specific issues.

As I said, the basic display, including axes and marginal distributions, looks good, in fact excellent.

Panel 1. Fine

Panel 2. Fine, except the key thing of adjusting r for this data set to match target r. See separate issue.

Panel 3. Fine

Panel 4. Fine, tho' I'm only eyeballing positioning of cross through means as looking correct.

Panel 5. Fine, tho' I'm only eyeballing values of M and SD.

Panel 6. Lines. I'll open an issue.

I haven't been able to fault the squariness, which is great. OK also on ipad.

I'll take up comments and questions in previous comments above, in the two new issues.

Probably can close this issue?

gfmoore commented 4 years ago

Yep, let's open separate issues now.