bryanhanson / LearnPCA

Functions, Data Sets and Vignettes to Aid in Learning Principal Components Analysis (PCA)
https://bryanhanson.github.io/LearnPCA
10 stars 4 forks source link

Please Review #8

Closed bryanhanson closed 2 years ago

bryanhanson commented 2 years ago

As of right now, all vignettes are on the devel branch and the automatic builds work. I am working my way through the vignettes, trying to pseudo-finalize them. DTH if you could please review these vignettes and make any changes you desire -- no need to have a discussion unless you think it desirable.

I'll update this list as I complete my reviews.

dtharvey commented 2 years ago

Hi Bryan

I've looked over both and made a few suggested changes to the conceptual introduction: nothing big; more of an opportunity to see if I can push them back up. I have additional thoughts that I'll share here as they are more for you to think about and for us to talk about as needed.

  1. For the glass data set and its loadings plot, the difference in signs for Na2O, SiO2, and CaO contain information in that PC1 suggests that glasses with lots of calcium tend to have lower amounts of Na2O and of SiO2. You can see this really nicely when you plot CaO vs. SiO2 where shows a strong negative correlation.
  2. The social science data set is interesting. Is it worth identifying the one red outlier and the one blue state that is drifting toward being an outlier?
  3. When I knit this I got two warnings, one that ChemoSpect was built under R Version 4.1.1 and one for ChemoSpecUtils that it was built under R Version 4.1.1; I assume these should be suppressed.
  4. Dig you want to comment on the Scree and/or Scores plot for the spectroscopy data. It looks as through there is but one PC, which makes a plot of PC1 vs. PC2 not very useful.
  5. Do you want to say something about what a carbonyl is and the two types shown here; thinking about the political scientist who stumbles into here.

Take care, David

On Sun, Jan 16, 2022 at 6:46 PM Bryan Hanson @.***> wrote:

Assigned #8 https://github.com/bryanhanson/LearnPCA/issues/8 to @dtharvey https://github.com/dtharvey.

— Reply to this email directly, view it on GitHub https://github.com/bryanhanson/LearnPCA/issues/8#event-5900972683, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXZG6EV5BHTVF4A7KHKWALUWNKFDANCNFSM5MDLMPGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were assigned.Message ID: @.***>

-- David Harvey Professor of Chemistry & Biochemistry DePauw University @.***

dtharvey commented 2 years ago

Just pushed some suggestions. Let me know if you don't see them.

David

On Sun, Jan 16, 2022 at 6:46 PM Bryan Hanson @.***> wrote:

As of right now, all vignettes are on the devel branch and the automatic builds work. I am working my way through the vignettes, trying to pseudo-finalize them. DTH if you could please review these vignettes and make any changes you desire -- no need to have a discussion unless you think it desirable.

  • Start_Here.Rmd
  • Conceptual_Intro_PCA.Rmd

I'll update this list as I complete my reviews.

— Reply to this email directly, view it on GitHub https://github.com/bryanhanson/LearnPCA/issues/8, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXZG6DJK7FRJC6WN5MJHCDUWNKFBANCNFSM5MDLMPGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were assigned.Message ID: @.***>

-- David Harvey Professor of Chemistry & Biochemistry DePauw University @.***

bryanhanson commented 2 years ago

I see your changes, which look good, thank you. On your comments above:

  1. I have an idea to illustrate the correlation. I will add it along with some discussion prob. tomorrow.
  2. I looked for a long time to find that data set -- it was surprisingly difficult to find something in this discipline which was interesting and didn't involve negative topics, like rapes and murders. We can probably do even better but will have to keep our eyes open. Since the color-coding is based on data taken ~40 years after the main data, I'm not inclined to pursue it in any further detail (esp. if we can find a better data set).
  3. Those warnings are likely due to your R being older.
  4. I will adjust the wording.
  5. Great point, I will adjust the wording.
bryanhanson commented 2 years ago

Alright, I've finished my pass through all the vignettes. In the original checklist above you can click to show you've done the review. I'm going to do another pass after a break.

dtharvey commented 2 years ago

Made a few small suggestions for three of the vignettes and will push them up.

For Step-by-Step, when I look at the .Rmd file I see lots of the red circle with white x that flag code error in many of the code blocks (for example, line 89, 97, 98, and 108, but others as well). Seems to knit fine, however, so perhaps this is some issue with my version of R and/or packages such as ggplot.

For Step-by-Step, I see that the order of the footnotes in the .Rmd file is not the same as in the knitted file. I mention this because I have a comment on the first footnote in the knitted file, which is not the first in the .Rmd file. The footnote appears with this sentence from the main text: "As mentioned above, the FeCu data should be scaled to avoid the Fe2O3 values dominating the analysis, since these values are larger." Given the scales used, however, it actually is the Cu data that has the larger values (mean of 224 for Cu and 15.4 for Fe2O3). There is, of course, less Cu in these samples, which we see if we convert them from ppm to percentages: for the first sample, we would then have 9.50% Fe2O3 and 0.0594% Cu, or, if we convert Fe2O3 to ppm we have 950,000 ppm Fe2O3 and 594 ppm Cu. One thing that scaling does is remove from consideration the choices we make about how to report values.

On Visualizing PCA in 3D, I don't think I can quickly work on trying to implement the plotly versions of the figures, but would love to see what they might look like. Is that something you can add easily? If yes, then I might revisit the language of the text with those in place as it would invite prompts such as "Try....and note how it affects....". If no, then let's flag this is as something to work on for the next iteration. I still want to work up a "case study" using Brian Sauliner's data.

When I try to knit Functions_PCA I get a fatal error that reads "Line 157: Error in PCAtoXhat (prin): pca should be of class prcomp...". I think this popped up earlier, but don't recall the details. If you send the .html file, I can look that over and send comments by email.

Here is the book that I found most useful when thinking about visualizations: Chemometrics: A Practical Guide Kenneth R. Beebe, Randy J. Pell, Mary Beth Seasholtz Wiley: New York Copyright 1998 ISBN: 0-471-12451-6

bryanhanson commented 2 years ago

I will go through your comments a little later, thank you. On the knit error with Functions_PCA.Rmd, I think you need to build, check and load the package locally and then it should work. The message suggests you have a local version from before I made some changes.

dtharvey commented 2 years ago

HI Bryan

Ah...I built and installed the package and was able to knit. I read through the vignette Functions_PCA and didn't see the need to edit anything. I did notice that for both pca and prin, the first two rows of scores are identical; this is because the first two entries in the matrix mt are identical (in the original data set the Mazda RX4 and Mazda RX4 Wagon are identical on all variables except for wt and qsec, which aren't captured in the line mt <- as.matrix(mtcars[1:10, 1:5] as they are the 6th and 7th variables. Nothing wrong with that unless you want to make note that when two samples are identical, they yield, as one might expect, identical scores. Did you give any thought to comparing the three plots (scree, scores, loadings) for the two methods? For example, you could plot PC1 vs. PC2 for pca and for prin on the same axes using different colors. The symmetry of the results are clear, although you can see that in the numerical values as well.

David

On Sun, Jan 23, 2022 at 11:55 AM Bryan Hanson @.***> wrote:

I will go through your comments a little later, thank you. On the knit error with Functions_PCA.Rmd, I think you need to build, check and load the package locally and then it should work. The message suggests you have a local version from before I made some changes.

— Reply to this email directly, view it on GitHub https://github.com/bryanhanson/LearnPCA/issues/8#issuecomment-1019524396, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXZG6DNOWUN3J6N2L2KVGDUXQXHNANCNFSM5MDLMPGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

-- David Harvey Professor of Chemistry & Biochemistry DePauw University @.***

bryanhanson commented 2 years ago

Further responses:

Other than these items, I'm going to look over a few more articles I have stashed to see if there is anything that should be added to our work, or if I should add them to Works Consulted. I think we are getting very close to a release, what do you think?

dtharvey commented 2 years ago

Hi Bryan

I think it is a nice collection of articles, and a nice mix of high-level and closer-to-the-ground level documents. I'll be interested to see what comments arise from others.

I thought you might find the overlaid plots of the PC scores interesting. It is just another example of the difference in the directions in which the two functions happen to rotate the axes, giving rise to the sign differences. What appears as a positive score for one gives a negative score for the other. There are slight differences in the scores themselves, so it isn't perfectly symmetrical (compare, for example, the black dot and red on the horizontal dashed line, which are in slightly different positions. I think for the loading plots, you will see the variables reversing their signs. Not sure it adds anything other than another opportunity to make the point that you can define rotation in two directions.

David

On Mon, Jan 24, 2022 at 2:16 PM Bryan Hanson @.***> wrote:

Further responses:

  • The mysterious red circles you are seeing must be a feature or RStudio, and the fact they are showing up must be due to an older version of RStudio or one of the packages. I don't see them of course, and as you say, the document knits/renders fine.
  • On the footnote about scaling: good catch. I should have said Cu would dominate. So I changed that and improved the footnote to make clear that if we put both elements in the same units then it is Fe2O3 will dominate.
  • On Visualizing in 3D, let me look that over tonight and see how much work it will be to make all relevant plots in plotly If not too bad I will do that, if it's a lot we can remove the current plotly example and build it all in plotly later.
  • On mt <- as.matrix(mtcars[1:10, 1:5] I will look at that and probably choose different rows to avoid identical rows. I will also think about the plots you suggest.

Other than these items, I'm going to look over a few more articles I have stashed to see if there is anything that should be added to our work, or if I should add them to Works Consulted. I think we are getting very close to a release, what do you think?

— Reply to this email directly, view it on GitHub https://github.com/bryanhanson/LearnPCA/issues/8#issuecomment-1020453173, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXZG6DFZZRKQRCDFMYNPODUXWQQ5ANCNFSM5MDLMPGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

-- David Harvey Professor of Chemistry & Biochemistry DePauw University @.***

dtharvey commented 2 years ago

Hi Bryan

Few thoughts on the vignette on comparing functions. I wonder if there is merit in showing the scree plot (or variance table) for each, along with the scores plot for PC1 and PC2 for each, and the first loadings plot for both either instead of the numerical output or in tandem with the output. These are the same ways you present results in other vignettes, so there may be merit in examining things through this lens here as well. This would show (1) that both methods return the same estimate about the number of underlying principal components, that (2) both methods suggest that the first two PCs divide the samples into the same two groups, and that (3) both methods show that the variables cyl, disp, hp track together on PC1 and that the variables mpg and drat track together on PC1 but in opposition to cyl, disp, and hp. Also (4) the symmetry in the scores plot arises from the difference in the signs of the loadings for PC1 and PC2 between the two methods, which shifts the signs of the scores. Finally, (5) the small difference in the absolute values of the individual scores on PC1 and PC2 between the methods must arise from the difference in the two computation methods, one that works with the 10x5 matrix of data and one that first converts it into a 5x5 correlation matrix.

Take care, David

On Mon, Jan 24, 2022 at 2:50 PM David Harvey @.***> wrote:

Hi Bryan

I think it is a nice collection of articles, and a nice mix of high-level and closer-to-the-ground level documents. I'll be interested to see what comments arise from others.

I thought you might find the overlaid plots of the PC scores interesting. It is just another example of the difference in the directions in which the two functions happen to rotate the axes, giving rise to the sign differences. What appears as a positive score for one gives a negative score for the other. There are slight differences in the scores themselves, so it isn't perfectly symmetrical (compare, for example, the black dot and red on the horizontal dashed line, which are in slightly different positions. I think for the loading plots, you will see the variables reversing their signs. Not sure it adds anything other than another opportunity to make the point that you can define rotation in two directions.

David

On Mon, Jan 24, 2022 at 2:16 PM Bryan Hanson @.***> wrote:

Further responses:

  • The mysterious red circles you are seeing must be a feature or RStudio, and the fact they are showing up must be due to an older version of RStudio or one of the packages. I don't see them of course, and as you say, the document knits/renders fine.
  • On the footnote about scaling: good catch. I should have said Cu would dominate. So I changed that and improved the footnote to make clear that if we put both elements in the same units then it is Fe2O3 will dominate.
  • On Visualizing in 3D, let me look that over tonight and see how much work it will be to make all relevant plots in plotly If not too bad I will do that, if it's a lot we can remove the current plotly example and build it all in plotly later.
  • On mt <- as.matrix(mtcars[1:10, 1:5] I will look at that and probably choose different rows to avoid identical rows. I will also think about the plots you suggest.

Other than these items, I'm going to look over a few more articles I have stashed to see if there is anything that should be added to our work, or if I should add them to Works Consulted. I think we are getting very close to a release, what do you think?

— Reply to this email directly, view it on GitHub https://github.com/bryanhanson/LearnPCA/issues/8#issuecomment-1020453173, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXZG6DFZZRKQRCDFMYNPODUXWQQ5ANCNFSM5MDLMPGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

-- David Harvey Professor of Chemistry & Biochemistry DePauw University @.***

-- David Harvey Professor of Chemistry & Biochemistry DePauw University @.***

bryanhanson commented 2 years ago

Added a comparison of variance explained to Functions_PCA.Rmd plus a few wording improvements. Take a look at the new version and see what you think.

dtharvey commented 2 years ago

Hi Bryan

Looks great. Love the plotly plots! The perspective of the static image I created for Figure 6 always left me worrying that the blue square and points might not actually be in a plane perpendicular to the PC1 axis. Rotating the plotly version makes it clear they do! Nice job on that! For Figure 11, however, something about it doesn't look quite right. I think it is the data cloud. Compare Figure 4 and Figure 10, which have the same data cloud, one with just PC1 added and the other with all three PCs Then compare Figure 4 and Figure 5, which seem to have the same data cloud. Finally, compare Figure 5 with Figure 11; the data clouds in these two look different from each other.

Take care, David

On Wed, Jan 26, 2022 at 11:56 AM Bryan Hanson @.***> wrote:

Added a comparison of variance explained to Functions_PCA.Rmd plus a few wording improvements. Take a look at the new version and see what you think.

— Reply to this email directly, view it on GitHub https://github.com/bryanhanson/LearnPCA/issues/8#issuecomment-1022394946, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAXZG6DCBS3IX2J3LEYJJXDUYARUTANCNFSM5MDLMPGA . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

-- David Harvey Professor of Chemistry & Biochemistry DePauw University @.***