Open jamdili1 opened 4 years ago
@jamdili1 : Thanks for opening the pull request! That part worked, but I don't think that your commit with the new post has come through. Could you make sure that you both committed your changes adding the post on your local computer and then also pushed them up to your repo on GitHub?
Success! Okay, now that it's up, we'll take a look and come back with some suggestions for editing to get to the final post.
Brooke,
I think I got it and it is showing up in the git pull request.
[image: Screen Shot 2020-03-04 at 8.54.56 AM.png]
@jamdili1 : Great start on this! I have some suggestions for edits, particularly for the first part. I think you're definitely on the right track, so my suggestions are mostly for clarifying and adding some details to walk the reader a bit more step-by-step through the ideas:
- In general, consider making the following edits where relevant:
  - Any time that you mention an R package or function in the text, put the name in backticks. This will make it render in "typewriter" text rather than the regular font, which helps readers immediately see that you're talking about computer code rather than regular words. It will look like this---`ggplot2`---instead of this---ggplot2.
  - For the code chunks where you load packages, I recommend that you set the code chunk options to `message = FALSE, warning = FALSE`. You'll put this right in the code where you start the chunk, so instead of just reading ```{r}, the chunk header will read ```{r message = FALSE, warning = FALSE}. This will keep it from adding all those messages about loading the package and masking things.
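For example, a setup chunk might look something like this (just a sketch; load whichever packages your post actually uses, though I'm assuming `ggplot2` and `dplyr` here since the code uses `ggplot()`, `%>%`, and `mutate()`):

```{r message = FALSE, warning = FALSE}
library(ggplot2)  # plotting
library(dplyr)    # %>% pipes and mutate()
```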
- For part a, I think they want you to just make a scatterplot, and then from that scatterplot try to guess how the data were generated (in this case, from how many distributions). I would recommend for this part that you:
  - It's great to show a bit of the dataset when we start, so readers can get a feel for the data we're working with. However, don't print out the full dataset, as you currently do. In the blog post, this ends up giving a really long read-out. Instead, try using `head` or `slice` to show just the first few rows (there's a short sketch of this right after this part's bullets). Then, either right before you print it out or right after, walk the reader through the columns that are included as well as what each row (observation) measures. If we're not sure what they all are, you could at least say what the main ones we'll use are (`x`, `yn`, and `class`).
  - Next, I suggest just showing the scatterplot (your first figure). For what I think they're asking for here (how many components are in the mixture, rather than whether the data are normally, binomially, etc., distributed), you can see that with just the scatterplot. Also, the scatterplot is enough to show you that a normal distribution is reasonable. As a note, I don't think they ever really, in this exercise at least, want you to use the columns `yb` or `yp`, just `yn` (the continuous one). So, I would recommend taking out the second and third plots that you put in for part a and just focusing on the scatterplot.
  - I suggest that you first do the scatterplot without the color for group. I think that the authors are trying, in this chapter, to show you how to work with data that's a mixture, but where there aren't actual labels for the groups. Therefore, I think in this exercise, they're trying to get you to see how you would work with this data if the `class` column were missing. Then, once we try to work with the data without those labels, we can check how well we did by comparing our guesses for class labels with these real values (which we wouldn't have in real life if we were working with data like this). To change the scatterplot, you'll just need to take out `color = class` in the code.
  - Finally, for part a, I think you probably want to talk about how this looks like a mixture composed of two groups. Then you could talk briefly about how it turns out it is, and that this example dataset gives us those class labels. This is where I would suggest showing your scatterplot with the class shown by color. That way, the readers can see what the data would look like in real life (when the class is a latent variable, so we don't know which points are in which class) and then what we can see from the omniscient point of view, when we do know the class labels. For this plot, I recommend that you change the `class` column to a factor before you make the plot. That way, it will show just the two colors for 1 and 2, rather than showing a scale where it looks like values can range continuously between 1 and 2. Here's how you could change your code to do that:

NPreg %>%
  mutate(class = as.factor(class)) %>% # Change the `class` column to a factor
  ggplot(aes(x = x, y = yp, color = class)) +
  geom_point()
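And here's roughly what those first two steps could look like. This is only a sketch, using the `yn` column (the continuous outcome) as discussed above; adjust it to match your post:

library(dplyr)
library(ggplot2)
data("NPreg", package = "flexmix")  # the example dataset

NPreg %>%
  slice(1:6)  # show just the first few rows rather than the whole dataset

ggplot(NPreg, aes(x = x, y = yn)) +  # leave out color = class for now, since we're treating the labels as unknown
  geom_point()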
- For part b, could you talk a bit more about the `flexmix` function? I recommend that you include:
  - The package it comes from.
  - An overview of what it's doing (estimating parameter values using the EM algorithm, so that you can estimate at once both the model parameters and which group of the mixture each observation belongs to---check out the help file at `?flexmix` and the package vignette at https://cran.r-project.org/web/packages/flexmix/vignettes/flexmix-intro.pdf for a bit more detail to work in).
  - Just a bit to explain why we're using the EM algorithm to fit a model for data like this. You'll want to include some key terms in that, like "mixture" and "latent variable".
  - What parameters you're inputting to the `flexmix` function (it looks like it takes a model formula, which gives the structure of the model you want to fit, the name of the dataset with the data to fit, and then `k`, which gives the number of components you think make up the mixture).
  - It would also be nice to comment on why we've picked the model formula we have (`x + I(x^2)`), and why we're using `I()` in it (you can google 'r model formula "I()"' to find out more).
  - It's really nice that you're summarizing the output you get from running `print` and `summary` once you've fit the model. Could you add a bit more detail to walk the reader through how to interpret the output, though? For example, could you explain what they mean when they say "convergence after 17 iterations", and what "Comp. 1" and "Comp. 2" in the summary tables mean? Also, at this point, could you show the reader how they can extract the predicted class labels from the model output (`clusters(m1)`)? I think it would be helpful for the reader to understand that, at this point, they have estimates of both the model parameters (which you would get from running a normal linear regression algorithm) and estimates of which component of the mixture each point belongs to (which you wouldn't get from running a normal linear regression algorithm). There's a sketch of these pieces right after these bullets.
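To make that concrete, here's a rough sketch of the fit and the pieces of output I mentioned. I'm assuming the model object is named `m1` (as in your `summary` output) and that the outcome is `yn`; adjust to match your post:

library(flexmix)
data("NPreg")  # example dataset shipped with flexmix

# Two-component mixture of regressions: yn modeled as a quadratic in x
m1 <- flexmix(yn ~ x + I(x^2), data = NPreg, k = 2)

m1                  # printing shows, among other things, how many EM iterations it took to converge
summary(m1)         # component-level summary, with rows labeled "Comp. 1" and "Comp. 2"

parameters(m1)      # estimated regression coefficients (and sigma) for each component
head(clusters(m1))  # predicted component (class) label for each observation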
- For part c, I think you're right on track. Just a few small things to add:
  - Could you explain a bit more explicitly that `NPreg$class` gives the true class label of each observation, while `clusters(m1)` gives the guesses from the model fit with `flexmix`? I think this would be helpful to explain right before you give the code for the truth table.
  - Maybe instead of using the words "cluster" and "clustering" in your explanation for this part, use the words "classes" / "class labels"?
  - I think the following statement could be clarified: "As we can see 5% of data points are misclustered." Could you explain that 5% of the points that are truly in class 1 were misclassified by the model as belonging to class 2 (row 1 of the truth table), and similarly that 5% of the points that were really in class 2 were misclassified as being in class 1 (row 2 of the truth table)? That does indeed get you to 5% of the total data being misclassified, but I think that might take a few more mental steps than some readers would follow along with.
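In case it's useful when you explain the truth table, here's a sketch of the comparison (same assumed object name `m1` as above):

# Cross-tabulate the true class labels against the model's predicted labels
table(true = NPreg$class, predicted = clusters(m1))

# Note: the component numbers flexmix assigns are arbitrary, so its "1" and "2"
# may be flipped relative to NPreg$class when you read the table.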
- Part d is also almost there. My only minor suggestions for updates for that are:
  - Could you convert the class labels to factors before plotting, so you will get a discrete rather than continuous scale? (See my suggestions and code on this in part a.)
  - I don't think they actually want you to do anything with modeling with a Poisson or binomial distribution for this question. I recommend you take out the parts on fitting those two models, and the results from them, from this section.

If you make these changes and then push them up to your GitHub fork of the repo, they should be added to this pull request (just in case, send along an email when you push the changes, so I can make sure I haven't missed them).
Excellent! I will look through this and make changes, then get back to you. Thanks for the feedback.
James
@jamdili1 : I just wanted to check in and see if you'd gotten the chance to make these updates?
Brooke,
I have not been able to make the updates. Life and work came at me really hard and fast the last month. I am sorry I haven't been able to attend the class. Right now I don't have the bandwidth to accommodate this class; I apologize. I will try to make the updates that you recommended on my segment.
thanks,
James
That is no problem at all and completely understandable! No worries if you can't do this. We enjoyed having you join a lot of the class!
chapter 4 solution. James