@camronp : Really nice start on this! I just had a few suggestions on some edits, all pretty minor. If you make these changes and push them to your own GitHub repo, then they should come through on the Pull Request as long as we have it open (send me an email when you push just in case so I can make sure it came through).
Hi Brooke,
Thank you for the helpful feedback! I still had issues redefining and understanding the spatial point process and Poisson process, but I tried to rework the definitions to make them more specific. I went through the rest of your suggestions and changed the necessary definitions. Everything was pushed to my own repository, so hopefully it all makes it into the opened pull request.
Let me know if I need to change anything else today! Thanks, Camron
On Mon, Apr 27, 2020 at 6:57 PM Brooke Anderson notifications@github.com wrote:
- For "slots", I recommend changing the term to "slot", so it will agree with the definition in terms of plural / singular, and adding to the beginning "In the context of object-oriented programming in R,"
- For "classification", I think they might mean that as a process, rather than as the resulting set. I suggest changing to something like "the process of grouping observations in a dataset by their similarities in terms of measured characteristics"
- I think you're on the right track with "feature extraction", in terms of capturing why we often do it. However, I think the definition is missing a bit of the heart of the concept. In particular, it would be helpful to include the idea that this is a way of creating new measurements (i.e., new columns in a dataset) from the data you're given (or have measured). I think Wikipedia has some text that would be helpful in this definition (https://en.wikipedia.org/wiki/Feature_extraction). Maybe something like (using some of the language in Wikipedia) "the process of building derived values to describe observations in a dataset from the initial set of measured data, with the aim of creating a set of characteristics that is informative and non-redundant"? Your current definition emphasizes using this to reduce the required resources (in terms of memory storage or computational power, I guess); that's often a nice side benefit, but the main goal is to create a "new" set of measurements for the observations that is derived from the original ones but in some way more helpful.
- For "Poisson process", I recommend making "process" lowercase
- For "spatial point process" and "Poisson process", I think that the first is a specific type of the second, so it would be nice for the definitions to make that a bit clearer. Let's see if @baileyfosdick https://github.com/baileyfosdick has any suggestion on those two definitions and how we could make connections between the two clearer.
- For "Ripley's K function", I recommend changing "and can help" to "that can help"
- For "linear filter", I recommend adding at the beginning "A tool for"
- For "binary images", could you either change the term to "binary image" or change the definition to start "images" instead of "an image" so that the term and definition agree on singular / plural?
- I'm not sure that the definition for "morphological operations" has completely captured that idea. mathworks.com has some language that could be used in edits for this definition---maybe something like "image processing operations in which each pixel in the image is adjusted based on other pixels in its neighborhood"? (if you use that, be sure to add mathworks.com to the works cited section)
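To make the "binary image" and "morphological operations" definitions a bit more concrete, here is a toy sketch in base R (an illustration only, not tied to any particular image-analysis package) of a dilation, a simple morphological operation in which each pixel is updated based on its 3 x 3 neighborhood:

```r
# A binary "image" is just a matrix of 0s and 1s; dilation sets each pixel to 1
# if any pixel in its 3 x 3 neighborhood is 1.
dilate_binary <- function(img) {
  out <- img
  nr <- nrow(img)
  nc <- ncol(img)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      rows <- max(1, i - 1):min(nr, i + 1)
      cols <- max(1, j - 1):min(nc, j + 1)
      out[i, j] <- as.integer(any(img[rows, cols] == 1))
    }
  }
  out
}

img <- matrix(0L, nrow = 7, ncol = 7)
img[4, 4] <- 1L
dilate_binary(img)  # the single "on" pixel grows into a 3 x 3 block
```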
@camronp Here are some suggestions for definitions for Poisson process and spatial point process:

- Poisson process = a mechanism that generates instantaneous events (in time and/or space) based on the Poisson distribution.
- spatial point process = a mechanism that generates a collection of coordinates, or points, randomly located along an underlying mathematical space, with at most one point observed at any location.
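To illustrate those two definitions, here is a minimal sketch of simulating a homogeneous spatial Poisson process on the unit square (the intensity and window here are arbitrary choices, just for illustration):

```r
# Homogeneous spatial Poisson process on a 1 x 1 window: the number of points
# is Poisson with mean lambda * area, and, given that count, the locations are
# independent and uniform over the window (so at most one point per location).
set.seed(1)
lambda   <- 100                        # expected number of points per unit area
n_points <- rpois(1, lambda * 1 * 1)   # area of the unit square is 1
pts <- data.frame(x = runif(n_points), y = runif(n_points))
plot(pts$x, pts$y, asp = 1, pch = 16, xlab = "x", ylab = "y",
     main = "Simulated spatial Poisson process")
```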
@camronp : Great, the post is now up! Could you go back to Quizlet and update that with the latest version of the terms? You can copy the latest tsv from here: https://raw.githubusercontent.com/geanders/csu_msmb/master/content/post/vocab_lists/chapter_11.tsv
Of course! Just did it. I shouldn't need to change the embedding code, right? Thanks! Camron
That’s right. When you embed, it’s just pointing to what’s posted somewhere else.
@camronp : Alright, I know that the title of this pull request is for the vocab, but I think that it's also got your exercise post in it, which is why I've kept it open, and I see that it looks like you've done quite a bit on that. Is that post ready for me to take a look at and add some suggestions for edits?
@geanders I'm embarrassed to admit that I thought I had submitted this to you nearly a month ago! If you don't mind, let's use this as the exercise submission. Again, I apologize for the super late submission and am open to your edits and suggestions on the exercise! Thanks!
@camronp : No, that's my fault! I think that the exercise got incorporated into the vocab pull request, so I just wasn't sure if it was ready for review or not. I'll take a look and add my suggestions!
@camronp : Really nice work on this! I have some suggestions for edits. Most are to add discussion or explanations along the way, particularly in Part A. We'll leave this pull request open, and so once you make your edits and push them to your own GitHub repo, they should automatically come through the pull request as well. Please let me know if you have any questions about any of these comments.
- I noticed that you sometimes use quotation marks around the package name in your `library` call and sometimes don't. Either is fine, but the write-up will be cleaner if you always make the same choice. Given that, it's great that you load all your libraries at the beginning, but could you edit so all those calls either do or don't use quotation marks? (Great job being consistent with your assignment operator (`=`) throughout, though!)
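  For example (both forms are valid R; the point is just to pick one style and use it throughout):

  ```r
  library(ggplot2)     # package name without quotation marks
  library("ggplot2")   # package name with quotation marks -- equally valid
  ```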
- It would be helpful to show the simulated data (both by printing the `simdat` dataframe and through a scatterplot) in Part A. I suggest that, in Part A, you add something like `ggplot(simdat, aes(x = x, y = y, color = class)) + geom_point()` to show how the original `simdat` data is simulated to have four separate groups (and maybe also something like `head(simdat)` to show what this data looks like after you simulate it). The code for simulating the data is a bit dense, with an `lapply` inside an `lapply` and two function definitions inside the object definition code, so I think it would be great to help readers "see" what we end up with after that code, since many probably can't process it in their heads. (A sketch of this kind of quick look follows below.)
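  A rough sketch of what that quick look could involve (an illustration only; the exercise's own simulation code, built with nested `lapply` calls, likely differs in its details):

  ```r
  # Illustration only: simulate 100 points around each two-way combination of
  # 0 and 8, stack the four groups, then show the data with head() and ggplot().
  library(dplyr)
  library(ggplot2)

  set.seed(123)
  centers <- expand.grid(mx = c(0, 8), my = c(0, 8))
  simdat <- bind_rows(lapply(seq_len(nrow(centers)), function(i) {
    data.frame(x = rnorm(100, mean = centers$mx[i]),
               y = rnorm(100, mean = centers$my[i]),
               class = paste0("group_", i))
  }))

  head(simdat)
  ggplot(simdat, aes(x = x, y = y, color = class)) +
    geom_point()
  ```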
- Could you also add some explanation in the text of what is going on in this code? A few questions that it would be helpful to address:
  1. How are the `lapply`s working in the code to create the `simdat` dataset?
  2. What is ultimately in the `simdat` dataset (in terms of what each row and column represent and its dimensions)? And what are the `mx` and `my` in this call? (It looks to me like this code walks through the four possible two-way combinations of 0 and 8---0,0; 0,8; 8,0; and 8,8---and then simulates a dataframe with 100 points with x- and y-values randomly distributed around whichever combo of 0 and 8 it's on. That ends up with four dataframes that get stuck together into one big one with the `bind_rows`. You explain this last bit in the text under the chunk, but it might be helpful for readers to understand the first part, too, of how this code is walking through the 0-8 combos and generating random datasets around each of the four combos, giving random data with four clear clusters.)
  3. What is the `wss` dataframe? It looks like you set it up as a one-column dataframe with blank (`NA`) values, and then the code is adding in some values as it processes---it would be helpful to explain what we're trying to get in this dataframe through the code and how we're calculating it. It looks like it's maybe giving the within-cluster sum of squares from running k-means clustering on this random data, moving from using one group (`k = 1`) up to eight (`k = 8`)? Any ideas on why things keep looking better as you keep adding to the number of clusters (`k`)? Ideally, if we know the data have four clusters (and we do, because we made it that way), it seems like we would want some measure that helps us see that 4 is the "right" number of clusters to pick when clustering the data, wouldn't it?
  4. Why do we need to calculate `value` in the first row of `wss` differently than we calculate this value in the other rows? In the first row, we're calculating it as `sum(scale(simdatxy, scale = FALSE)^2)`, but then for all the others we're running a `kmeans` and then pulling out the `withinss` value for all the observations and summing it.
  5. Could you add a bit more on the overall aim of the code in this chunk? It looks like we're maybe trying to see how the within-cluster sum of squares, estimated for a dataset that's made up of four clusters of 100 observations each, changes as you increase the number of clusters from 1 to 8 when clustering using k-means. I think it would be helpful for readers to understand that this is what we're aiming for, and that the results of this are what's being shown in the first plot generated by this chunk of code. Also, is there some relationship between the within-cluster sum of squares that you show in the first plot here and the silhouette index that you calculate next? In general, could you clarify why we're checking out this first plot before we move on to the silhouette index? (A sketch of how this sum-of-squares table might be built follows below.)
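  An illustrative sketch of the logic described above (not the exercise's exact code):

  ```r
  # Collect the total within-cluster sum of squares for k = 1 through 8,
  # using only the x and y columns of the simulated data.
  simdatxy <- simdat[, c("x", "y")]

  wss <- data.frame(k = 1:8, value = NA_real_)
  # With k = 1 there is a single "cluster", so the value is the total sum of
  # squares around the overall centroid (scale() here only centers the columns).
  wss$value[1] <- sum(scale(simdatxy, scale = FALSE)^2)
  for (k in 2:8) {
    wss$value[k] <- sum(kmeans(simdatxy, centers = k)$withinss)
  }

  # The curve keeps dropping as k grows, which is part of why it is hard to
  # read off "4 clusters" from this plot alone.
  plot(wss$k, wss$value, type = "b",
       xlab = "number of clusters (k)", ylab = "within-cluster sum of squares")
  ```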
- Could you add a bit of explanation of what `pam` is doing? (I think it's doing pretty much the same thing as the `kmeans` call from the earlier code chunk, but maybe using the `cluster` package's algorithm for k-means clustering? Maybe you have to use it here because the `silhouette` function can only be run on objects of the type output by `pam`? I'm not sure, but some follow-up and explanation would be really helpful here.)
- Could you set `fig.width` and `fig.height` in the code chunk options where you're making these plots to make sure that they don't show up as blank?
- It would also help to add some interpretation of the silhouette plot. I think the `n =` under the title is saying how many observations are in the original data (400 points, in this case). On the right side, they're telling us how many clusters were used in the clustering algorithm, and I think it's also maybe saying how many of the observations are being put in each of these four clusters (for example, 103 observations were assigned to cluster 1)? Then maybe there's a number giving the average silhouette index value for the points assigned to that cluster? The helpfile for the `plot` method for `silhouette` (`?plot.silhouette`) might help some in figuring out how these plots should be interpreted. (That helpfile has this quote, which seems helpful: "Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.") In this plot in Part A, it looks like there are a couple of points with negative values of s (their bars go the opposite direction from all the others in this silhouette index plot), but then most are pretty easy to distinguish as belonging to the cluster where they were assigned (silhouette values that are positive and pretty far from 0). (A minimal sketch of how these pieces fit together follows below.)
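  A minimal sketch using the `cluster` package (the exercise's exact calls may differ):

  ```r
  # Illustration only: run PAM with four clusters on the x/y data, then compute
  # a silhouette width for every observation and draw the silhouette plot.
  library(cluster)

  simdatxy <- simdat[, c("x", "y")]   # just the coordinate columns of simdat
  pam4 <- pam(simdatxy, k = 4)        # partition the 400 points into 4 clusters
  sil  <- silhouette(pam4)            # per-observation silhouette widths
  summary(sil)$avg.width              # average silhouette width across all points
  plot(sil)                           # the silhouette plot discussed above
  ```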
- It might also be nice to make a plot of the original data showing, for each point, (1) the cluster it was assigned to in the `pam` k-means clustering and (2) its silhouette index, helping to show how well it looks like it belongs to that cluster. One way of doing that might be:

  ```r
  sil %>%
    unclass() %>%
    as.data.frame() %>%
    tibble::rownames_to_column(var = "orig_order") %>%
    arrange(as.numeric(orig_order)) %>%
    bind_cols(simdat) %>%
    ggplot(aes(x = x, y = y, shape = as.factor(cluster), color = sil_width)) +
    geom_point() +
    facet_wrap(~ class)
  ```

  In this, there's a facet for each of the "true" groupings of the points (the four groups you originally simulated). Then, the shape shows the group each point was assigned to by k-means (with the `pam` call). For the most part, all the points that were part of a true original group are assigned to the same cluster, although occasionally ones along the border with another cluster are mis-assigned. For the silhouette index, you can see from this that it gets close to zero (or even negative) when you get close to those borders between the original groups (around where x and y equal 4--5), but then it is nice and high in the middle of the x-y space for a cluster and on edges that are away from any other cluster.
- One more place where a little clarification would help is the section where you run `sil = silhouette(pam4, 4, border = NA)`. In this section, when you run `summary(sil)$avg.width`, I think you're getting the average silhouette index across all the observations in the dataset, right? So this might be a metric you could use to summarize things at the level of the full clustering process, but then you're also getting some information for each specific observation that gives you an idea of how well it looks like each one belongs to the cluster it was assigned to. (Or are the calculations for each observation the silhouette distances, and then this observation-wide summary the silhouette index? If so, it would be helpful to clarify that terminology for the readers.)

@geanders Thank you so much for all of your suggestions and edits. I spent quite a bit of time changing and improving the exercise 5 document with your recommendations. I just pushed it up to my repository and it looks like it made it into the pull request.
@camronp : Really, really nice work on these revisions! It's now live here: https://kind-neumann-789611.netlify.app/post/exercise-solution-for-5-1/
Here is the vocab 11. I'm not sure if the "2020-04-13-vocabulary-for-chapter-11.Rmd" is in the proper folder or not. Let me know if you have any issues! There were also a lot of things that got committed that I'm not entirely sure where they came from... but hopefully it makes sense to you!