@camronp : Really nice start on this! I just had a few suggestions on some edits, all pretty minor. If you make these changes and push them to your own GitHub repo, then they should come through on the Pull Request as long as we have it open (send me an email when you push just in case so I can make sure it came through).
Hi Brooke,
Thank you for the helpful feedback! I still had issues redefining and understanding the spatial point process and Poisson process, but I tried to rework the definitions to make them more specific. I went through the rest of your suggestions and changed the necessary definitions. Everything was pushed to my own repository, so hopefully it all makes it into the opened pull request.
Let me know if I need to change anything else today! Thanks, Camron
On Mon, Apr 27, 2020 at 6:57 PM Brooke Anderson notifications@github.com wrote:
- For "slots", I recommend changing the term to "slot", so it will agree with the definition in terms of plural / singular, and adding to the beginning "In the context of object-oriented programming in R,"
- For "classification", I think they might mean that as a process, rather than as the resulting set. I suggest changing to something like "the process of grouping observations in a dataset by their similarities in terms of measured characteristics"
- I think you're on the right track with "feature extraction", in terms of capturing why we often do it. However, I think the definition is missing a bit of the heart of the concept. In particular, it would be helpful to include the idea that this is a way of creating new measurements (i.e., new columns in a dataset) from the data you're given (or have measured). I think Wikipedia has some text that would be helpful in this definition (https://en.wikipedia.org/wiki/Feature_extraction). Maybe something like (using some of the language in Wikipedia) "the process of building derived values to describe observations in a dataset from the initial set of measured data, with the aim of creating a set of characteristics that is informative and non-redundant"? Your current definition emphasizes using this to reduce the required resources (in terms of memory storage or computational power, I guess); that's often a nice side benefit, but the main goal is to create a "new" set of measurements for the observations that is derived from the original ones but in some way more helpful.
- For "Poisson process", I recommend making "process" lowercase
- For "spatial point process" and "Poisson process", I think that the first is a specific type of the second, so it would be nice for the definitions to make that a bit clearer. Let's see if @baileyfosdick https://github.com/baileyfosdick has any suggestion on those two definitions and how we could make connections between the two clearer.
- For "Ripley's K function", I recommend changing "and can help" to "that can help"
- For "linear filter", I recommend adding at the beginning "A tool for"
- For "binary images", could you either change the term to "binary image" or change the definition to start "images" instead of "an image" so that the term and definition agree on singular / plural?
- I'm not sure that the definition for "morphological operations" has completely captured that idea. mathworks.com has some language that could be used in edits for this definition---maybe something like "image processing operations in which each pixel in the image is adjusted based on other pixels in its neighborhood"? (if you use that, be sure to add mathworks.com to the works cited section)
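To make the "binary image" and "morphological operations" definitions a bit more concrete, here is a toy sketch in base R (an illustration only, not tied to any particular image-analysis package) of a dilation, a simple morphological operation in which each pixel is updated based on its 3 x 3 neighborhood:

```r
# A binary "image" is just a matrix of 0s and 1s; dilation sets each pixel to 1
# if any pixel in its 3 x 3 neighborhood is 1.
dilate_binary <- function(img) {
  out <- img
  nr <- nrow(img)
  nc <- ncol(img)
  for (i in seq_len(nr)) {
    for (j in seq_len(nc)) {
      rows <- max(1, i - 1):min(nr, i + 1)
      cols <- max(1, j - 1):min(nc, j + 1)
      out[i, j] <- as.integer(any(img[rows, cols] == 1))
    }
  }
  out
}

img <- matrix(0L, nrow = 7, ncol = 7)
img[4, 4] <- 1L
dilate_binary(img)  # the single "on" pixel grows into a 3 x 3 block
```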
@camronp Here are some suggestions for definitions for Poisson process and spatial point process:

- Poisson process = a mechanism that generates instantaneous events (in time and/or space) based on the Poisson distribution.
- spatial point process = a mechanism that generates a collection of coordinates, or points, randomly located along an underlying mathematical space, with at most one point observed at any location.
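To illustrate those two definitions, here is a minimal sketch of simulating a homogeneous spatial Poisson process on the unit square (the intensity and window here are arbitrary choices, just for illustration):

```r
# Homogeneous spatial Poisson process on a 1 x 1 window: the number of points
# is Poisson with mean lambda * area, and, given that count, the locations are
# independent and uniform over the window (so at most one point per location).
set.seed(1)
lambda   <- 100                        # expected number of points per unit area
n_points <- rpois(1, lambda * 1 * 1)   # area of the unit square is 1
pts <- data.frame(x = runif(n_points), y = runif(n_points))
plot(pts$x, pts$y, asp = 1, pch = 16, xlab = "x", ylab = "y",
     main = "Simulated spatial Poisson process")
```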
@camronp : Great, the post is now up! Could you go back to Quizlet and update that with the latest version of the terms? You can copy the latest tsv from here: https://raw.githubusercontent.com/geanders/csu_msmb/master/content/post/vocab_lists/chapter_11.tsv
Of course! Just did it. I shouldn't need to change the embedding code, right? Thanks! Camron
That’s right. When you embed, it’s just pointing to what’s posted somewhere else.
@camronp : Alright, I know that the title of this pull request is for the vocab, but I think that it's also got your exercise post in it, which is why I've kept it open, and I see that it looks like you've done quite a bit on that. Is that post ready for me to take a look at and add some suggestions for edits?
@geanders I'm embarrassed to admit that I thought I had submitted this to you nearly a month ago! If you don't mind, let's use this as the exercise submission. Again, I apologize for the super late submission and am open to your edits and suggestions on the exercise! Thanks!
@camronp : No, that's my fault! I think that the exercise got incorporated into the vocab pull request, so I just wasn't sure if it was ready for review or not. I'll take a look and add my suggestions!
@camronp : Really nice work on this! I have some suggestions for edits. Most are to add discussion or explanations along the way, particularly in Part A. We'll leave this pull request open, and so once you make your edits and push them to your own GitHub repo, they should automatically come through the pull request as well. Please let me know if you have any questions about any of these comments.
- I noticed that you sometimes use quotation marks around the package name in your `library` call and sometimes don't. Either is fine, but the write-up will be cleaner if you always make the same choice. Given that, it's great that you load all your libraries at the beginning, but could you edit so all those calls either do or don't use quotation marks? (Great job being consistent with your assignment operator (`=`) throughout, though!)
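  For example (both forms are valid R; the point is just to pick one style and use it throughout):

  ```r
  library(ggplot2)     # package name without quotation marks
  library("ggplot2")   # package name with quotation marks -- equally valid
  ```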
- It would be helpful to show the simulated data (both by printing the `simdat` dataframe and through a scatterplot) in Part A. I suggest that, in Part A, you add something like `ggplot(simdat, aes(x = x, y = y, color = class)) + geom_point()` to show how the original `simdat` data is simulated to have four separate groups (and maybe also something like `head(simdat)` to show what this data looks like after you simulate it). The code for simulating the data is a bit dense, with an `lapply` inside an `lapply` and two function definitions inside the object definition code, so I think it would be great to help readers "see" what we end up with after that code, since many probably can't process it in their heads. (A sketch of this kind of quick look follows below.)
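  A rough sketch of what that quick look could involve (an illustration only; the exercise's own simulation code, built with nested `lapply` calls, likely differs in its details):

  ```r
  # Illustration only: simulate 100 points around each two-way combination of
  # 0 and 8, stack the four groups, then show the data with head() and ggplot().
  library(dplyr)
  library(ggplot2)

  set.seed(123)
  centers <- expand.grid(mx = c(0, 8), my = c(0, 8))
  simdat <- bind_rows(lapply(seq_len(nrow(centers)), function(i) {
    data.frame(x = rnorm(100, mean = centers$mx[i]),
               y = rnorm(100, mean = centers$my[i]),
               class = paste0("group_", i))
  }))

  head(simdat)
  ggplot(simdat, aes(x = x, y = y, color = class)) +
    geom_point()
  ```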
- Could you also add some explanation in the text of what is going on in this code? A few questions that it would be helpful to address:
  1. How are the `lapply`s working in the code to create the `simdat` dataset?
  2. What is ultimately in the `simdat` dataset (in terms of what each row and column represent and its dimensions)? And what are the `mx` and `my` in this call? (It looks to me like this code walks through the four possible two-way combinations of 0 and 8---0,0; 0,8; 8,0; and 8,8---and then simulates a dataframe with 100 points with x- and y-values randomly distributed around whichever combo of 0 and 8 it's on. That ends up with four dataframes that get stuck together into one big one with the `bind_rows`. You explain this last bit in the text under the chunk, but it might be helpful for readers to understand the first part, too, of how this code is walking through the 0-8 combos and generating random datasets around each of the four combos, giving random data with four clear clusters.)
  3. What is the `wss` dataframe? It looks like you set it up as a one-column dataframe with blank (`NA`) values, and then the code is adding in some values as it processes---it would be helpful to explain what we're trying to get in this dataframe through the code and how we're calculating it. It looks like it's maybe giving the within-cluster sum of squares from running k-means clustering on this random data, moving from using one group (`k = 1`) up to eight (`k = 8`)? Any ideas on why things keep looking better as you keep adding to the number of clusters (`k`)? Ideally, if we know the data have four clusters (and we do, because we made it that way), it seems like we would want some measure that helps us see that 4 is the "right" number of clusters to pick when clustering the data, wouldn't it?
  4. Why do we need to calculate `value` in the first row of `wss` differently than we calculate this value in the other rows? In the first row, we're calculating it as `sum(scale(simdatxy, scale = FALSE)^2)`, but then for all the others we're running a `kmeans` and then pulling out the `withinss` value for all the observations and summing it.
  5. Could you add a bit more on the overall aim of the code in this chunk? It looks like we're maybe trying to see how the within-cluster sum of squares, estimated for a dataset that's made up of four clusters of 100 observations each, changes as you increase the number of clusters from 1 to 8 when clustering using k-means. I think it would be helpful for readers to understand that this is what we're aiming for, and that the results of this are what's being shown in the first plot generated by this chunk of code. Also, is there some relationship between the within-cluster sum of squares that you show in the first plot here and the silhouette index that you calculate next? In general, could you clarify why we're checking out this first plot before we move on to the silhouette index? (A sketch of how this sum-of-squares table might be built follows below.)
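  An illustrative sketch of the logic described above (not the exercise's exact code):

  ```r
  # Collect the total within-cluster sum of squares for k = 1 through 8,
  # using only the x and y columns of the simulated data.
  simdatxy <- simdat[, c("x", "y")]

  wss <- data.frame(k = 1:8, value = NA_real_)
  # With k = 1 there is a single "cluster", so the value is the total sum of
  # squares around the overall centroid (scale() here only centers the columns).
  wss$value[1] <- sum(scale(simdatxy, scale = FALSE)^2)
  for (k in 2:8) {
    wss$value[k] <- sum(kmeans(simdatxy, centers = k)$withinss)
  }

  # The curve keeps dropping as k grows, which is part of why it is hard to
  # read off "4 clusters" from this plot alone.
  plot(wss$k, wss$value, type = "b",
       xlab = "number of clusters (k)", ylab = "within-cluster sum of squares")
  ```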
- Could you add a bit of explanation of what `pam` is doing? (I think it's doing pretty much the same thing as the `kmeans` call from the earlier code chunk, but maybe using the `cluster` package's algorithm for k-means clustering? Maybe you have to use it here because the `silhouette` function can only be run on objects of the type output by `pam`? I'm not sure, but some follow-up and explanation would be really helpful here.)
- Could you set `fig.width` and `fig.height` in the code chunk options where you're making these plots to make sure that they don't show up as blank?
- It would also help to add some interpretation of the silhouette plot. I think the `n =` under the title is saying how many observations are in the original data (400 points, in this case). On the right side, they're telling us how many clusters were used in the clustering algorithm, and I think it's also maybe saying how many of the observations are being put in each of these four clusters (for example, 103 observations were assigned to cluster 1)? Then maybe there's a number giving the average silhouette index value for the points assigned to that cluster? The helpfile for the `plot` method for `silhouette` (`?plot.silhouette`) might help some in figuring out how these plots should be interpreted. (That helpfile has this quote, which seems helpful: "Observations with a large s(i) (almost 1) are very well clustered, a small s(i) (around 0) means that the observation lies between two clusters, and observations with a negative s(i) are probably placed in the wrong cluster.") In this plot in Part A, it looks like there are a couple of points with negative values of s (their bars go the opposite direction from all the others in this silhouette index plot), but then most are pretty easy to distinguish as belonging to the cluster where they were assigned (silhouette values that are positive and pretty far from 0). (A minimal sketch of how these pieces fit together follows below.)
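  A minimal sketch using the `cluster` package (the exercise's exact calls may differ):

  ```r
  # Illustration only: run PAM with four clusters on the x/y data, then compute
  # a silhouette width for every observation and draw the silhouette plot.
  library(cluster)

  simdatxy <- simdat[, c("x", "y")]   # just the coordinate columns of simdat
  pam4 <- pam(simdatxy, k = 4)        # partition the 400 points into 4 clusters
  sil  <- silhouette(pam4)            # per-observation silhouette widths
  summary(sil)$avg.width              # average silhouette width across all points
  plot(sil)                           # the silhouette plot discussed above
  ```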
- It might also be nice to make a plot of the original data showing, for each point, (1) the cluster it was assigned to in the `pam` k-means clustering and (2) its silhouette index, helping to show how well it looks like it belongs to that cluster. One way of doing that might be:

  ```r
  sil %>%
    unclass() %>%
    as.data.frame() %>%
    tibble::rownames_to_column(var = "orig_order") %>%
    arrange(as.numeric(orig_order)) %>%
    bind_cols(simdat) %>%
    ggplot(aes(x = x, y = y, shape = as.factor(cluster), color = sil_width)) +
    geom_point() +
    facet_wrap(~ class)
  ```

  In this, there's a facet for each of the "true" groupings of the points (the four groups you originally simulated). Then, the shape shows the group each point was assigned to by k-means (with the `pam` call). For the most part, all the points that were part of a true original group are assigned to the same cluster, although occasionally ones along the border with another cluster are mis-assigned. For the silhouette index, you can see from this that it gets close to zero (or even negative) when you get close to those borders between the original groups (around where x and y equal 4--5), but then it is nice and high in the middle of the x-y space for a cluster and on edges that are away from any other cluster.
- One more place where a little clarification would help is the section where you run `sil = silhouette(pam4, 4, border = NA)`. In this section, when you run `summary(sil)$avg.width`, I think you're getting the average silhouette index across all the observations in the dataset, right? So this might be a metric you could use to summarize things at the level of the full clustering process, but then you're also getting some information for each specific observation that gives you an idea of how well it looks like each one belongs to the cluster it was assigned to. (Or are the calculations for each observation the silhouette distances, and then this observation-wide summary the silhouette index? If so, it would be helpful to clarify that terminology for the readers.)

@geanders Thank you so much for all of your suggestions and edits. I spent quite a bit of time changing and improving the exercise 5 document with your recommendations. I just pushed it up to my repository and it looks like it made it into the pull request.
@camronp : Really, really nice work on these revisions! It's now live here: https://kind-neumann-789611.netlify.app/post/exercise-solution-for-5-1/
Here is the vocab 11. I'm not sure if the "2020-04-13-vocabulary-for-chapter-11.Rmd" is in the proper folder or not. Let me know if you have any issues! There were also a lot of things that got committed that I'm not entirely sure where they came from... but hopefully it makes sense to you!