GIS4DEV / GIS4DEV.github.io

Open Source GIScience & GIS for Development

paired_words #10

Closed: chriskgernon closed this issue 4 years ago

chriskgernon commented 4 years ago

I am going over the R code and I cannot figure out where "paired_words" comes from in the unnest_tokens() function or what it does.

` winterWordPairs <- winterTweetsGeo %>%
  select(text) %>%
  mutate(text = removeWords(text, stop_words$word)) %>%
  unnest_tokens(paired_words, text, token = "ngrams", n = 2)

winterWordPairs <- separate(winterWordPairs, paired_words, c("word1", "word2"), sep = " ")
winterWordPairs <- winterWordPairs %>% count(word1, word2, sort = TRUE)

# graph a word cloud with space indicating association. you may change the filter to filter more or less than pairs with 10 instances
winterWordPairs %>%
  filter(n >= 5) %>% # we changed this to 2, rather than 15
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(edge_alpha = n, edge_width = n)) +
  geom_node_point(color = "darkslategray4", size = 3) +
  geom_node_text(aes(label = name), vjust = 1.8, size = 3) +
  labs(title = "Word Network: Tweets during the 2013 Colorado Flood Event",
       subtitle = "September 2013 - Text mining twitter data ",
       x = "", y = "") +
  theme_void()`

josephholler commented 4 years ago

The confusion probably comes from R's pipe operator, %>%. When you pipe one function into another with %>%, the result of the first function becomes the first argument of the next. In this specific example, unnest_tokens() takes this form:

unnest_tokens(tbl, output, input, token = "words", format = c("text",
  "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE,
  collapse = NULL, ...)

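Because unnest_tokens() is used here in a pipe, its first parameter (tbl, a table) is actually the output of the previous function, which in this case is the mutate() call that removes stop words. That shifts the arguments you see by one: paired_words fills the output parameter, i.e. the name of the new column that will hold each two-word phrase, and text is the input column being split. The name is arbitrary, so paired_words does not come from anywhere else in the code.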
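To see what that second argument does, here is a minimal sketch on a made-up one-row table (the `toyTweets` data and the alternative column name `bigram` are my own, not from the lab code, and the stop-word step is left out to keep it short). It assumes the dplyr and tidytext packages are installed.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data standing in for winterTweetsGeo
toyTweets <- tibble(text = "heavy rain floods boulder creek")

# Same call shape as in the lab: the table arrives through the pipe,
# so the first visible argument is the *output* column name
pairs <- toyTweets %>%
  unnest_tokens(paired_words, text, token = "ngrams", n = 2)

names(pairs)        # "paired_words"
pairs$paired_words  # "heavy rain" "rain floods" "floods boulder" "boulder creek"

# Any other name works the same way; only the column label changes
pairs2 <- toyTweets %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)

names(pairs2)       # "bigram"
```

So the later separate() call refers to paired_words only because that is the name chosen here.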

You could also write that block of functions somewhat like this:

unnest_tokens(
  mutate(select(winterTweetsGeo, text),
         text = removeWords(text, stop_words$word)),
  paired_words, text, token = "ngrams", n = 2)
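If you want to convince yourself that the piped and nested forms really are the same thing, a quick check along these lines should work. This is only a sketch: `toyTweets` is invented data, and I have dropped the removeWords() step so it needs only dplyr and tidytext.

```r
library(dplyr)
library(tidytext)

# Hypothetical toy data with an extra column, so select() has something to drop
toyTweets <- tibble(
  id   = 1:2,
  text = c("flood waters rising fast", "stay safe boulder")
)

# Piped form: each result becomes the first argument of the next call
piped <- toyTweets %>%
  select(text) %>%
  unnest_tokens(paired_words, text, token = "ngrams", n = 2)

# Nested form: the same calls written inside-out
nested <- unnest_tokens(select(toyTweets, text),
                        paired_words, text, token = "ngrams", n = 2)

# Both versions produce the same bigrams in the same order
identical(piped$paired_words, nested$paired_words)
```

The nested version gets hard to read quickly as more steps are added, which is the main reason the lab code uses the pipe.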