bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Improvement of word network visualisation? #23

Closed rdatasculptor closed 6 years ago

rdatasculptor commented 6 years ago

First of all: thank you for making udpipe available in R! It's a great package.

I was looking at your example network visualisations and was wondering whether they could be improved by showing not only different edge sizes but also different word (node) sizes, where the size of a node depends on the sum of the sizes of all edges linked to it. In my opinion this would result in a wordcloud 2.0. I am curious what you think about this.

jwijffels commented 6 years ago

Yes, I completely agree on that. I have such a function locally available as part of another package (which is not distributed however). I believe plotting functionalities should be put into another R package as plotting functionalities are not udpipe specific but should work for other text mining R packages also.

rdatasculptor commented 6 years ago

Nice to hear you agree. I also agree it shouldn't be part of udpipe itself. I think it is possible with ggraph, the package you use in your example visualisations, but I haven't yet figured out how to implement it in, e.g., the cooccurrence visualisation. Do you have any plans to use your function in your visualisations?

jwijffels commented 6 years ago

The tidygraph package has some measures of centrality, as does the igraph package, if you want to obtain that. Or you can just start at the help of ?geom_node_text to specify the size of each node. Currently I have no plans to include such plots inside the package. I was even thinking of using graph packages other than ggraph (https://github.com/iankloo/sigmaNet in particular), but it's low on my to-do list. Feel free to contribute if you have time available.
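
As a rough illustration of the centrality route, the sum of the edge sizes attached to each node is what igraph calls the node's strength (weighted degree). The sketch below is only an assumption about how this could be wired up: the edge table is made-up toy data that borrows the column names term1, term2 and cooc, and `nodesize` is just an illustrative attribute name.

```r
library(igraph)

# Toy cooccurrence table (made-up numbers; column names are assumed to
# mirror a cooccurrence data frame: term1, term2, cooc)
edges <- data.frame(
  term1 = c("casa", "casa", "barrio"),
  term2 = c("bonito", "barrio", "tranquilo"),
  cooc  = c(5, 3, 2),
  stringsAsFactors = FALSE
)

# Build an undirected graph; extra columns such as cooc become edge attributes
g <- graph_from_data_frame(edges, directed = FALSE)

# Weighted degree: for each node, the sum of cooc over the edges attached to it
V(g)$nodesize <- strength(g, weights = E(g)$cooc)

# Named vector of node sizes, for inspection
node_sizes <- setNames(V(g)$nodesize, V(g)$name)
print(node_sizes)
```

The resulting `nodesize` vertex attribute could then be mapped to `size` inside `aes()` of `geom_node_text()`, so that words with heavier connections are drawn larger.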

rdatasculptor commented 6 years ago

I will give it a try when I have time! That will not be before next week.

jwijffels commented 6 years ago

Here is some inspiration. Feel free to report back what you come up with.

library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")

ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)

cooc <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")), 
                     term = "lemma", 
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)

# Node size: overall frequency of each lemma among nouns and adjectives
nodes <- txt_freq(subset(x, upos %in% c("NOUN", "ADJ"))$lemma)
nodes$name <- nodes$key
nodes$nodesize <- nodes$freq

library(igraph)
library(ggraph)
library(ggplot2)
# Keep the first 30 rows of cooc and attach the node sizes as vertex attributes
wordnetwork <- head(cooc, 30)
wordnetwork <- graph_from_data_frame(wordnetwork, 
                                     vertices=subset(nodes, name %in% c(wordnetwork$term1, wordnetwork$term2)))
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
  geom_node_text(aes(label = name, size = nodesize), col = "darkgreen") +
  theme_graph(base_family = "Arial Narrow") +
  theme(legend.position = "none") +
  labs(title = "Cooccurrences within sentence", subtitle = "Nouns & Adjective")

rdatasculptor commented 6 years ago

Thank you for this inspiration! I think this code is at least 90% of what I had in mind. On Monday or Tuesday I will be back at my laptop and will try to work on this code. I will let you know the result.

benmolineaux commented 6 years ago

Down in 5

rdatasculptor commented 6 years ago

I think the code is very good. I altered it a little to get a node size that (in my opinion) fits the network selection better. This way the node size is based only on the selection of the top 30 cooccurrences: each word's size is the sum of the cooc values of the edges it appears in.

library(udpipe)
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "es")

ud_model <- udpipe_download_model(language = "spanish")
ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)

cooc <- cooccurrence(x = subset(x, upos %in% c("NOUN", "ADJ")), 
                     term = "lemma", 
                     group = c("doc_id", "paragraph_id", "sentence_id"))
head(cooc)

#nodes <- txt_freq(subset(x, upos %in% c("NOUN", "ADJ"))$lemma)
#nodes$name <- nodes$key
#nodes$nodesize <- nodes$freq

library(igraph)
library(ggraph)
library(ggplot2)
library(dplyr)
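wordnetwork <- head(cooc, 30)
# Each edge contributes its cooc count to both of its end points
nodes1 <- data.frame(name=wordnetwork$term1,freq=wordnetwork$cooc)
nodes2 <- data.frame(name=wordnetwork$term2,freq=wordnetwork$cooc)
# Node size: sum of the cooc counts of all selected edges touching the word
nodes <- group_by(bind_rows(nodes1,nodes2),name)
nodes <- summarise(nodes, nodesize=sum(freq))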

wordnetwork <- graph_from_data_frame(wordnetwork, 
                                     vertices=subset(nodes, name %in% c(wordnetwork$term1, wordnetwork$term2)))
ggraph(wordnetwork, layout = "fr") +
  geom_edge_link(aes(width = cooc, edge_alpha = cooc), edge_colour = "pink") +
  geom_node_text(aes(label = name, size = nodesize), col = "darkgreen",repel=TRUE) +
  theme_graph(base_family = "Arial Narrow") +
  theme(legend.position = "none") +
  labs(title = "Cooccurrences within sentence", subtitle = "Nouns & Adjective")

jwijffels commented 6 years ago

Thanks for the feedback on what you've come up with.