bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

how to use the dependency parsing? #11

Closed randomgambit closed 6 years ago

randomgambit commented 6 years ago

Hello,

Thanks again for this great lightweight package.

Something I don't quite get is how to obtain the dependency tree with it. For instance, consider this simple example:

library(udpipe)
dl <- udpipe_download_model(language = "english")
udmodel_en <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")

x <- udpipe_annotate(udmodel_en, 
                     x = "the economy is weak but the outloook is bright")
as.data.frame(x)

> as.data.frame(x)
  doc_id paragraph_id sentence_id                                       sentence token_id    token    lemma  upos xpos
1   doc1            1           1 the economy is weak but the outloook is bright        1      the      the   DET   DT
2   doc1            1           1 the economy is weak but the outloook is bright        2  economy  economy  NOUN   NN
3   doc1            1           1 the economy is weak but the outloook is bright        3       is       be   AUX  VBZ
4   doc1            1           1 the economy is weak but the outloook is bright        4     weak     weak   ADJ   JJ
5   doc1            1           1 the economy is weak but the outloook is bright        5      but      but CCONJ   CC
6   doc1            1           1 the economy is weak but the outloook is bright        6      the      the   DET   DT
7   doc1            1           1 the economy is weak but the outloook is bright        7 outloook outloook  NOUN   NN
8   doc1            1           1 the economy is weak but the outloook is bright        8       is       be   AUX  VBZ
9   doc1            1           1 the economy is weak but the outloook is bright        9   bright   bright   ADJ   JJ
                                                  feats head_token_id dep_rel deps            misc
1                             Definite=Def|PronType=Art             2     det <NA>            <NA>
2                                           Number=Sing             4   nsubj <NA>            <NA>
3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4     cop <NA>            <NA>
4                                            Degree=Pos             0    root <NA>            <NA>
5                                                  <NA>             9      cc <NA>            <NA>
6                             Definite=Def|PronType=Art             7     det <NA>            <NA>
7                                           Number=Sing             9   nsubj <NA>            <NA>
8 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             9     cop <NA>            <NA>
9                                            Degree=Pos             4    conj <NA> SpacesAfter=\\n

From this output I do not see how I can associate weak with economy and bright with outlook. Am I missing something with this package?

thanks!!

jwijffels commented 6 years ago

Tokens are linked to each other by token_id and head_token_id. The type of dependency relationship, indicating how the words are linked, is given in dep_rel.

See http://universaldependencies.org/guidelines.html for details on the output and all possible values of upos/xpos/feats and dep_rel

So for your example it shows that economy (token_id 2) has head_token_id 4 and dep_rel nsubj, meaning economy is the nominal subject of weak.

A similar comment holds for bright and outloook: you'll see in the table that outloook is the nominal subject of bright.

If you want to visualise this, you can easily use the igraph R package to plot the word network that follows from these links.

So this means that with the dependency parsing output you can easily answer questions like 'What is bright?' (answer: outloook) and 'What is weak?' (answer: economy).
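To make that link explicit in code, below is a minimal sketch (added for illustration, not part of the original reply) that merges the annotation data.frame with itself so each token sits next to the word its head_token_id points to; it reuses the udmodel_en object loaded in the first snippet.

library(udpipe)
x <- as.data.frame(udpipe_annotate(udmodel_en,
                                   x = "the economy is weak but the outloook is bright"))
## look-up table of token_id -> token, renamed so it can be joined on head_token_id
heads <- x[, c("doc_id", "sentence_id", "token_id", "token")]
names(heads) <- c("doc_id", "sentence_id", "head_token_id", "head_token")
## attach the head word to every token (the root, with head_token_id 0, drops out)
linked <- merge(x, heads, by = c("doc_id", "sentence_id", "head_token_id"))
subset(linked, dep_rel == "nsubj", select = c("token", "dep_rel", "head_token"))
## shows: economy is the nsubj of weak, outloook is the nsubj of bright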

randomgambit commented 6 years ago

thanks! extremely helpful! This package is amazing!!

Is there a tutorial somewhere to use the igraph package with udpipe?

Also, I remember you wanted to do some performance comparisons with spacy and the other competitors. Have you had time to do that?

Thanks!

jwijffels commented 6 years ago

To my knowledge there is no such tutorial. I was hoping the NLP R community would build upon the output to generate whatever they have in mind, using the wealth of existing R packages.

About spacy: I made a comparison between spacy and udpipe yesterday here: https://github.com/jwijffels/udpipe-spacy-comparison - comparing mainly accuracy.

randomgambit commented 6 years ago

OK thanks! Will have a look shortly

jwijffels commented 6 years ago

FYI, below are some examples of how to put the dependency network in a graph.

library(udpipe)
udpipe_download_model("english")
m <- udpipe_load_model("english-ud-2.0-170801.udpipe")
x <- udpipe_annotate(m, "The economy is weak but the outlook is bright")
x <- as.data.frame(x)
library(igraph)
## one edge from each token to its head; the root (head_token_id 0) is left out
edges <- subset(x, head_token_id != 0, select = c("token_id", "head_token_id", "dep_rel"))
edges$label <- edges$dep_rel
## vertices are the tokens themselves, carrying their annotations as attributes
g <- graph_from_data_frame(edges,
                           vertices = x[, c("token_id", "token", "lemma", "upos", "xpos", "feats")], 
                           directed = TRUE)
plot(g, vertex.label = x$token)

[dep_example1: igraph plot of the dependency network]

library(ggraph)
library(ggplot2)
## same graph g as above, drawn with ggraph; edge labels show the dependency relation
ggraph(g, layout = "fr") +
  geom_edge_link(aes(label = dep_rel), arrow = arrow(length = unit(4, 'mm')), end_cap = circle(3, 'mm')) + 
  geom_node_point(color = "lightblue", size = 5) +
  theme_void(base_family = "") +
  geom_node_text(ggplot2::aes(label = token), vjust = 1.8) +
  ggtitle("Showing dependencies")

More graphical visualisations here: https://www.data-imaginist.com/2017/ggraph-introduction-edges/

[dep_example2: ggraph plot of the dependency network]

randomgambit commented 6 years ago

really amazing, thanks!!!

arademaker commented 6 years ago

I suspect you can get much better results with graphviz, see https://eli.thegreenplace.net/2009/11/23/visualizing-binary-trees-with-graphviz
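To follow up on that suggestion, here is a rough sketch (added for illustration, not from this thread) that writes the parse as a Graphviz DOT file, reusing the model m and sentence from the igraph example above; the resulting file can then be rendered with the dot command line tool.

x <- as.data.frame(udpipe_annotate(m, "The economy is weak but the outlook is bright"))
## one DOT node per token; token_id is only unique within a sentence,
## so this sketch assumes a single-sentence text
nodes <- sprintf('  n%s [label="%s"];', x$token_id, x$token)
## one labelled arc from each head to its dependent, skipping the root
deps  <- subset(x, head_token_id != 0)
arcs  <- sprintf('  n%s -> n%s [label="%s"];', deps$head_token_id, deps$token_id, deps$dep_rel)
writeLines(c("digraph dependencies {", nodes, arcs, "}"), con = "parse.dot")
## render with e.g.: dot -Tpng parse.dot -o parse.png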

randomgambit commented 6 years ago

@jwijffels I guess one big-picture question is: which textbooks/books/resources do you recommend to get the most out of this package (and out of NLP in general)? The resources on http://universaldependencies.org/u/dep/index.html are ... very light

jwijffels commented 6 years ago

I believe the documentation at universaldependencies.org is rather good. The question you really have is: what can you do with the output? Let me list some elements which I can directly come up with.

Use cases of pos tagging & lemmatisation
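As one minimal illustration of this (a sketch added here, not part of the original list), lemma frequencies of the nouns and adjectives in the annotated output give a quick keyword summary, reusing the model m loaded above.

x <- as.data.frame(udpipe_annotate(m, "The economy is weak but the outlook is bright"))
## keep only nouns and adjectives and count how often each lemma occurs,
## e.g. as input for word clouds or simple keyword lists
keywords <- subset(x, upos %in% c("NOUN", "ADJ"))
sort(table(keywords$lemma), decreasing = TRUE)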

Use cases of dependency parsing

If you have other ideas about what can be done with the annotation results, feel free to add them. About courses: follow the course at https://lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r; it's given by me, so that answer is opinionated. About books: it's hard to find good ones on the topic of text mining.

randomgambit commented 6 years ago

thanks @jwijffels! very clear - I was hoping that your course was free material though :) hehe