bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

how to use the dependency parsing? #11

Closed randomgambit closed 6 years ago

randomgambit commented 6 years ago

Hello,

Thanks again for this great lightweight package.

Something I don't quite get is how to obtain the dependency tree with it. For instance, consider this simple example:

library(udpipe)
dl <- udpipe_download_model(language = "english")
udmodel_en <- udpipe_load_model(file = "english-ud-2.0-170801.udpipe")

x <- udpipe_annotate(udmodel_en, 
                     x = "the economy is weak but the outloook is bright")
as.data.frame(x)

> as.data.frame(x)
  doc_id paragraph_id sentence_id                                       sentence token_id    token    lemma  upos xpos
1   doc1            1           1 the economy is weak but the outloook is bright        1      the      the   DET   DT
2   doc1            1           1 the economy is weak but the outloook is bright        2  economy  economy  NOUN   NN
3   doc1            1           1 the economy is weak but the outloook is bright        3       is       be   AUX  VBZ
4   doc1            1           1 the economy is weak but the outloook is bright        4     weak     weak   ADJ   JJ
5   doc1            1           1 the economy is weak but the outloook is bright        5      but      but CCONJ   CC
6   doc1            1           1 the economy is weak but the outloook is bright        6      the      the   DET   DT
7   doc1            1           1 the economy is weak but the outloook is bright        7 outloook outloook  NOUN   NN
8   doc1            1           1 the economy is weak but the outloook is bright        8       is       be   AUX  VBZ
9   doc1            1           1 the economy is weak but the outloook is bright        9   bright   bright   ADJ   JJ
                                                  feats head_token_id dep_rel deps            misc
1                             Definite=Def|PronType=Art             2     det <NA>            <NA>
2                                           Number=Sing             4   nsubj <NA>            <NA>
3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             4     cop <NA>            <NA>
4                                            Degree=Pos             0    root <NA>            <NA>
5                                                  <NA>             9      cc <NA>            <NA>
6                             Definite=Def|PronType=Art             7     det <NA>            <NA>
7                                           Number=Sing             9   nsubj <NA>            <NA>
8 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin             9     cop <NA>            <NA>
9                                            Degree=Pos             4    conj <NA> SpacesAfter=\\n

From this output I do not see how I can associate weak with economy and bright with outlook. Am I missing something with this package?

thanks!!

jwijffels commented 6 years ago

Tokens are linked to each other by token_id and head_token_id. The type of dependency relationship, indicating how the words are linked, is given in dep_rel.

See http://universaldependencies.org/guidelines.html for details on the output and all possible values of upos/xpos/feats and dep_rel

So for your example it shows that economy (token_id 2) has head_token_id 4 and dep_rel nsubj, meaning economy is the nominal subject of weak.

A similar comment holds for bright and outloook: you'll see in the table that outloook is the nominal subject of bright.

If you want to visualise this, you can easily use the igraph R package to plot the word network that follows from these links.

So this means that with the dependency parsing output you can easily answer questions like 'What is bright?' (answer: outloook) and 'What is weak?' (answer: economy).
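To make that link explicit in code, below is a minimal sketch (added for illustration, not part of the original reply) that merges the annotation data.frame with itself so each token sits next to the word its head_token_id points to; it reuses the udmodel_en object loaded in the first snippet.

library(udpipe)
x <- as.data.frame(udpipe_annotate(udmodel_en,
                                   x = "the economy is weak but the outloook is bright"))
## look-up table of token_id -> token, renamed so it can be joined on head_token_id
heads <- x[, c("doc_id", "sentence_id", "token_id", "token")]
names(heads) <- c("doc_id", "sentence_id", "head_token_id", "head_token")
## attach the head word to every token (the root, with head_token_id 0, drops out)
linked <- merge(x, heads, by = c("doc_id", "sentence_id", "head_token_id"))
subset(linked, dep_rel == "nsubj", select = c("token", "dep_rel", "head_token"))
## shows: economy is the nsubj of weak, outloook is the nsubj of bright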

randomgambit commented 6 years ago

thanks! extremely helpful! This package is amazing!!

Is there a tutorial somewhere to use the igraph package with udpipe?

Also, I remember you wanted to do some performance comparisons with spacy and the other competitors. Have you had time to do that?

Thanks!

jwijffels commented 6 years ago

To my knowledge there is no such tutorial. I was hoping the NLP R community would build upon the output to generate whatever they have in mind, using the wealth of existing R packages.

About spacy: I made a comparison between spacy and udpipe yesterday here: https://github.com/jwijffels/udpipe-spacy-comparison - comparing mainly accuracy.

randomgambit commented 6 years ago

OK thanks! Will have a look shortly

jwijffels commented 6 years ago

FYI, below are some examples of how to put the dependency network in a graph.

library(udpipe)
udpipe_download_model("english")
m <- udpipe_load_model("english-ud-2.0-170801.udpipe")
x <- udpipe_annotate(m, "The economy is weak but the outlook is bright")
x <- as.data.frame(x)
library(igraph)
## one edge from each token to its head; the root (head_token_id 0) is left out
edges <- subset(x, head_token_id != 0, select = c("token_id", "head_token_id", "dep_rel"))
edges$label <- edges$dep_rel
## vertices are the tokens themselves, carrying their annotations as attributes
g <- graph_from_data_frame(edges,
                           vertices = x[, c("token_id", "token", "lemma", "upos", "xpos", "feats")], 
                           directed = TRUE)
plot(g, vertex.label = x$token)

[dep_example1: igraph plot of the dependency network]

library(ggraph)
library(ggplot2)
## same graph g as above, drawn with ggraph; edge labels show the dependency relation
ggraph(g, layout = "fr") +
  geom_edge_link(aes(label = dep_rel), arrow = arrow(length = unit(4, 'mm')), end_cap = circle(3, 'mm')) + 
  geom_node_point(color = "lightblue", size = 5) +
  theme_void(base_family = "") +
  geom_node_text(ggplot2::aes(label = token), vjust = 1.8) +
  ggtitle("Showing dependencies")

More graphical visualisations here: https://www.data-imaginist.com/2017/ggraph-introduction-edges/

[dep_example2: ggraph plot of the dependency network]

randomgambit commented 6 years ago

really amazing, thanks!!!

arademaker commented 6 years ago

I suspect you can get much better results with graphviz, see https://eli.thegreenplace.net/2009/11/23/visualizing-binary-trees-with-graphviz
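To follow up on that suggestion, here is a rough sketch (added for illustration, not from this thread) that writes the parse as a Graphviz DOT file, reusing the model m and sentence from the igraph example above; the resulting file can then be rendered with the dot command line tool.

x <- as.data.frame(udpipe_annotate(m, "The economy is weak but the outlook is bright"))
## one DOT node per token; token_id is only unique within a sentence,
## so this sketch assumes a single-sentence text
nodes <- sprintf('  n%s [label="%s"];', x$token_id, x$token)
## one labelled arc from each head to its dependent, skipping the root
deps  <- subset(x, head_token_id != 0)
arcs  <- sprintf('  n%s -> n%s [label="%s"];', deps$head_token_id, deps$token_id, deps$dep_rel)
writeLines(c("digraph dependencies {", nodes, arcs, "}"), con = "parse.dot")
## render with e.g.: dot -Tpng parse.dot -o parse.png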

randomgambit commented 6 years ago

@jwijffels I guess one big-picture question is: which textbooks/books/resources do you recommend to get the most out of this package (and out of NLP in general)? The resources on http://universaldependencies.org/u/dep/index.html are ... very light

jwijffels commented 6 years ago

I believe the documentation at universaldependencies.org is rather good. The question you really have is: what can you do with the output? Let me list some elements which I can directly come up with.

Use cases of pos tagging & lemmatisation
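As one minimal illustration of this (a sketch added here, not part of the original list), lemma frequencies of the nouns and adjectives in the annotated output give a quick keyword summary, reusing the model m loaded above.

x <- as.data.frame(udpipe_annotate(m, "The economy is weak but the outlook is bright"))
## keep only nouns and adjectives and count how often each lemma occurs,
## e.g. as input for word clouds or simple keyword lists
keywords <- subset(x, upos %in% c("NOUN", "ADJ"))
sort(table(keywords$lemma), decreasing = TRUE)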

Use cases of dependency parsing

If you have other ideas about what can be done with the annotation results, feel free to add them. About courses: follow the course at https://lstat.kuleuven.be/training/coursedescriptions/text-mining-with-r; it's given by me, so that answer is opinionated. About books: it's hard to find good ones on the topic of text mining.

randomgambit commented 6 years ago

thanks @jwijffels! very clear - I was hoping that your course was free material though :) hehe