bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

comparing noun chunks with Spacy #36

Closed randomgambit closed 5 years ago

randomgambit commented 5 years ago

Hello,

I am trying to extract noun chunks using spaCy and udpipe, and I am starting to realize how much easier udpipe is to use.

However, I was not able to replicate the noun chunk extraction that I get using spaCy:


import spacy

nlp = spacy.load('en_core_web_sm')
doc = nlp(u"Autonomous cars shift insurance liability toward manufacturers")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

Autonomous cars cars nsubj shift
insurance liability liability dobj shift
manufacturers manufacturers pobj toward

Can we get these noun chunks with udpipe as well?

Thanks!

jwijffels commented 5 years ago

I assume you have already tried udpipe::keywords_phrases and are looking for more? In that case you need to use the dependency parsing output. Each word points to its head with a certain relationship, as defined by http://universaldependencies.org/u/dep/index.html. Use the columns dep_rel, head_token_id and token_id to get combinations of these relationships. The R package rsyntax (https://github.com/vanatteveldt/rsyntax) shows ways to extract such things, and it wraps udpipe.
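
To make the head-pointer idea concrete, here is a minimal sketch in Python (to match the spaCy snippet above), not a definitive implementation. The rows are a hand-written, assumed annotation of the example sentence, not real udpipe output; only the column names token_id, head_token_id, dep_rel and upos come from udpipe. The noun_chunks helper and the CHUNK_RELS set are hypothetical, and ids are plain integers for simplicity.

```python
# Hand-written, ASSUMED Universal Dependencies parse (not real udpipe output).
ROWS = [
    {"token_id": 1, "token": "Autonomous", "upos": "ADJ", "head_token_id": 2, "dep_rel": "amod"},
    {"token_id": 2, "token": "cars", "upos": "NOUN", "head_token_id": 3, "dep_rel": "nsubj"},
    {"token_id": 3, "token": "shift", "upos": "VERB", "head_token_id": 0, "dep_rel": "root"},
    {"token_id": 4, "token": "insurance", "upos": "NOUN", "head_token_id": 5, "dep_rel": "compound"},
    {"token_id": 5, "token": "liability", "upos": "NOUN", "head_token_id": 3, "dep_rel": "obj"},
    {"token_id": 6, "token": "toward", "upos": "ADP", "head_token_id": 7, "dep_rel": "case"},
    {"token_id": 7, "token": "manufacturers", "upos": "NOUN", "head_token_id": 3, "dep_rel": "obl"},
]

# Hypothetical choice of relations that attach a modifier to its noun.
CHUNK_RELS = {"det", "amod", "compound", "nummod", "nmod:poss"}
NOMINAL_UPOS = {"NOUN", "PROPN", "PRON"}


def noun_chunks(rows):
    """Return (chunk text, root token, root dep_rel, head token) per noun root."""
    by_id = {r["token_id"]: r for r in rows}
    children = {}
    for r in rows:
        children.setdefault(r["head_token_id"], []).append(r["token_id"])

    def subtree(token_id):
        # The token itself plus everything that depends on it, recursively.
        ids = [token_id]
        for child in children.get(token_id, []):
            ids.extend(subtree(child))
        return ids

    chunks = []
    for r in rows:
        # Skip non-nominals and nouns that only modify another noun.
        if r["upos"] not in NOMINAL_UPOS or r["dep_rel"] in CHUNK_RELS:
            continue
        ids = [r["token_id"]]
        for child in children.get(r["token_id"], []):
            # Keep only modifiers that stand to the left of the noun root.
            if child < r["token_id"] and by_id[child]["dep_rel"] in CHUNK_RELS:
                ids.extend(subtree(child))
        text = " ".join(by_id[i]["token"] for i in sorted(ids))
        head = by_id.get(r["head_token_id"])
        chunks.append((text, r["token"], r["dep_rel"], head["token"] if head else "ROOT"))
    return chunks


for chunk in noun_chunks(ROWS):
    print(*chunk)
```

With this assumed parse the sketch yields the same three spans as spaCy ("Autonomous cars", "insurance liability", "manufacturers"), but with Universal Dependencies labels: obj and obl where spaCy's English models print dobj and pobj, and shift as the head of manufacturers, because UD attaches the preposition to the noun as case.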

randomgambit commented 5 years ago

Thanks! I have found the rsyntax repo and asked for clarification about its compatibility with udpipe: https://github.com/vanatteveldt/rsyntax/issues/3

randomgambit commented 5 years ago

@jwijffels on a side note, shouldn't keywords_phrases also return the sentence_id or some other identifier? Right now, it is difficult to map its output back to the original data frame. Is that on purpose?

jwijffels commented 5 years ago

@randomgambit about keywords_phrases, you can use it in a group-by fashion using data.table as in

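library(udpipe)
library(data.table)
# x is assumed to be the udpipe annotation as a data.frame,
# with a phrase_tag column created by as_phrasemachine()
x <- setDT(x)
# run keywords_phrases per sentence so each phrase keeps its identifiers
x <- x[, keywords_phrases(phrase_tag, token, pattern = "(A|N)+N(P+D*(A|N)*N)*", is_regex = TRUE, ngram_max = 4, detailed = TRUE),
       by = list(doc_id, paragraph_id, sentence_id)]

randomgambit commented 5 years ago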

@jwijffels who uses data.table in 2018?? :) Here is a tidyverse solution based on your nice suggestion.

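library(udpipe)
library(dplyr)
# assumes the English model still has to be downloaded and loaded
udmodel_en <- udpipe_load_model(udpipe_download_model(language = "english")$file_model)

# tibble() replaces the deprecated data_frame()
tibble(doc_id = c(1),
       text = c('the good jwiffels has a very powerful computer')) %>%
  udpipe(., object = udmodel_en) %>% as_tibble() %>%
  mutate(phrase_tag = as_phrasemachine(upos, type = 'upos')) %>%
  group_by(doc_id, paragraph_id, sentence_id) %>%
  do(keywords_phrases(.$phrase_tag, .$token, pattern = "(A|N)+N(P+D*(A|N)*N)*",
                      is_regex = TRUE,
                      ngram_max = 4,
                      detailed = TRUE))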
# A tibble: 2 x 8
# Groups:   doc_id, paragraph_id, sentence_id [1]
  doc_id paragraph_id sentence_id keyword           ngram pattern start   end
  <chr>         <int>       <int> <chr>             <int> <chr>   <int> <int>
1 1                 1           1 good jwiffels         2 AN          2     3
2 1                 1           1 powerful computer     2 AN          7     8

Note that I am still not entirely able to reproduce the output from spaCy:

doc = nlp(u"the good jwiffels has a very powerful computer")
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

the good jwiffels jwiffels nsubj has
a very powerful computer computer dobj has

What do you think? Thanks!

jwijffels commented 5 years ago

Either is equally fine for me, dplyr or data.table, whichever works for you. I'm more of a tinyverse user, still working on R version 2.9. keywords_phrases looks at contiguous sequences of part-of-speech tags, so it does not use the dependency parsing output. If you want chunks based on the dependency parsing output, you should work with the dependency columns themselves (e.g. with the rsyntax package).
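
The difference from spaCy can be seen in a small Python sketch of what matching on contiguous tag sequences amounts to. This is an illustration under stated assumptions, not how keywords_phrases is implemented: the CODES mapping is a simplified guess at what as_phrasemachine produces, the UPOS tags are hand-assigned, contiguous_phrases is a hypothetical helper, and re.finditer returns only non-overlapping matches.

```python
import re

# Simplified, ASSUMED mapping from UPOS tags to one-letter phrase codes;
# the real as_phrasemachine() mapping may differ in detail.
CODES = {"ADJ": "A", "NOUN": "N", "PROPN": "N", "DET": "D", "ADP": "P"}

# The pattern used in the thread.
PATTERN = r"(A|N)+N(P+D*(A|N)*N)*"


def contiguous_phrases(tokens, upos, pattern=PATTERN):
    """Match the pattern on the tag string; return (keyword, tags, start, end)."""
    # One character per token, so regex offsets are token offsets.
    tags = "".join(CODES.get(tag, "O") for tag in upos)
    phrases = []
    for match in re.finditer(pattern, tags):
        keyword = " ".join(tokens[match.start():match.end()])
        # Report 1-based, inclusive positions like the tibble in the thread.
        phrases.append((keyword, match.group(0), match.start() + 1, match.end()))
    return phrases


TOKENS = "the good jwiffels has a very powerful computer".split()
UPOS = ["DET", "ADJ", "NOUN", "VERB", "DET", "ADV", "ADJ", "NOUN"]  # hand-assigned

for row in contiguous_phrases(TOKENS, UPOS):
    print(row)
```

Under these assumptions the tag string is DANODOAN, and the pattern can only start at an A or an N. The leading determiners and the adverb "very" therefore never enter a match, which is why the result is "good jwiffels" and "powerful computer" instead of spaCy's longer chunks.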

jwijffels commented 5 years ago

Closing, as an answer was provided in https://github.com/vanatteveldt/rsyntax/issues/3