crew102 / rapidraker

A fast version of the Rapid Automatic Keyword Extraction (RAKE) algorithm
https://crew102.github.io/slowraker/articles/rapidraker.html

Build rapidraker on top of udpipe? #1

Closed: jwijffels closed this issue 6 years ago

jwijffels commented 6 years ago

Very interesting approach to summarisation. Have you considered building rapidraker on top of the udpipe R package (https://cran.r-project.org/web/packages/udpipe/index.html)? It produces all the annotations needed as input, does not depend on rJava, and is multilingual. That would give rapid summarisation for any language udpipe supports instead of just English. It would also be nice to compare it to the textrank R package, available at https://cran.r-project.org/web/packages/textrank/index.html. What do you think?

crew102 commented 6 years ago

Hey @jwijffels, thanks for the suggestion. I didn't know about udpipe when I was writing slowraker, but I think I probably would have used it instead of openNLP (which depends on Java). As for rapidraker, I actually chose to write the Java back-end because I wanted to learn some Java (it was really a package more for myself than for other people).

I think it would make sense to change slowraker so that it uses udpipe instead of openNLP, as long as I can retain all of the existing functionality. The tricky part will be maintaining a consistent API for the stop_pos argument, because udpipe may not use the same part-of-speech tags as openNLP. I'll take a look though.

jwijffels commented 6 years ago

The output of udpipe has fields called upos (universal parts of speech) and xpos (treebank-specific parts of speech). For English, the xpos values are the POS tags from the Penn Treebank, which is what openNLP also outputs and what you are using in slowraker and rapidraker.

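library(udpipe)

# Download the English model, then load it from the downloaded file
ud_english <- udpipe_download_model("english")
ud_english <- udpipe_load_model(ud_english$file_model)

# Annotate a sentence and view the result as a data frame (one row per token)
x <- udpipe_annotate(ud_english, "some text that has great keywords")
as.data.frame(x)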

  doc_id paragraph_id sentence_id                          sentence token_id    token   lemma upos xpos
1   doc1            1           1 some text that has great keywords        1     some    some  DET   DT
2   doc1            1           1 some text that has great keywords        2     text    text NOUN   NN
3   doc1            1           1 some text that has great keywords        3     that    that PRON  WDT
4   doc1            1           1 some text that has great keywords        4      has    have VERB  VBZ
5   doc1            1           1 some text that has great keywords        5    great   great  ADJ   JJ
6   doc1            1           1 some text that has great keywords        6 keywords keyword NOUN  NNS
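
Since xpos holds Penn Treebank tags for English, a stop_pos-style filter could in principle be applied straight to that column. Below is a minimal sketch of that idea in base R. The data frame is a hand-copied subset of the annotation above (so no model download is needed), and the `stop_pos` value is a hypothetical choice for illustration, not slowraker's default or its actual implementation.

```r
# Hand-built copy of the annotation shown above (subset of columns),
# so the sketch runs without downloading a udpipe model.
ann <- data.frame(
  token = c("some", "text", "that", "has", "great", "keywords"),
  lemma = c("some", "text", "that", "have", "great", "keyword"),
  upos  = c("DET", "NOUN", "PRON", "VERB", "ADJ", "NOUN"),
  xpos  = c("DT", "NN", "WDT", "VBZ", "JJ", "NNS"),
  stringsAsFactors = FALSE
)

# Hypothetical stop_pos value: Penn Treebank tags to treat as stop words.
stop_pos <- c("DT", "WDT", "VBZ")

# Keep only the tokens whose treebank tag is not in stop_pos.
kept <- ann[!ann$xpos %in% stop_pos, ]
kept$token
```

With these assumed tags, only "text", "great" and "keywords" survive the filter. For languages other than English the xpos tagset differs per treebank, so a filter on upos may be the more portable choice there.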

jwijffels commented 6 years ago

Never mind, it's a pretty simple algorithm, so I just implemented it myself on top of udpipe. It's now available at https://github.com/bnosac/udpipe/blob/master/R/nlp_rake.R. Sorry for bothering you about it.
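
To give a sense of why the algorithm is simple to build on top of an annotated token table, here is a rough base R sketch of RAKE-style scoring. It is not the code in nlp_rake.R. The function name `rake_sketch`, its arguments, and the stop tags are all assumptions made for illustration. Candidate phrases are runs of consecutive relevant tokens, each word is scored as degree divided by frequency, and a phrase's score is the sum of its word scores.

```r
# Hypothetical helper: RAKE-style scores from a token vector and a logical
# vector marking which tokens may be part of a keyword.
rake_sketch <- function(term, relevant) {
  # A new candidate phrase starts where a relevant token follows an
  # irrelevant one (or is the first token).
  prev <- c(FALSE, head(relevant, -1))
  phrase_id <- cumsum(relevant & !prev)
  phrases <- split(term[relevant], phrase_id[relevant])

  # Word score = degree / frequency, where degree sums the lengths of the
  # phrases a word occurs in (the word itself included).
  words <- unlist(phrases, use.names = FALSE)
  lens <- rep(lengths(phrases), lengths(phrases))
  freq <- tapply(lens, words, length)
  degree <- tapply(lens, words, sum)
  word_score <- degree / freq

  # Phrase score = sum of the scores of its words.
  keyword <- vapply(phrases, paste, character(1), collapse = " ")
  score <- vapply(phrases, function(p) sum(word_score[p]), numeric(1))

  out <- data.frame(keyword = unname(keyword), rake = unname(score),
                    stringsAsFactors = FALSE)
  out <- out[!duplicated(out$keyword), ]
  out[order(-out$rake), ]
}

# Tokens and Penn Treebank tags copied from the annotation earlier in the thread.
term <- c("some", "text", "that", "has", "great", "keywords")
xpos <- c("DT", "NN", "WDT", "VBZ", "JJ", "NNS")

# Assumed stop tags for this example only.
res <- rake_sketch(term, relevant = !xpos %in% c("DT", "WDT", "VBZ"))
res
```

On this one sentence the sketch ranks "great keywords" (score 4) above "text" (score 1). A real implementation would also handle multiple documents and sentence boundaries, which this sketch ignores.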