Closed by jwijffels 6 years ago
Hey @jwijffels, thanks for the suggestion. I didn't know about udpipe when I was writing slowraker; if I had, I probably would have used it instead of openNLP (which depends on Java). As for rapidraker, I chose to write the Java back end because I wanted to learn some Java (it was really a package more for myself than for other people).
I think it would make sense to change slowraker so that it uses udpipe instead of openNLP, as long as I can retain all of the existing functionality. The tricky part will be maintaining a consistent API for the stop_pos argument, because udpipe may not use the same part-of-speech tags as openNLP. I'll take a look though.
The output of udpipe has two part-of-speech fields: upos (universal parts of speech) and xpos (treebank-specific parts of speech). For English, the xpos values are the POS tags from the Penn Treebank, which is also what openNLP outputs and what you are using in slowraker and rapidraker.
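library(udpipe)
# Download the English model, then load it from the downloaded file
ud_english <- udpipe_download_model("english")
ud_english <- udpipe_load_model(ud_english$file_model)
# Annotate a sample sentence and view the result as a data frame
x <- udpipe_annotate(ud_english, "some text that has great keywords")
as.data.frame(x)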
  doc_id paragraph_id sentence_id                          sentence token_id    token   lemma upos xpos
1   doc1            1           1 some text that has great keywords        1     some    some  DET   DT
2   doc1            1           1 some text that has great keywords        2     text    text NOUN   NN
3   doc1            1           1 some text that has great keywords        3     that    that PRON  WDT
4   doc1            1           1 some text that has great keywords        4      has    have VERB  VBZ
5   doc1            1           1 some text that has great keywords        5    great   great  ADJ   JJ
6   doc1            1           1 some text that has great keywords        6 keywords keyword NOUN  NNS
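Because xpos carries Penn Treebank tags for English, a stop_pos-style filter could in principle be applied directly to that column. The following is a minimal sketch under that assumption, not slowraker's actual implementation: the data frame is a hand-built copy of the output above (so it runs without udpipe or a model download), and both the `drop_stop_pos` helper and the tag set passed to it are hypothetical.

```r
# Hand-built copy of the relevant columns from the udpipe output above,
# so this sketch runs without udpipe or a model download.
tokens <- data.frame(
  token = c("some", "text", "that", "has", "great", "keywords"),
  lemma = c("some", "text", "that", "have", "great", "keyword"),
  upos  = c("DET", "NOUN", "PRON", "VERB", "ADJ", "NOUN"),
  xpos  = c("DT", "NN", "WDT", "VBZ", "JJ", "NNS"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop tokens whose Penn Treebank tag is in stop_pos.
drop_stop_pos <- function(tokens, stop_pos) {
  tokens[!tokens$xpos %in% stop_pos, , drop = FALSE]
}

# Illustrative tag set (verb tags only); an assumption, not a package default.
stop_pos <- c("VB", "VBD", "VBG", "VBN", "VBP", "VBZ")
kept <- drop_stop_pos(tokens, stop_pos)
kept$token
```

With this tag set only "has" (VBZ) is removed. A caller who already passes Penn Treebank tags would not need to change anything, which is the API-compatibility point made above.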
Never mind, it's a pretty simple algorithm, so I just implemented it myself on top of udpipe. It's now available at https://github.com/bnosac/udpipe/blob/master/R/nlp_rake.R. Sorry for bothering you about it.
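For readers unfamiliar with the algorithm, here is a rough base-R sketch of RAKE-style scoring on annotated tokens. It is not the code in nlp_rake.R or in slowraker. It assumes candidate phrases are runs of consecutive "relevant" tokens (here nouns and adjectives, chosen for illustration), that each word scores degree/frequency, and that a phrase scores the sum of its word scores. The `rake_sketch` function and the hand-built `tokens` data are hypothetical.

```r
# Hand-built lemmas and universal POS tags for
# "some text that has great keywords" (no udpipe needed to run this).
tokens <- data.frame(
  lemma = c("some", "text", "that", "have", "great", "keyword"),
  upos  = c("DET", "NOUN", "PRON", "VERB", "ADJ", "NOUN"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: RAKE-style scores for one document.
rake_sketch <- function(lemma, relevant) {
  if (!any(relevant)) {
    return(data.frame(keyword = character(0), rake = numeric(0)))
  }
  # A new run starts wherever the relevant flag flips; each run of
  # relevant tokens becomes one candidate phrase.
  flips  <- relevant[-1] != relevant[-length(relevant)]
  run_id <- cumsum(c(TRUE, flips))
  phrases <- unname(split(lemma[relevant], run_id[relevant]))

  # Word frequency and degree (summed length of the phrases containing it).
  words  <- unlist(phrases)
  lens   <- rep(lengths(phrases), lengths(phrases))
  freq   <- tapply(lens, words, length)
  degree <- tapply(lens, words, sum)
  word_score <- degree / freq

  # Phrase score is the sum of its word scores.
  out <- data.frame(
    keyword = vapply(phrases, paste, character(1), collapse = " "),
    rake    = vapply(phrases, function(p) sum(word_score[p]), numeric(1)),
    stringsAsFactors = FALSE
  )
  out <- out[!duplicated(out$keyword), , drop = FALSE]
  out[order(-out$rake), , drop = FALSE]
}

res <- rake_sketch(tokens$lemma, tokens$upos %in% c("NOUN", "ADJ"))
res
```

On this sentence the candidates are "text" and "great keyword". The two-word phrase ranks first because each of its words has degree 2.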
Very interesting way of summarisation. Have you considered building rapidraker on top of the udpipe R package (https://cran.r-project.org/web/packages/udpipe/index.html)? It does all the annotations that are needed as input, does not depend on rJava, and is multilingual. This would give rapid summarisation for any language instead of just English. It would also be nice to compare it to the textrank R package, available at https://cran.r-project.org/web/packages/textrank/index.html. What do you think?