bnosac / udpipe

R package for Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing Based on the UDPipe Natural Language Processing Toolkit
https://bnosac.github.io/udpipe/en
Mozilla Public License 2.0

Error when using sep = "_" in function keywords_collocation #56

Closed love-borjeson closed 5 years ago

love-borjeson commented 5 years ago

Setting the argument sep to e.g. an underscore ("_") when using keywords_collocation creates weird results downstream when using the function txt_recode_ngram: only a fraction of the multi-word expressions (MWEs) are stored in the term column. Not a biggie, " " can easily be replaced with "_" in the dtf using e.g. gsub. The hard part was isolating the culprit, which seems to be the argument sep. I don't know where the actual error is, but I would guess it is in the function txt_recode_ngram.
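The gsub workaround mentioned above could look roughly like the sketch below. The small dtf data frame is made up for illustration; it only assumes a character column named term, as in the output of document_term_frequencies.

```r
# Hypothetical stand-in for the document/term/frequency table;
# the values are invented for illustration only.
dtf <- data.frame(
  doc_id = c("1", "1", "2"),
  term   = c("salle bain", "lit", "salle bain"),
  freq   = c(2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Replace the default space separator with an underscore after the fact.
# fixed = TRUE treats " " as a literal string rather than a regular expression.
dtf$term <- gsub(" ", "_", dtf$term, fixed = TRUE)

print(dtf$term)
```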

jwijffels commented 5 years ago

Thanks for the feedback. Can you provide a reproducible example of the weird behaviour you are encountering?

love-borjeson commented 5 years ago

First of all: udpipe is awesome!

Here's the code:

library(udpipe)
ud_model <- udpipe_download_model(language = "french")
data(brussels_reviews)
comments <- subset(brussels_reviews, language %in% "fr")

ud_model <- udpipe_load_model(ud_model$file_model)
x <- udpipe_annotate(ud_model, x = comments$feedback, doc_id = comments$id)
x <- as.data.frame(x)

x$topic_level_id <- unique_identifier(x, fields = c("doc_id", "paragraph_id", "sentence_id"))

# Collocations...

x_n_a <- subset(x, upos %in% c("NOUN", "ADJ"))

colloc <- keywords_collocation(x_n_a, term = "lemma", group = c("sentence_id"), ngram_max = 2, n_min = 10, sep = "_")

x_n_a$term <- x_n_a$lemma
x_n_a$term <- txt_recode_ngram(x_n_a$term, compound = colloc$keyword, ngram = colloc$ngram)

x_n_a$term <- ifelse(x_n_a$upos %in% c("NOUN", "ADJ"), x_n_a$term, ifelse(x_n_a$term %in% c(colloc$keyword), x_n_a$term, NA))

dtf <- document_term_frequencies(x_n_a, document = "doc_id", term = "term") # model on docs.
which(dtf == "salle", arr.ind=TRUE)
which(dtf == "salle_bain", arr.ind=TRUE)
# The collocations do not get recoded into the dtf properly.

# Redo it without the 'sep' argument

colloc <- keywords_collocation(x_n_a, term = "lemma", group = c("sentence_id"), ngram_max = 2, n_min = 10) # Here, skip the argument 'sep' to see the difference.

x_n_a$term <- x_n_a$lemma
x_n_a$term <- txt_recode_ngram(x_n_a$term, compound = colloc$keyword, ngram = colloc$ngram)

x_n_a$term <- ifelse(x_n_a$upos %in% c("NOUN", "ADJ"), x_n_a$term, ifelse(x_n_a$term %in% c(colloc$keyword), x_n_a$term, NA))

dtf <- document_term_frequencies(x_n_a, document = "doc_id", term = "term") # model on docs.
which(dtf == "salle", arr.ind=TRUE)
which(dtf == "salle bain", arr.ind=TRUE)
# Here the collocations are included in the dtf.

jwijffels commented 5 years ago

Thanks for the report, I will look into it next week.

jwijffels commented 5 years ago

If you use keywords_collocation with argument sep = "_", you also need to use txt_recode_ngram with argument sep = "_". So your code should be as follows. Note that sep = "_" is used twice:

colloc <- keywords_collocation(x_n_a, term = "lemma", group = c("sentence_id"), ngram_max = 2, n_min = 10, sep = "_")
x_n_a$term <- x_n_a$lemma
x_n_a$term <- txt_recode_ngram(x_n_a$term, compound = colloc$keyword, ngram = colloc$ngram, sep = "_")
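
To see why the separator has to match in both calls, here is a minimal base R sketch. It is not udpipe's implementation: recode_bigrams is a hypothetical toy function, written on the assumption that recoding works by pasting each token to its neighbour with sep and looking the result up in the list of compounds. The tokens are made up for illustration.

```r
# Toy stand-in for bigram recoding; a hypothetical helper, NOT udpipe's code.
recode_bigrams <- function(x, compound, sep = " ") {
  # Paste each token to its right-hand neighbour using `sep`.
  nxt <- c(x[-1], NA_character_)
  bigram <- ifelse(is.na(nxt), NA_character_, paste(x, nxt, sep = sep))
  # Positions where the pasted bigram is a known compound.
  hit <- which(bigram %in% compound)
  out <- x
  out[hit] <- bigram[hit]          # first token becomes the compound
  out[hit + 1] <- NA_character_    # second token is blanked out
  out
}

tokens   <- c("grand", "salle", "bain", "propre")
compound <- "salle_bain"  # a keyword as it would look when built with sep = "_"

# Default sep = " " builds "salle bain", which never equals "salle_bain".
mismatched <- recode_bigrams(tokens, compound)
# Using the same separator on both sides lets the lookup succeed.
matched <- recode_bigrams(tokens, compound, sep = "_")

print(mismatched)
print(matched)
```

In this toy version the mismatched call leaves the tokens unchanged, while the matched call merges "salle" and "bain" into a single term.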