bnosac / textrank

Summarise text by finding relevant sentences and keywords using the Textrank algorithm

unique sentence_id not possible #1

Closed sanjmeh closed 6 years ago

sanjmeh commented 6 years ago

Refer to your textrank vignette for the textrank_sentences function, specifically the command

tr <- textrank_sentences(data = sentences, terminology = terminology)

This now returns an error: Error: sum(duplicated(data[, 1])) == 0 is not TRUE

The reason for this appears to be non-unique sentence_ids.

The sentence_ids cannot be unique because the counter is reset after every paragraph in any CoNLL-U format file.
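To make the point concrete, here is a minimal sketch in base R. The table is invented for illustration and only mimics the doc_id, paragraph_id and sentence_id columns of an annotation, assuming the sentence counter restarts as described above. It applies the same duplicated() test that appears in the error message.

```r
# Toy annotation table; the values are made up for illustration only.
x <- data.frame(
  doc_id       = c("doc1", "doc1", "doc1", "doc2"),
  paragraph_id = c(1L, 1L, 2L, 1L),
  sentence_id  = c(1L, 2L, 1L, 1L),
  sentence     = c("First sentence.", "Second sentence.",
                   "New paragraph.", "Other document."),
  stringsAsFactors = FALSE
)

# sentence_id on its own repeats across paragraphs and documents
n_dup_sentence_id <- sum(duplicated(x$sentence_id))

# the combination of the three fields identifies each sentence uniquely
n_dup_combined <- sum(duplicated(x[, c("doc_id", "paragraph_id", "sentence_id")]))
```

In this toy table sentence_id alone fails the check, while the three columns taken together pass it.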

Either your vignette is incomplete or there is a bug in textrank_sentences.

jwijffels commented 6 years ago

Please have a look at the documentation. It mentions that in data you need to provide: "a data.frame with 1 row per sentence where the first column is an identifier of a sentence (e.g. textrank_id) and the second column is the raw sentence. See the example." And in terminology, you need to provide: "a data.frame with one row per token indicating which token is part of each sentence. The first column in this data.frame is the identifier which corresponds to the first column of data and the second column indicates the token which is part of the sentence which will be passed on to textrank_dist. See the example."

If your data is not in that format, convert it to that format first, e.g. as follows:

library(udpipe)
library(textrank)
data("brussels_reviews", package = "udpipe")

udmodel <- udpipe_download_model("spanish")
udmodel <- udpipe_load_model(udmodel$file_model)

x <- udpipe_annotate(udmodel, brussels_reviews$feedback[1:10])
x <- as.data.frame(x)
x$textrank_id <- unique_identifier(x, fields = c("doc_id", "paragraph_id", "sentence_id"))

result <- textrank_sentences(data = unique(x[, c("textrank_id", "sentence")]), 
                             terminology = x[, c("textrank_id", "lemma")])
result

sanjmeh commented 6 years ago

Thank you, it was the unique_identifier() function that solved the problem. I had created a "compound key" column using paste(doc_id, paragraph_id, sentence_id), but it seems the character column was not accepted. It would be great if you could edit the vignette and add the unique identifier step, i.e. x$textrank_id <- unique_identifier(x, fields = c("doc_id", "paragraph_id", "sentence_id")). It will help anyone else who tries the same and gets stuck like I did. Thank you.
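For anyone who still wants to try a paste()-based key, the sketch below checks it against the same condition that raised the error. It uses base R only, and the token table and the key column name are hypothetical. It does not verify whether textrank_sentences accepts a character identifier. It only shows that a token-level table repeats the key, so it has to be reduced to one row per sentence with unique() before it is passed as data.

```r
# Toy token-level table; column names follow the udpipe output used above,
# the values are invented for illustration.
tokens <- data.frame(
  doc_id       = c("doc1", "doc1", "doc1", "doc2", "doc2"),
  paragraph_id = c(1L, 1L, 2L, 1L, 1L),
  sentence_id  = c(1L, 1L, 1L, 1L, 1L),
  sentence     = c("Nice flat.", "Nice flat.", "Great host.",
                   "Very clean.", "Very clean."),
  lemma        = c("nice", "flat", "host", "very", "clean"),
  stringsAsFactors = FALSE
)

# Hypothetical compound key built with paste(); the separator avoids
# accidental collisions between concatenated numbers.
tokens$key <- paste(tokens$doc_id, tokens$paragraph_id, tokens$sentence_id,
                    sep = "_")

# Token-level rows repeat the key, so they fail the check from the error message
token_level_ok <- sum(duplicated(tokens[, "key"])) == 0

# One row per sentence for `data`, one row per token for `terminology`
sentences   <- unique(tokens[, c("key", "sentence")])
terminology <- tokens[, c("key", "lemma")]
sentence_level_ok <- sum(duplicated(sentences[, 1])) == 0
```

Here token_level_ok is FALSE and sentence_level_ok is TRUE, which matches the requirement quoted above that data holds one row per sentence.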