bnosac / textrank

Summarise text by finding relevant sentences and keywords using the Textrank algorithm

Might I suggest #6

Closed emillykkejensen closed 5 years ago

emillykkejensen commented 5 years ago

Thanks for a great package.

When running textrank_sentences() on very large datasets, textrank_candidates_all() (in particular the utils::combn() call inside it) can't cope and throws an error. Therefore I have built a simpler textrank_candidates_all(), which I believe does the same job but is faster and more memory efficient.

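textrank_candidates_all2 <- function(x){

  # Keep each id once and drop missing values
  x <- unique(x)
  x <- setdiff(x, NA)

  x_length <- length(x)

  # Pair each id with every id that follows it.
  # seq_along() is used instead of seq() so a single numeric id is not expanded to 1:x.
  dtlist <- lapply(seq_along(x)[-x_length], function(i){
    data.table::data.table(textrank_id_1 = x[i], textrank_id_2 = x[(i+1L):x_length])
  })

  # Stack the per-id tables and convert back to a plain data.frame
  candidates <- data.table::rbindlist(dtlist)
  candidates <- data.table::setDF(candidates)

  return(candidates)

}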

To compare with the old, try running:

id_list <- 1:100000

textrank_candidates <- textrank_candidates_all(id_list)
textrank_candidates2 <- textrank_candidates_all2(id_list)

Here I get an error using textrank_candidates_all() but not using textrank_candidates_all2().

If you lower the number of IDs and run it again, you will see a big performance difference between the two functions:
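A rough way to see why the 100,000-id case is so much harder than the 3,000-id case: the number of candidate pairs grows quadratically with the number of ids. The helper name n_pairs below is hypothetical, and whether this count is the actual cause of the combn() error is an assumption, not something confirmed in this thread.

```r
# Number of unordered pairs among n ids: n * (n - 1) / 2
n_pairs <- function(n) choose(n, 2)

n_pairs(3000)    # 4,498,500 pairs
n_pairs(100000)  # 4,999,950,000 pairs

# The larger count is beyond the largest value an R integer can hold
n_pairs(100000) > .Machine$integer.max  # TRUE
```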

id_list <- 1:3000

system.time(textrank_candidates <- textrank_candidates_all(id_list))
system.time(textrank_candidates2 <- textrank_candidates_all2(id_list))

which gives me:

For textrank_candidates_all()

   user  system elapsed 
  9.305   0.474  15.236

…and textrank_candidates_all2()

   user  system elapsed 
  0.922   0.004   1.397

Finally, the two functions seem to output the same values:

identical(textrank_candidates, textrank_candidates2) is TRUE

So, you could consider implementing this function if you please.

jwijffels commented 5 years ago

Thanks. Incorporated it into the package. If you can test, I'll upload the changes to CRAN.

emillykkejensen commented 5 years ago

Just tried it out - and it works fine :)

jwijffels commented 5 years ago

Ok. Great. I'll upload to CRAN tomorrow.

jwijffels commented 5 years ago

I've updated the code: with fewer than 200 sentences, combn is used; with more than 200 sentences, data.table is used, as that seems to be the tipping point where one approach becomes faster than the other. Pushed to CRAN now.
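For readers who want to picture the size-based switch described above, here is a minimal sketch. It is not the package's actual code: the function name candidates_sketch and the threshold argument are hypothetical, and the large-input branch uses base R as a stand-in for the data.table approach so the example runs without extra packages.

```r
# Hypothetical dispatcher; names and the default threshold are illustrative only.
candidates_sketch <- function(x, threshold = 200L) {
  # Keep each id once and drop missing values
  x <- unique(x)
  x <- setdiff(x, NA)
  n <- length(x)

  # Fewer than two ids: no pairs to form
  if (n < 2L) {
    return(data.frame(textrank_id_1 = x[0], textrank_id_2 = x[0],
                      stringsAsFactors = FALSE))
  }

  if (n < threshold) {
    # Small input: combn() returns a 2-row matrix, one column per pair
    pairs <- utils::combn(x, m = 2)
    id_1 <- pairs[1, ]
    id_2 <- pairs[2, ]
  } else {
    # Large input: pair each id with every id that follows it
    # (base R stand-in for the data.table branch)
    id_1 <- rep(x[-n], times = (n - 1L):1L)
    id_2 <- unlist(lapply(2:n, function(i) x[i:n]))
  }

  data.frame(textrank_id_1 = id_1, textrank_id_2 = id_2,
             stringsAsFactors = FALSE)
}

# Force each branch on the same input and compare the results
small_branch <- candidates_sketch(1:50, threshold = 200L)
large_branch <- candidates_sketch(1:50, threshold = 2L)
identical(small_branch, large_branch)
```

Both branches emit pairs in the same order, so the comparison at the end is a quick way to check that switching on input size does not change the result.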