bnosac / textrank

Summarise text by finding relevant sentences and keywords using the Textrank algorithm

Rewrote textrank_sentences() #7

Open emillykkejensen opened 5 years ago

emillykkejensen commented 5 years ago

I rewrote textrank_sentences() as it could not handle my dataset (it ran for 3 days without finishing). In doing so I added pbapply to show progress for the sentence_dist function as well as to enable parallelization.

A pretty solid upgrade to an already pretty solid function, if I may say so myself!
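(For reference, the pattern described above boils down to something like the following sketch; `candidate_pairs`, `id1`/`id2` and `sentence_dist()` are illustrative stand-ins, not the actual PR code.)

```r
library(pbapply)
library(parallel)

## Illustrative only: `candidate_pairs` holds one row per sentence pair and
## `sentence_dist()` scores a single pair.
cl <- makeCluster(detectCores() - 1L)
clusterExport(cl, c("candidate_pairs", "sentence_dist"))
weights <- pbsapply(seq_len(nrow(candidate_pairs)), function(i) {
  sentence_dist(candidate_pairs$id1[i], candidate_pairs$id2[i])
}, cl = cl)  # pbapply draws a progress bar; cl spreads the work over cores
stopCluster(cl)
```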

jwijffels commented 5 years ago

I think you should really test out the minhash algorithm. That is more of a solution when you have large volumes of sentences: if you don't use it, all pairwise sentence similarities have to be calculated.
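The textrank package already supports this via textrank_candidates_lsh() together with the textreuse package, roughly along the lines of the package documentation (the `sentences`/`terminology` objects and the parameter values below are assumptions):

```r
library(textrank)
library(textreuse)

## Sketch based on the textrank documentation; `sentences` and `terminology`
## are assumed to be in the format textrank_sentences() already expects.
minhash <- minhash_generator(n = 1000, seed = 123456789)
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
## Only the LSH candidate pairs get scored instead of all n*(n-1)/2 combinations
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
```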

jwijffels commented 5 years ago

On another note, some advice on reducing the dimensionality of the number of sentences: it is better to apply text clustering first (e.g. using the topicmodels R package or the BTM package, https://cran.r-project.org/web/packages/BTM/index.html) and to apply textrank inside each cluster.
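A rough sketch of that pipeline (the column names, the choice of k, and the hard topic assignment are assumptions, not a prescribed recipe):

```r
library(BTM)
library(textrank)

## `tokens` is assumed to be a data.frame with one row per token and two
## columns (sentence identifier, token), as BTM() expects.
model  <- BTM(tokens, k = 25)               # cluster sentences into 25 topics
scores <- predict(model, newdata = tokens)  # topic scores per sentence
topic  <- apply(scores, 1, which.max)       # hard-assign each sentence to a topic

## Apply textrank inside each (much smaller) cluster
summaries <- lapply(split(names(topic), topic), function(ids) {
  textrank_sentences(data        = sentences[sentences$textrank_id %in% ids, ],
                     terminology = terminology[terminology$textrank_id %in% ids, ])
})
```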

emillykkejensen commented 5 years ago

Thanks for the advice - I'm trying out different approaches and have had a look at the minhash algorithm - but running the textrank_candidates_lsh function by itself takes longer than running the rewritten textrank_sentences. And that's only when it runs at all - if I run it on all my 12,000 sentences, it fails and throws an error.

I also had a look at the BTM package, but again it takes a long time to complete. Really, the fastest way to do it is using textrank_sentences.

jwijffels commented 5 years ago

I've read through the changes a bit. Am I correct that the speed difference basically comes from calculating the overlap in batches by groups of textrank_ids and from parallelising the mapply loop?

emillykkejensen commented 5 years ago

Not really - I actually haven't even used the parallelisation. The reason is that I used data.table and thus modification by reference = lower memory use and faster speed.
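(For readers unfamiliar with the idiom, a toy illustration of modification by reference - not the PR code; the data and the jaccard scorer are made up:)

```r
library(data.table)

## `:=` adds the weight column in place, so the (potentially huge) table of
## sentence pairs is never copied - that is where the memory/speed win is.
pairs <- data.table(id1 = c(1L, 1L, 2L), id2 = c(2L, 3L, 3L))
terms <- list(c("the", "cat", "sat"), c("the", "cat"), c("a", "dog"))
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))
pairs[, weight := mapply(function(i, j) jaccard(terms[[i]], terms[[j]]),
                         id1, id2)]
```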

jwijffels commented 5 years ago

Ok, but in that case, can you drop the usage of the pbapply package? In general I'm against adding package dependencies which are not needed; a dependency on another package seems like overkill to me here. Why not add a simple trace argument and print out something every, say, 1000 comparisons? That removes another dependency which might give maintenance problems later on.
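Something like this minimal pattern would do (the function and argument names are illustrative, not from the package):

```r
## Sketch of the suggested trace argument; `pairs` and `dist_fun` stand in
## for the real objects inside textrank_sentences().
score_pairs <- function(pairs, dist_fun, trace = FALSE) {
  n <- nrow(pairs)
  out <- numeric(n)
  for (i in seq_len(n)) {
    out[i] <- dist_fun(pairs$id1[i], pairs$id2[i])
    if (trace && i %% 1000 == 0) {
      cat(format(Sys.time()), "- processed", i, "of", n, "comparisons\n")
    }
  }
  out
}
```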

emillykkejensen commented 5 years ago

That is a good principle - one I tend to stick to as well, but I guess I got carried away :)

I'll have a look at it and write the pbapply package out.

jwijffels commented 5 years ago

great

emillykkejensen commented 5 years ago

Removed the use of pbapply and replaced it with cat - it's not as pretty, but it gets the job done if you want to monitor the progress of the function.

jwijffels commented 5 years ago

Thanks, I'm going to review this soon and incorporate it.

jwijffels commented 5 years ago

I've reviewed your code and updated it according to what I thought was more readable. Can you try it out on your own dataset and let me know if this is fine?