emillykkejensen opened 5 years ago
I think you should really test out the minhash algorithm. It is more of a solution if you have large volumes of sentences: without it, all pairwise sentence similarities have to be calculated.
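For reference, this is roughly how the LSH/minhash route looks in textrank, following the pattern from the package documentation; the `sentences` and `terminology` data.frames (with textrank_id/sentence and textrank_id/lemma columns) are assumed to exist already:

```r
library(textrank)
library(textreuse)
## Build a minhash function; n and seed are arbitrary choices here
minhash <- minhash_generator(n = 1000, seed = 123456789)
## LSH only proposes candidate pairs which are likely similar,
## instead of scoring all pairwise combinations of sentences
candidates <- textrank_candidates_lsh(x = terminology$lemma,
                                      sentence_id = terminology$textrank_id,
                                      minhashFUN = minhash,
                                      bands = 500)
tr <- textrank_sentences(data = sentences, terminology = terminology,
                         textrank_candidates = candidates)
```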
On another note, some advice on reducing the number of sentences: it is better to use text clustering first (e.g. using the topicmodels R package or the BTM package: https://cran.r-project.org/web/packages/BTM/index.html) and apply textrank within each cluster.
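A sketch of what that could look like, assuming tokenised input in a `tokens` data.frame whose first two columns are doc_id and lemma, plus a matching `sentences` data.frame for textrank (the column names here are illustrative):

```r
library(BTM)
library(textrank)
## Cluster sentences first with a biterm topic model
model  <- BTM(tokens, k = 10, iter = 500)
scores <- predict(model, newdata = tokens)   # document x topic probabilities
topic  <- apply(scores, 1, which.max)        # hard assignment per sentence
## Then apply textrank inside each cluster separately, which keeps the
## number of pairwise comparisons per run small
summaries <- lapply(split(names(topic), topic), function(ids) {
  textrank_sentences(data        = sentences[sentences$textrank_id %in% ids, ],
                     terminology = tokens[tokens$doc_id %in% ids, ])
})
```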
Thanks for the advice. I'm trying out different approaches and have had a look at the minhash algorithm, but running the textrank_candidates_lsh function itself takes longer than running the rewritten textrank_sentences. And that's only when it runs at all: if I run it on all of my 12,000 sentences, it fails and throws an error.
I also had a look at the BTM package, but again it takes a long time to complete. Really, the fastest way to do it is using the rewritten textrank_sentences.
I've read through the changes a bit. Am I correct that the speed difference is basically because you calculate the overlap in batches by groups of textrank_ids and because you parallelise the mapply loop?
Not really - I actually haven't even used the parallelisation. The reason is that I have used data.table and thus update by reference = lower memory and faster speed.
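For readers of the thread, this is the data.table idiom being referred to (a toy illustration, not the actual PR code):

```r
library(data.table)
dt <- data.table(textrank_id = 1:5, overlap = 0L)
## `:=` updates the column in place, by reference: no copy of dt is made,
## whereas base-R style dt$overlap <- dt$overlap + 1L would copy the data
dt[, overlap := overlap + 1L]
```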
Ok, but in that case, can you drop the usage of the pbapply package? In general I'm against adding package dependencies which are not needed; a dependency on another package seems like overkill here. Why not add a simple trace argument and print out something every, say, 1000 comparisons? That removes another dependency which might give maintenance problems later on.
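Something along these lines, presumably; `FUN` is a stand-in for whatever pairwise comparison textrank performs, and the function name is made up for illustration:

```r
## A minimal sketch of a trace argument using plain cat()
compare_pairs <- function(pairs, FUN, trace = FALSE) {
  n <- nrow(pairs)
  scores <- numeric(n)
  for (i in seq_len(n)) {
    scores[i] <- FUN(pairs$id1[i], pairs$id2[i])
    ## print a heartbeat every 1000 comparisons, no extra dependency needed
    if (trace && i %% 1000 == 0)
      cat(sprintf("%d/%d comparisons done\n", i, n))
  }
  scores
}
```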
That is a good principle - one I tend to stick to as well, but I guess I got carried away :)
I'll have a look at it and write the pbapply package out.
great
Removed the use of pbapply and replaced it with cat - it's not as pretty, but it gets the job done if you want to monitor the progress of the function.
Thanks, I'm going to review this soon and incorporate it.
I've reviewed your code and updated it where I thought it could be more readable. Can you try it out on your own dataset and let me know if this is fine?
I rewrote textrank_sentences() as it could not handle my dataset (it ran for 3 days without finishing). In doing so I added pbapply to show progress for the sentence_dist function as well as to enable parallelization.
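The pattern used was along these lines (with a toy stand-in for sentence_dist, not the actual PR code):

```r
library(pbapply)
library(parallel)
cl <- makeCluster(detectCores() - 1L)
## pblapply shows a progress bar and, when given `cl`, distributes the
## work over the cluster workers
distances <- pblapply(seq_len(10000),
                      function(i) sqrt(i),   # placeholder for sentence_dist
                      cl = cl)
stopCluster(cl)
```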
A pretty solid upgrade to an already pretty solid function, if I do say so myself!