Similarity calculation is too slow

bkrdmr commented 2 years ago

I have a Twitter V2 dataset, collected via Twarc2, with approximately 9 million tweets. I am using the entire workflow as described in the README and function docs. In general, network computations work fine.

For co-similar-tweet networks, the similarity computation takes too long, especially when the window is set to ~1 hour. The threshold is 0.9. At the time of writing, it has been running for 5 hours and is still going.

I've also tried 5-second and 1-minute windows, and they worked OK. But for meaningful results, I assume longer time frames will be more useful; the results for co-tweet behavior usually suggest the same.

Do you think this is only because of the size of my dataset? Or could there be another underlying problem?

SamHames commented 2 years ago

> Do you think this is only because of the size of my dataset?

That sounds about right to me - co-similarity is the most compute-intensive thing the toolkit can do, and it can't be optimised in the same way that all the other network types can. It should finish eventually though.
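
To give a rough feel for why the window size matters so much, here is a minimal sketch that counts how many message pairs fall inside a time window. It is not the toolkit's code: `count_window_pairs` and the evenly spaced timestamps are hypothetical, and it assumes every pair inside the window needs a similarity comparison.

```python
from bisect import bisect_right


def count_window_pairs(timestamps, window_seconds):
    """Count pairs of messages posted at most window_seconds apart.

    Hypothetical helper for illustration only; it assumes each such
    pair would need its own similarity comparison.
    """
    ts = sorted(timestamps)
    pairs = 0
    for i, t in enumerate(ts):
        # Index one past the last message still inside the window starting at t.
        j = bisect_right(ts, t + window_seconds)
        pairs += j - i - 1
    return pairs


if __name__ == "__main__":
    # Made-up stream: one message per second for an hour.
    stream = list(range(3600))
    for window in (5, 60, 3600):
        print(window, count_window_pairs(stream, window))
```

With a fixed message rate, the number of candidate pairs grows roughly in proportion to the window length, so going from a 1-minute to a 1-hour window multiplies the work many times over.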

Anecdotally, from other projects, it also turns out that the low-hanging fruit on Twitter is co-tweets, not co-similarity. The default text handling for co-tweet networks ignores @mentions and text case, so it should catch many types of copy-paste behaviour, whether it's reply spam or something else. It also lets you expand the time window to days, because the indexes are very effective at limiting the work to only the necessary comparisons.
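
As a rough illustration of that idea (not the toolkit's actual implementation), the sketch below normalises text by dropping @mentions and case, then groups messages by the normalised text so that only messages with identical text are ever compared. The names `normalise` and `co_tweet_pairs`, the regular expression, and the `(user, timestamp, text)` tuple layout are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Assumed mention pattern for illustration; the toolkit's rules may differ.
MENTION = re.compile(r"@\w+")


def normalise(text):
    """Drop @mentions, lowercase, and collapse whitespace."""
    return " ".join(MENTION.sub(" ", text).lower().split())


def co_tweet_pairs(messages, window_seconds):
    """Return pairs of distinct users who posted the same normalised text
    within window_seconds of each other.

    messages: iterable of (user, timestamp_seconds, text) tuples.
    """
    # The dictionary acts as the index: only messages sharing a key are compared.
    by_text = defaultdict(list)
    for user, ts, text in messages:
        key = normalise(text)
        if key:  # skip messages that were nothing but mentions
            by_text[key].append((ts, user))

    pairs = set()
    for posts in by_text.values():
        posts.sort()
        for i, (ts_a, user_a) in enumerate(posts):
            for ts_b, user_b in posts[i + 1:]:
                if ts_b - ts_a > window_seconds:
                    break  # posts are time-sorted, so later ones are further away
                if user_a != user_b:
                    pairs.add(tuple(sorted((user_a, user_b))))
    return pairs


if __name__ == "__main__":
    sample = [
        ("alice", 0, "@bob Buy NOW"),
        ("carol", 30, "buy now @dave"),
        ("erin", 5000, "Buy now"),
        ("frank", 10, "something else"),
    ]
    print(co_tweet_pairs(sample, window_seconds=3600))
```

Because exact matches can be looked up by key, widening the window only adds comparisons within groups of identical text, which is why this approach stays cheap where pairwise similarity does not.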

bkrdmr commented 2 years ago

This is helpful. Thank you for your detailed response.

QUT-Digital-Observatory / coordination-network-toolkit

Similarity calculation is too slow #41