how to scrape a large body of tweets (e.g. a million)?

aliiabbasi commented 4 years ago

What exactly are the criteria for the extracted top tweets? Is it the number of likes or the number of retweets? Also, is it possible to extract a large body of tweets (e.g. a million) using this library? I tried it for almost 800K tweets, but it neither responded nor raised an error!

julienbeisel commented 4 years ago

hey @aliiabbasi

I am currently trying to scrape tweets containing the word "coronavirus" for sentiment analysis, and I can explain my workaround; maybe it will give you some ideas :)

I tried to download every tweet since January, but it didn't work well: 24 hours wasn't enough. Splitting the request by day is probably the right way to do it, but I don't want to wait approximately 2 days (the full request for a single day took 1 hour!). So what I am doing now is scraping 1000 "top tweets" for each day.

What I'm observing is:

When I set the date to January 22, I first get tweets from late at night on January 22, then tweets from earlier that day, back to 00:00.
When I set the topTweets argument to "True", I only get tweets with high retweet and favorite counts, and they are still in the same order.

example :

When there are no top tweets left to scrape, the remaining results are simply all the tweets the request would return if I hadn't specified "topTweets".

I'm still looking for ways to analyze COVID-19-related tweets, though. If you find better ways to scrape a lot of tweets, don't hesitate to comment.

anshumanchak commented 4 years ago

How are the topTweets criteria calculated? Is it a direct sort by retweets, or a combination of retweets, favourites and replies? @aliiabbasi

victorperezc commented 4 years ago

@anshumanchak The topTweets criteria are determined by Twitter. You can have Twitter show you the tweets it considers most relevant; these are normally the ones with high engagement on the network.

anshumanchak commented 4 years ago

Thanks @Victorpc98

AshtonCoop commented 4 years ago

I'm trying to do something similar. For 1000 tweets, it took me 2 min 35 seconds, so imagine 1 million tweets... Some ways I can think of:

use multiprocessing
use setTopTweets to limit results
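
At the rate quoted above (155 seconds per 1000 tweets), and assuming the time scales linearly, a million tweets would take roughly 43 hours in a single sequential run. A minimal sketch of the parallel idea follows. It uses a thread pool from the standard library rather than the `multiprocessing` module, on the assumption that scraping is network-bound; `fetch_day` is a hypothetical placeholder for whatever function actually downloads one day of tweets, and is not part of the library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Sequence


def scrape_days_in_parallel(
    fetch_day: Callable[[str], List[str]],
    days: Sequence[str],
    max_workers: int = 4,
) -> Dict[str, List[str]]:
    """Run fetch_day once per day, several days at a time.

    fetch_day is a hypothetical callable (an assumption, not a library
    function) that takes a date string and returns that day's tweets.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the input days.
        results = list(pool.map(fetch_day, days))
    return dict(zip(days, results))
```

Keeping `max_workers` small is probably wise, since later comments in this thread report that queries can stop working when too many are sent in a row.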

SRaina11 commented 4 years ago

Hi @julienbeisel

I'm currently trying to scrape tweets for sentiment analysis. I'm also using setNear("County in Ireland") and setQuerySearch("Covid19" or "covid" or "corona" or ......), but I'm not getting many tweets. Is there a way to scrape more tweets? Thanks

julienbeisel commented 4 years ago

Hi @SRaina11

When I did this project I had the same issue. If I remember correctly, you can query tweets over shorter time periods and get more tweets in total. For example, instead of requesting the tweets from the last 6 months in one query, you can split it into 6x4 queries, one per week, or into 6x4x7 queries, one per day. Sometimes this didn't work well and the scraper stopped returning results after a few queries; you could try waiting before each query so that the API does not block your requests.

I hope it'll help you!
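
The splitting approach described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `fetch` is a hypothetical callable supplied by the caller, and the 5-second pause is an arbitrary guess, not a documented rate limit.

```python
import time
from datetime import date, timedelta
from typing import Callable, Iterator, List, Tuple


def daily_windows(start: date, end: date) -> Iterator[Tuple[str, str]]:
    """Yield (since, until) date strings, one pair per day in [start, end)."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield day.isoformat(), next_day.isoformat()
        day = next_day


def scrape_in_windows(
    fetch: Callable[[str, str], List],
    start: date,
    end: date,
    pause_seconds: float = 5.0,
) -> List:
    """Call fetch(since, until) once per day, pausing between queries.

    fetch is a hypothetical wrapper around the scraper. Assuming the
    GetOldTweets3 package, it might build a TweetCriteria with
    setQuerySearch(...), setSince(since), setUntil(until) and
    setMaxTweets(...), then pass it to TweetManager.getTweets.
    """
    results: List = []
    for since, until in daily_windows(start, end):
        results.extend(fetch(since, until))
        time.sleep(pause_seconds)  # wait so consecutive queries are spaced out
    return results
```

For example, `scrape_in_windows(fetch, date(2020, 1, 22), date(2020, 1, 29))` would issue seven one-day queries instead of a single one-week query.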

SRaina11 commented 4 years ago

Thank you for your response @julienbeisel. Do you know whether we can use the OR operator to find tweets matching multiple keywords? For example: tweetCriteria.setQuerySearch("Covid19" or "covid" or "corona")?

Cheers
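
One thing worth noting about the call in the question above: Python evaluates `"Covid19" or "covid" or "corona"` before the function is called, and `or` on non-empty strings simply returns the first one, so only "Covid19" would be passed along. A small sketch of an alternative is to put the operator inside a single query string. Twitter's search syntax supports an uppercase `OR` between terms; whether this library forwards it unchanged is an assumption here, and `or_query` is a hypothetical helper, not part of the library.

```python
from typing import Iterable


def or_query(keywords: Iterable[str]) -> str:
    """Join keywords into one search string using the OR operator."""
    return " OR ".join(keywords)


# Python's own `or` returns the first truthy operand, so the other
# keywords never reach the search function.
python_or = "Covid19" or "covid" or "corona"

# A single string containing the operator keeps all three keywords.
query = or_query(["Covid19", "covid", "corona"])
```

The resulting `query` string would then be the single argument passed to setQuerySearch.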

Jefferson-Henrique / GetOldTweets-python

how to scrape a large body of tweets (e.g. a million)? #261