Jefferson-Henrique / GetOldTweets-python

A project written in Python to get old tweets; it bypasses some limitations of the official Twitter API.
MIT License
1.35k stars, 808 forks

how to scrape a large body of tweets (e.g. a million)? #261

Open aliiabbasi opened 4 years ago

aliiabbasi commented 4 years ago

I wanted to know what exactly the criteria are for the extracted top tweets. Is it the number of likes or the number of retweets? Also, is it possible to extract a large body of tweets (e.g. a million tweets) using this library? I tried it for almost 800K tweets, but it neither responded nor raised an error!

julienbeisel commented 4 years ago

hey @aliiabbasi

I am currently trying to scrape tweets containing the word "coronavirus" for sentiment analysis purposes, and I can explain my workaround; maybe it could give you some ideas :)

I tried to download every tweet since January, but it didn't work well... 24h wasn't enough. Splitting the request by day is probably the right way to do it, but I don't want to wait approximately 2 days (the full request for one day took 1h!). So what I am doing now is scraping 1000 "top tweets" for each day.

What I'm observing is:

Example: [image] [image]

I'm still trying to find ways to analyze covid-19 related tweets though. If you find better ways to scrape a lot of tweets, don't hesitate to comment.

anshumanchak commented 4 years ago

How are the topTweets criteria calculated? Is it a direct sort by retweets, or a combination of retweets, favourites and replies? @aliiabbasi

victorperezc commented 4 years ago

@anshumanchak The topTweets criteria are determined by Twitter. You can have Twitter show you the tweets it considers most relevant; these are normally the ones with high engagement on the network.

anshumanchak commented 4 years ago

Thanks @Victorpc98

AshtonCoop commented 4 years ago

I tried to do something similar. For 1000 tweets, it took me 2 min 35 seconds, so imagine 1 million tweets... Two ways that I can think of:

  1. use multiprocessing
  2. setTopTweets to limit results
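
To put the timing above in perspective, here is a rough back-of-the-envelope sketch. It assumes the run time grows linearly with the number of tweets and divides evenly across parallel workers; both are optimistic assumptions, and the function name `estimated_hours` is made up for illustration.

```python
def estimated_hours(total_tweets, batch_size=1000, batch_seconds=155, workers=1):
    """Estimate wall-clock hours for a scrape.

    Assumes linear scaling from the measured batch (1000 tweets in
    2 min 35 s = 155 s) and perfect speed-up across workers.
    """
    batches = total_tweets / batch_size
    return batches * batch_seconds / workers / 3600


print(round(estimated_hours(1_000_000), 1))             # single process
print(round(estimated_hours(1_000_000, workers=4), 1))  # four parallel workers
```

Under these assumptions a single process needs roughly 43 hours for a million tweets, which is why splitting the work or limiting the results matters.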

SRaina11 commented 4 years ago

Hi @julienbeisel

I'm currently trying to scrape tweets for sentiment analysis. I'm also using setNear("County in Ireland") and setQuerySearch("Covid19" or "covid" or "corona" or ......), but I'm not getting many tweets. Is there a way to scrape more tweets? Thanks

julienbeisel commented 4 years ago

Hi @SRaina11

When I did this project I had the same issue. If I remember correctly, you can query tweets over shorter time periods and get more tweets that way. For example, instead of getting the tweets from the last 6 months in one query, you can split it into 6x4 queries, one per week, or into 6x4x7 queries, one per day. Sometimes it didn't work well and stopped after some queries; you can maybe wait before each query so the API does not block your requests.

I hope this helps!
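
The splitting idea could be sketched roughly like this. It is only an illustration: `daily_windows`, `scrape_in_windows` and the `fetch` callable are hypothetical names, and `fetch(since, until)` stands for whatever function you write around the library's since/until criteria. The pause length is a guess, not a known safe value.

```python
import time
from datetime import date, timedelta


def daily_windows(start, end):
    """Yield (since, until) ISO date strings, one pair per day from start up to end."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        yield day.isoformat(), next_day.isoformat()
        day = next_day


def scrape_in_windows(fetch, start, end, pause_seconds=5.0):
    """Call fetch(since, until) once per day and pause between calls."""
    results = []
    for since, until in daily_windows(start, end):
        results.extend(fetch(since, until))
        time.sleep(pause_seconds)  # wait so the queries are less likely to be cut off
    return results


# Stand-in fetch for demonstration only; a real one would run the actual query.
def fake_fetch(since, until):
    return ["tweets from " + since]


print(scrape_in_windows(fake_fetch, date(2020, 1, 1), date(2020, 1, 4), pause_seconds=0))
```

Saving each day's results to disk as soon as they arrive would also mean that a query that stops halfway does not cost you the earlier days.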

SRaina11 commented 4 years ago

Thank you for your response @julienbeisel. Would you know if we can use the OR operator to find tweets matching multiple keywords? E.g.: tweetCriteria.setQuerySearch("Covid19" or "covid" or "corona")?

Cheers
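
One thing that can be checked without the library: in Python, an expression like "Covid19" or "covid" or "corona" is evaluated before the function is called and yields only the first non-empty string, so the other keywords never reach the query. The sketch below shows this, plus one way to build a single query string instead. Twitter's search syntax documents an uppercase OR operator; whether this library passes it through unchanged is an assumption, not something verified here.

```python
keywords = ["Covid19", "covid", "corona"]

# Python's `or` returns the first truthy operand, so this is just one string.
collapsed = "Covid19" or "covid" or "corona"
print(collapsed)  # Covid19

# Joining the keywords into one string keeps all of them in the query text.
query = " OR ".join(keywords)
print(query)  # Covid19 OR covid OR corona
```

The resulting string would then be passed as the single argument to setQuerySearch.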