Closed by MrHardcode 4 years ago
I want to make the scraper as fast as possible, so I'm avoiding Selenium. This creates some challenges, since Twitter uses infinite scrolling. I tried to capture the AJAX calls that Twitter makes whenever the user scrolls to the bottom of the page, but it's not as simple as one AJAX call. There are several calls and they are quite confusing to look at. There are a ton of both POST and GET requests all going to https://api.twitter.com/1.1/jot/client_event.json, and I can't find anything in the request body of the POST requests, which is even more confusing.
I tried a couple of things, including Postman, and came across a message about a version of Twitter without JavaScript, which sounds perfect! Such a version is bound to have regular pagination instead of infinite scrolling. At the bottom of each page there is a button containing a link to the next set of tweets.
Extracting the link from the button and fetching the next chunk of data should be easy enough and most importantly I think it can all be done using Beautiful Soup.
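A rough sketch of what that extraction could look like with Beautiful Soup. The CSS selector `div.w-button-more a` and the function name `extract_next_url` are assumptions for illustration only; the real markup of the button has to be checked in the page source.

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumption: the "next page" button is a <div class="w-button-more">
# wrapping an <a>. Adjust the selector to match the actual markup.
NEXT_LINK_SELECTOR = "div.w-button-more a"


def extract_next_url(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None if there is no button."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one(NEXT_LINK_SELECTOR)
    if link is None or not link.get("href"):
        return None
    # The href is usually relative, so resolve it against the current page URL.
    return urljoin(base_url, link["href"])
```

Returning `None` when the button is missing gives the caller a natural stop condition for the last page.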
The standard pagination in legacy Twitter loads 20 tweets at a time and I can't find a way to change that so I think we're stuck with the interval of 20.
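Even with a fixed page size, the caller could still get an exact number of tweets by trimming the last page. A minimal sketch, assuming a hypothetical `fetch_page` callable that returns the tweets on a page plus the URL of the next page (or `None` on the last page):

```python
from itertools import islice
from typing import Callable, Iterator, List, Optional, Tuple

# Hypothetical fetcher: takes a URL, returns (tweets_on_page, next_url_or_None).
PageFetcher = Callable[[str], Tuple[List[str], Optional[str]]]


def iter_tweets(start_url: str, fetch_page: PageFetcher) -> Iterator[str]:
    """Yield tweets one by one, following the next-page link lazily."""
    url: Optional[str] = start_url
    while url is not None:
        tweets, url = fetch_page(url)
        yield from tweets


def take_tweets(start_url: str, fetch_page: PageFetcher, limit: int) -> List[str]:
    """Return at most `limit` tweets, fetching only as many pages as needed."""
    return list(islice(iter_tweets(start_url, fetch_page), limit))
```

Because the generator is lazy, asking for 25 tweets fetches two pages of 20 and discards the surplus instead of fetching everything up front.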
Should the /tweets folder be git-ignored?
I'm not sure if the tweets are saved with a proper structure. Right now all tweets are saved in files named after their search hashtags in the folder /tweets, so a search for "trump" leads to the path /tweets/trump, and a search with the two parameters "trump" and "biden" leads to the path /tweets/trump_biden, where /tweets is a folder and /trump_biden is a file.
Is this a proper structure?
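For reference, the naming scheme described above could be expressed roughly like this. The names `TWEETS_DIR` and `tweets_path` are made up for this sketch, and the folder is assumed to be relative to the project root:

```python
from pathlib import Path
from typing import Iterable

# Assumption: the tweets folder lives at the project root.
TWEETS_DIR = Path("tweets")


def tweets_path(search_terms: Iterable[str], root: Path = TWEETS_DIR) -> Path:
    """Build the output file path by joining the search terms with underscores."""
    return root / "_".join(search_terms)


print(tweets_path(["trump"]).as_posix())
print(tweets_path(["trump", "biden"]).as_posix())
```

One thing to keep in mind with this scheme: the order of the search terms changes the file name, so "trump", "biden" and "biden", "trump" end up in different files unless the terms are sorted first.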
Web scraping is complete now (imo)
Web Scraper
The web scraper needs more functionality before it is complete.
Features
- Make it possible to get a specific amount of tweets (not just in the interval of 20)
- Optimize date extraction
- Optimize web scraping using multiprocessing and/or multithreading (has been optimized, though not using multiprocessing or multithreading)
Bug fixes
- b' in the text created by bs4
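One common cause of a stray b' in saved text is converting a `bytes` object with `str()` instead of decoding it. Whether that is what happens here is an assumption, since the scraper code isn't shown in this thread. A small sketch of the difference, with a hypothetical helper `to_text`:

```python
def to_text(value: object, encoding: str = "utf-8") -> str:
    """Return clean text, decoding bytes instead of taking their repr."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


encoded = "hello world".encode("utf-8")

# str() on bytes gives the repr, including the b' prefix and closing quote.
print(str(encoded))

# Decoding gives the original text back.
print(to_text(encoded))
```

If the text is being encoded somewhere before it is written to a file, either drop the `.encode(...)` call and open the file in text mode, or keep the bytes and open the file in binary mode.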