Hold-Krykke / PythonExam

4th Semester Python Exam Project

WebScraping #7

Closed MrHardcode closed 4 years ago

MrHardcode commented 4 years ago

Web Scraper

The web scraper needs more functionality before it is complete.

Features

Bug fixes

MrHardcode commented 4 years ago

I want to make the scraper as fast as possible, so I'm avoiding Selenium. This creates some challenges, since Twitter uses infinite scrolling. I tried to capture the AJAX calls that Twitter makes whenever the user scrolls to the bottom of the page, but it isn't as simple as a single AJAX call. There are several calls and they are quite confusing to look at. There are a ton of both POST and GET requests, all going to this URL: https://api.twitter.com/1.1/jot/client_event.json
I can't find anything in the request body of the POST requests, which is even more confusing.

I tried a couple of things, including Postman, and came across the following message:

(image)

A version of Twitter without JavaScript sounds perfect! Such a version is bound to have regular pagination instead of infinite scrolling. At the bottom of each page there is a button containing a link to the next set of tweets:

(image)

Extracting the link from the button and fetching the next chunk of data should be easy enough and, most importantly, I think it can all be done using Beautiful Soup.
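A rough sketch of the link extraction, for discussion. The markup in `SAMPLE_PAGE`, the `more-button` class name, the `page` query parameter, and the `next_page_url` helper are all made up for illustration; the real legacy page will need its own selector. No network request is made here, only parsing.

```python
from typing import Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup: the class names and link format on the real
# legacy page are assumptions and will probably differ.
SAMPLE_PAGE = """
<html><body>
  <div class="tweet">first tweet</div>
  <div class="tweet">second tweet</div>
  <div class="more-button">
    <a href="/search?q=%23python&amp;page=2">Load older Tweets</a>
  </div>
</body></html>
"""


def next_page_url(html: str, base_url: str) -> Optional[str]:
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    # Find the first link inside the (assumed) "more" button container.
    link = soup.select_one("div.more-button a[href]")
    if link is None:
        return None
    # The href is relative, so resolve it against the page it came from.
    return urljoin(base_url, link["href"])
```

The scraper could call this in a loop, fetching each returned URL until it gets `None`.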

MrHardcode commented 4 years ago

The standard pagination in legacy Twitter loads 20 tweets at a time, and I can't find a way to change that, so I think we're stuck with a page size of 20.
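With a fixed page size, the number of requests for a given number of tweets is just a ceiling division. A small sketch; the `pages_needed` helper and `PAGE_SIZE` name are hypothetical, not something in the repo:

```python
import math

PAGE_SIZE = 20  # tweets per page in the legacy pagination, as noted above


def pages_needed(tweet_count: int, page_size: int = PAGE_SIZE) -> int:
    """Number of pages to fetch to collect at least tweet_count tweets."""
    if tweet_count <= 0:
        return 0
    # Round up: a partially needed page still costs one full request.
    return math.ceil(tweet_count / page_size)
```

So asking for 50 tweets means fetching 3 pages and discarding the last 10 tweets if an exact count matters.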

MrHardcode commented 4 years ago

Should the /tweets folder be git-ignored?

MrHardcode commented 4 years ago

I'm not sure the tweets are being saved with a proper structure. Right now all tweets are saved in the folder /tweets, in files named after the search hashtags. A search for "trump" leads to the path /tweets/trump, and a search with the two parameters "trump" and "biden" leads to the path /tweets/trump_biden, where /tweets is a folder and trump_biden is a file.

Is this a proper structure?
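For reference, the naming scheme described above could be expressed roughly like this. The `tweets_path` function and `TWEETS_DIR` constant are hypothetical names for illustration, and the sketch assumes the hashtags are joined in the order they were given:

```python
from pathlib import Path
from typing import Sequence

TWEETS_DIR = Path("tweets")  # assumed to be relative to the project root


def tweets_path(hashtags: Sequence[str]) -> Path:
    """Build the output file path for a search, e.g. tweets/trump_biden."""
    if not hashtags:
        raise ValueError("at least one hashtag is required")
    # Join the search terms with underscores to form the file name.
    return TWEETS_DIR / "_".join(hashtags)
```

One thing this makes visible: a search for "trump" and "biden" and a search for "biden" and "trump" end up in different files unless the hashtags are sorted first.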

MrHardcode commented 4 years ago

Web scraping is complete now (imo)