WebScraping - Githubissues

MrHardcode commented 4 years ago

Web Scraper

The web scraper needs more functionality before it is complete.

Features

[x] Web scrape a given amount of tweets (interval of 20) given a single hashtag
[x] Get Tweets based on multiple hashtags
[x] Save Tweets in a file in a suitable folder
- [x] Organize search files based on hashtag (including if the search contains one or more hashtags)
[x] Create the option of making a fresh search or searching in files from earlier search
[ ] ~~Make it possible to get a specific amount of tweets (not just in the interval of 20)~~
[x] Save tweets as objects that include the raw tweet and an array of the URLs the tweet contained
[x] Scrape date of tweet
- [ ] ~~Optimize date extraction~~
[x] Optimize emoji extraction (use library instead of xlsx file with smaller amount of emojis)
[x] ~~Optimize web scraping using multiprocessing and/or multithreading~~ Has been optimized, though without multiprocessing or multithreading
[x] Add search parameters as a property to the tweet object
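
To make the tweet-object items above concrete, here is a minimal sketch of what such an object could look like. The class name `Tweet`, its field names, the `make_tweet` helper and the URL regex are all assumptions for illustration and are not taken from the project's code.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed pattern for http(s) links; the project may extract URLs differently.
URL_PATTERN = re.compile(r"https?://\S+")


@dataclass
class Tweet:
    """Hypothetical tweet object: raw text, contained URLs, date, search parameters."""
    raw_text: str
    urls: List[str] = field(default_factory=list)
    date: Optional[str] = None
    search_params: List[str] = field(default_factory=list)


def make_tweet(raw_text: str, date: Optional[str], hashtags: List[str]) -> Tweet:
    """Build a Tweet, collecting every URL found in the raw text."""
    return Tweet(
        raw_text=raw_text,
        urls=URL_PATTERN.findall(raw_text),
        date=date,
        search_params=list(hashtags),
    )


tweet = make_tweet("Vote now https://example.com/a #trump", "2020-05-01", ["trump"])
print(tweet.urls)
```

Keeping the search parameters on each object means a tweet loaded from an earlier search file still records which hashtags produced it.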

Bug fixes

[x] Get rid of the infamous b' in the text created by bs4
[x] Improve or fix the problem with emojis and encoding
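
One common way to end up with a stray `b'` in scraped text is calling `str()` on a `bytes` object instead of decoding it. Whether that was the cause here is an assumption; the sketch below only illustrates that pattern, and `to_text` is a hypothetical helper name.

```python
def to_text(value, encoding="utf-8"):
    """Return a str, decoding bytes instead of wrapping them in str()."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = "Stay safe 😷".encode("utf-8")  # bytes, e.g. the result of an .encode() call

broken = str(raw)     # keeps the b'...' wrapper and escapes the emoji bytes
fixed = to_text(raw)  # plain text with the emoji intact

print(broken.startswith("b'"))
print(fixed)
```

Decoding as UTF-8 also keeps emojis readable, provided the output file is opened with `encoding="utf-8"` as well.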

MrHardcode commented 4 years ago

I want to make the scraper as fast as possible, so I'm avoiding Selenium. This creates some challenges since Twitter uses infinite scrolling. I tried to capture the AJAX calls that Twitter makes whenever the user scrolls to the bottom of the page, but it's not as simple as a single AJAX call. There are several calls and they are quite confusing to look at. There are a tonne of both POST and GET requests, all going to this URL: https://api.twitter.com/1.1/jot/client_event.json. I can't find anything in the request body of the POST requests, which is even more confusing.

I tried a couple of things, including using Postman, and I came across the following message: A version of Twitter without JavaScript sounds perfect! Such a version is bound to have regular pagination instead of infinite scrolling. At the bottom of each page there is a button containing a link to the next set of tweets.

Extracting the link from the button and fetching the next chunk of data should be easy enough and most importantly I think it can all be done using Beautiful Soup.
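
A rough sketch of that pagination loop follows. To keep it dependency-free it uses the standard library's `html.parser` as a stand-in for Beautiful Soup, and the class name `next-page`, the helper names and the injected `fetch` callable are all placeholders, not Twitter's real markup or the project's code.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

NEXT_BUTTON_CLASS = "next-page"  # hypothetical class of the "next" button wrapper


class NextLinkParser(HTMLParser):
    """Remember the href of the first <a> inside the next-page <div>."""

    def __init__(self):
        super().__init__()
        self._inside_button = False
        self.next_href = None

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "div" and NEXT_BUTTON_CLASS in classes:
            self._inside_button = True
        elif tag == "a" and self._inside_button and self.next_href is None:
            self.next_href = attributes.get("href")

    def handle_endtag(self, tag):
        if tag == "div":
            self._inside_button = False


def find_next_url(page_html, base_url):
    """Return the absolute URL of the next page, or None on the last page."""
    parser = NextLinkParser()
    parser.feed(page_html)
    if parser.next_href is None:
        return None
    return urljoin(base_url, parser.next_href)


def fetch_pages(fetch, start_url, max_pages):
    """Follow next links; `fetch` is any callable mapping a URL to its HTML."""
    pages = []
    url = start_url
    while url is not None and len(pages) < max_pages:
        page_html = fetch(url)
        pages.append(page_html)
        url = find_next_url(page_html, url)
    return pages
```

Passing `fetch` in as a parameter keeps the loop testable without network access; in the real scraper it would wrap whatever HTTP client is used, and the link lookup would be a Beautiful Soup query instead of the small parser class.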

MrHardcode commented 4 years ago

The standard pagination in legacy Twitter loads 20 tweets at a time, and I can't find a way to change that, so I think we're stuck with the interval of 20.
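
In practice this means a requested amount maps to a whole number of pages. A small sketch of that arithmetic, assuming the request is rounded up to the next full page (the function names are hypothetical):

```python
import math

TWEETS_PER_PAGE = 20  # page size of the legacy pagination described above


def pages_needed(requested_tweets):
    """Number of 20-tweet pages required to cover the requested amount."""
    if requested_tweets <= 0:
        return 0
    return math.ceil(requested_tweets / TWEETS_PER_PAGE)


def tweets_returned(requested_tweets):
    """Tweets actually fetched when only whole pages can be loaded."""
    return pages_needed(requested_tweets) * TWEETS_PER_PAGE


print(pages_needed(45), tweets_returned(45))
```

A request for 45 tweets therefore needs 3 pages and yields 60 tweets.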

MrHardcode commented 4 years ago

Should the /tweets folder be git-ignored?

MrHardcode commented 4 years ago

I'm not sure if the tweets are saved with a proper structure. Right now all tweets are saved in the folder /tweets, in files named after their search hashtags. A search for "trump" leads to the path /tweets/trump, and a search with the two parameters "trump" and "biden" leads to the path /tweets/trump_biden, where /tweets is a folder and /trump_biden is a file.

Is this a proper structure?
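
For reference, the naming scheme described above could be sketched like this. The function name `search_file_path` and the stripping of a leading `#` are assumptions; only the `tweets` folder and the underscore-joined file name come from the description.

```python
from pathlib import Path

TWEETS_DIR = Path("tweets")


def search_file_path(hashtags, base_dir=TWEETS_DIR):
    """Join the search hashtags with underscores to form the file name."""
    cleaned = [tag.lstrip("#") for tag in hashtags]  # assumed: drop a leading '#'
    return base_dir / "_".join(cleaned)


print(search_file_path(["trump"]).as_posix())
print(search_file_path(["trump", "biden"]).as_posix())
```

With this scheme the order of the hashtags matters: "trump" plus "biden" and "biden" plus "trump" end up in different files unless the list is sorted first.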

MrHardcode commented 4 years ago

Web scraping is complete now (imo)

Hold-Krykke / PythonExam

WebScraping #7