kstapelfeldt closed this issue 3 years ago.
Danhua will contact Changyu to learn more about the Twitter crawler as well as read through the previous threads for more information.
Relevant Trello: https://trello.com/c/e4qnjF0p/48-change-api-for-twitter-crawler-post-tweet-api-set-up
https://github.com/twintproject/twint is the project that will meet all the stated requirements for crawling all tweets over time. However, a recent change has caused the code to break. The current code maintainer is likely to update this as many people use the project. The developer has an active branch that Danhua is testing.
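For reference, a minimal sketch of how we would invoke TWINT once the fix lands (the handle and output file name here are placeholders, not final values):

```python
import twint

# Placeholder handle and output file name for illustration only
c = twint.Config()
c.Username = "some_handle"
c.Store_csv = True
c.Output = "some_handle_tweets.csv"

# Run a full search over the user's tweets and write them to the CSV above
twint.run.Search(c)
```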
We will wait a week and see if there is action on the broken portion of the TWINT stack and then test it for its functionality in our code.
This should be working soon based on Danhua's interaction with the developer.
Danhua will look into Puppeteer as a solution for crawling Twitter. This may require logging in as a user.
The owner of getoldtwitter has a working codebase and has sent a video to Danhua. Tonight he will teach Danhua how to use it, and she will work out how to integrate it into the project. The codebase is Python 3.
Remaining encoding issue: a small formatting problem still to be dealt with. snscrape is the library we are using: https://github.com/JustAnotherArchivist/snscrape
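For context, a minimal snscrape sketch of the per-handle scrape we are doing (the handle and CSV columns are placeholders, and the tweet field names can differ slightly between snscrape versions). Writing with an explicit UTF-8 encoding may also help with the remaining encoding issue:

```python
import csv
import snscrape.modules.twitter as sntwitter

handle = "some_handle"  # placeholder handle

# Write tweets with an explicit UTF-8 encoding; the columns here are placeholders
with open(f"{handle}_tweets.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "id", "content", "url"])
    for tweet in sntwitter.TwitterUserScraper(handle).get_items():
        writer.writerow([tweet.date.isoformat(), tweet.id, tweet.content, tweet.url])
```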
Remaining issue: the current crawler gets IP blocked - somewhere between 4,000 and 10,000 requests will trigger the block. Nat's suggestions are: (1) add a timing delay so that each request waits before the next - for example, make one request and wait 30 seconds before making the second; (2) if that doesn't work, consider some kind of IP switching; (3) limit requests per day.
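A rough sketch of the timing-delay idea, assuming a 30-second delay and a daily cap (both numbers are placeholders to tune, and `fetch_tweets` stands in for whatever scraping call we end up using):

```python
import time

REQUEST_DELAY_SECONDS = 30     # delay between requests, per the suggestion above
MAX_REQUESTS_PER_DAY = 4000    # assumed daily cap, below the observed block threshold

def crawl_with_delay(handles, fetch_tweets):
    """Call fetch_tweets(handle) for each handle, pausing between requests
    and stopping once the daily cap is reached."""
    requests_made = 0
    for handle in handles:
        if requests_made >= MAX_REQUESTS_PER_DAY:
            print("Daily request limit reached; stopping for today.")
            break
        fetch_tweets(handle)
        requests_made += 1
        time.sleep(REQUEST_DELAY_SECONDS)  # wait before the next request
```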
Ongoing IP issues, so we moved back to TWINT. It doesn't use the Twitter API but can return all tweets from a given user, and the output file is quite clean. Please push the code to the twitter crawler repo and make that repo private. Then this ticket can be closed.
Danhua will take twitter .csv outputs and keep them (for us to discuss with Alejandro re: the extra data like 'hit counts'). She will then concatenate the output CSVs and reshape with the following columns:
--> We need a mechanism to make sure we are notified of the status of the crawl and of any interruptions.
--> How do we add more resilience or handle interruptions?
--> The key errors to handle are exit errors: we don't want the crawler to exit partway through and leave the crawl interrupted (see the sketch below).
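A rough sketch of the kind of retry wrapper that could keep a single connection error or exit signal from killing the whole crawl (`fetch_tweets` is a stand-in for the actual scraping call, and the retry count and back-off are placeholders):

```python
import logging
import time

logging.basicConfig(filename="crawl.log", level=logging.INFO)

def crawl_handle_with_retries(handle, fetch_tweets, max_retries=3):
    """Retry a single handle on errors instead of letting the whole crawl exit."""
    for attempt in range(1, max_retries + 1):
        try:
            fetch_tweets(handle)
            logging.info("Finished %s", handle)
            return True
        except Exception as exc:  # e.g. the connection errors / exit signals described above
            logging.warning("Attempt %d for %s failed: %s", attempt, handle, exc)
            time.sleep(60 * attempt)  # back off before retrying
    logging.error("Giving up on %s after %d attempts", handle, max_retries)
    return False
```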
Screenshot of relevant error attached
Here is the full output of the exit signal (attached as output.txt). Note: this does not happen every time, only sometimes.
This is what a connection error looks like; it may stop the whole crawl if an exit signal is received.
Potential bug?
https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/twitter_crawler.py
Review for currency and approach. Rewrite in contemporary Python, including tests, and follow coding standards.
Given a list of user handles in a .csv, the crawler should grab all possible twarc output and store it in a file.
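A minimal sketch of that behaviour using the twarc 1.x client (the credentials and file names are placeholders; note the standard API timeline only returns roughly the most recent 3,200 tweets per user):

```python
import csv
import json
from twarc import Twarc

# Placeholder credentials; real keys would come from config or environment variables
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# Placeholder input/output file names
with open("handles.csv", newline="") as f:
    handles = [row[0].lstrip("@") for row in csv.reader(f) if row]

with open("tweets.jsonl", "w", encoding="utf-8") as out:
    for handle in handles:
        # timeline() yields whatever the standard API will return for this user
        for tweet in t.timeline(screen_name=handle):
            out.write(json.dumps(tweet) + "\n")
```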