UTMediaCAT / mediacat-docs

Repository with documentation

Twitter Crawler Code #5

Closed · kstapelfeldt closed this issue 3 years ago

kstapelfeldt commented 3 years ago

https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/twitter_crawler.py

Review for currency/approach. Re-write in contemporary Python, including tests and following coding standards.

Given a list of user handles in a .csv file, the crawler should grab all possible twarc output and store it in a file.
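
For reference, a minimal sketch of that requirement, assuming twarc v1's `Twarc.timeline(screen_name=...)` interface; the credential strings and the `handles.csv`/`tweets.jsonl` file names are placeholders:

```python
# Sketch only: read handles from a CSV and store each user's twarc output as JSON lines.
import csv
import json

from twarc import Twarc

# Placeholder credentials.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

with open("handles.csv", newline="") as f:
    # Assumes one handle per row in the first column.
    handles = [row[0].strip() for row in csv.reader(f) if row]

with open("tweets.jsonl", "w") as out:
    for handle in handles:
        # timeline() pages through as many of the user's tweets as the API allows.
        for tweet in t.timeline(screen_name=handle):
            out.write(json.dumps(tweet) + "\n")
```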

RaiyanRahman commented 3 years ago

Danhua will contact Changyu to learn more about the Twitter crawler as well as read through the previous threads for more information.

Relevant Trello: https://trello.com/c/e4qnjF0p/48-change-api-for-twitter-crawler-post-tweet-api-set-up

kstapelfeldt commented 3 years ago

https://github.com/twintproject/twint is the project that will meet all the stated requirements for crawling all tweets over time. However, a recent change has caused the code to break. The current code maintainer is likely to update this as many people use the project. The developer has an active branch that Danhua is testing.
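
For illustration, a minimal TWINT sketch for pulling a user's full timeline to CSV; the handle and output file name are placeholders, and behaviour depends on which version/branch of twint is installed (per the breakage noted above):

```python
# Sketch only: scrape a user's tweets with twint (no official API) and store them as CSV.
import twint

c = twint.Config()
c.Username = "example_handle"    # handle to crawl (placeholder)
c.Store_csv = True               # write results as CSV
c.Output = "example_handle.csv"  # output file (placeholder)
twint.run.Search(c)
```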

kstapelfeldt commented 3 years ago

We will wait a week and see if there is action on the broken portion of the TWINT stack and then test it for its functionality in our code.

kstapelfeldt commented 3 years ago

This should be working soon based on Danhua's interaction with the developer.

kstapelfeldt commented 3 years ago

Danhua will look into Puppeteer as a solution for crawling Twitter. May need to log in as a user.

kstapelfeldt commented 3 years ago

The owner of getoldtwitter has a working codebase and has sent a video to Danhua. Tonight he will teach Danhua how to use it, and she will clarify how to integrate it into the project. The codebase is Python 3.

kstapelfeldt commented 3 years ago

Remaining encoding issue - a small formatting issue still to be dealt with. snscrape is the library we are using: https://github.com/JustAnotherArchivist/snscrape
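
A minimal snscrape sketch, assuming its Python module interface (`snscrape.modules.twitter.TwitterUserScraper`); the handle and output file are placeholders, and tweet attribute names vary across snscrape versions:

```python
# Sketch only: scrape one user's tweets with snscrape and write them to a UTF-8 CSV.
import csv

import snscrape.modules.twitter as sntwitter

with open("example_handle.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "date", "content", "url"])
    for tweet in sntwitter.TwitterUserScraper("example_handle").get_items():
        writer.writerow([tweet.id, tweet.date.isoformat(), tweet.content, tweet.url])
```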

kstapelfeldt commented 3 years ago

Remaining issue: the current crawler is getting IP blocked - somewhere between 4,000 and 10,000 requests will trigger the block. Nat's suggestions are: 1) develop a timing mechanism so that each request has a delay (for example, make one request and wait 30 seconds before making the second), and 2) limit requests per day. If this doesn't work, we will have to consider some kind of IP switching.
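
A sketch of the suggested timing mechanism: a fixed delay between requests plus a daily cap. The numbers and the `fetch_user()` helper are hypothetical:

```python
# Sketch only: throttle the crawl with a fixed delay and a daily request cap.
import time

REQUEST_DELAY_SECONDS = 30   # wait between requests (suggested value)
MAX_REQUESTS_PER_DAY = 4000  # stay below the observed 4,000-10,000 block range

def crawl(handles, fetch_user):
    requests_today = 0
    for handle in handles:
        if requests_today >= MAX_REQUESTS_PER_DAY:
            break  # stop before the IP block is likely to trigger
        fetch_user(handle)  # one scrape request (hypothetical helper)
        requests_today += 1
        time.sleep(REQUEST_DELAY_SECONDS)  # delay before the next request
```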

kstapelfeldt commented 3 years ago

Ongoing IP issues, so we moved back to TWINT. It doesn't use the API but can return all tweets from a given user, and the output file is quite clean. Please push the code to the twitter crawler repo and make this repo private. Then this ticket can be closed.

kstapelfeldt commented 3 years ago

Danhua will take the Twitter .csv outputs and keep them (for us to discuss with Alejandro re: the extra data like 'hit counts'). She will then concatenate the output CSVs and reshape them with the following columns:
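
The specific column list is not captured in this thread; a minimal pandas sketch of the concatenate-and-reshape step, with hypothetical column names, might look like this:

```python
# Sketch only: concatenate the per-handle CSV outputs and keep a subset of columns.
import glob

import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("output/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Reshape to the agreed columns (placeholder names shown here).
columns = ["username", "date", "tweet", "url"]
combined[columns].to_csv("combined.csv", index=False)
```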

jacqueline-chan commented 3 years ago
kstapelfeldt commented 3 years ago

Remaining to do

  1. Add columns for the rest of the data that we get back from TWINT.
  2. Add in some handling for errors and interruptions on the crawl.

- We need a mechanism to make sure we are notified of the crawl's status or if things are interrupted.
- How do we add more resilience or handle interruptions?
- The key errors to handle are exit errors - we don't want the crawler to cut off and leave the crawl interrupted.
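
One possible approach (a sketch, not the project's implementation): catch exceptions per handle so a single failure doesn't kill the whole crawl, and trap exit signals so interruptions are logged rather than silent. The `fetch_user()` helper is hypothetical:

```python
# Sketch only: per-handle error handling plus exit-signal logging for the crawl.
import logging
import signal
import sys

logging.basicConfig(filename="crawl.log", level=logging.INFO)

def handle_exit(signum, frame):
    # Record the interruption instead of dying silently.
    logging.error("Crawl interrupted by signal %s", signum)
    sys.exit(1)

signal.signal(signal.SIGINT, handle_exit)
signal.signal(signal.SIGTERM, handle_exit)

def crawl(handles, fetch_user):
    for handle in handles:
        try:
            fetch_user(handle)  # hypothetical per-handle scrape
            logging.info("Finished %s", handle)
        except Exception:
            # One bad handle or connection error should not stop the whole crawl.
            logging.exception("Failed on %s, continuing", handle)
```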

Screenshot of relevant error attached


jacqueline-chan commented 3 years ago

Here is the full output of the exit signal. Note: this does not happen every time, only sometimes. output.txt

jacqueline-chan commented 3 years ago

This is what a connection error looks like; it may stop the whole crawl if an exit signal is received. Screenshot attached.

kstapelfeldt commented 3 years ago
jacqueline-chan commented 3 years ago

Potential bug?

Screenshot attached.