UTMediaCAT / mediacat-docs

Repository with documentation

Twitter Crawler Code #5

Closed · kstapelfeldt closed this issue 3 years ago

kstapelfeldt commented 3 years ago

https://github.com/UTMediaCAT/Voyage/blob/master-conversion/src/twitter_crawler.py

Review for currency/approach. Re-write in contemporary Python, including tests and following coding standards.

Given a list of user handles in a .csv file, the crawler should grab all possible twarc output and store it in a file.
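
For reference, a minimal sketch of that requirement, assuming twarc v1's `Twarc.timeline(screen_name=...)` interface; the credential strings and the `handles.csv`/`tweets.jsonl` file names are placeholders:

```python
# Sketch only: read handles from a CSV and store each user's twarc output as JSON lines.
import csv
import json

from twarc import Twarc

# Placeholder credentials.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

with open("handles.csv", newline="") as f:
    # Assumes one handle per row in the first column.
    handles = [row[0].strip() for row in csv.reader(f) if row]

with open("tweets.jsonl", "w") as out:
    for handle in handles:
        # timeline() pages through as many of the user's tweets as the API allows.
        for tweet in t.timeline(screen_name=handle):
            out.write(json.dumps(tweet) + "\n")
```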

RaiyanRahman commented 3 years ago

Danhua will contact Changyu to learn more about the Twitter crawler as well as read through the previous threads for more information.

Relevant Trello: https://trello.com/c/e4qnjF0p/48-change-api-for-twitter-crawler-post-tweet-api-set-up

kstapelfeldt commented 3 years ago

https://github.com/twintproject/twint is the project that will meet all the stated requirements for crawling all tweets over time. However, a recent change has caused the code to break. The current code maintainer is likely to update this as many people use the project. The developer has an active branch that Danhua is testing.
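
For illustration, a minimal TWINT sketch for pulling a user's full timeline to CSV; the handle and output file name are placeholders, and behaviour depends on which version/branch of twint is installed (per the breakage noted above):

```python
# Sketch only: scrape a user's tweets with twint (no official API) and store them as CSV.
import twint

c = twint.Config()
c.Username = "example_handle"    # handle to crawl (placeholder)
c.Store_csv = True               # write results as CSV
c.Output = "example_handle.csv"  # output file (placeholder)
twint.run.Search(c)
```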

kstapelfeldt commented 3 years ago

We will wait a week and see if there is action on the broken portion of the TWINT stack and then test it for its functionality in our code.

kstapelfeldt commented 3 years ago

This should be working soon based on Danhua's interaction with the developer.

kstapelfeldt commented 3 years ago

Danhua will look into Puppeteer as a solution for crawling Twitter. May need to log in as a user.

kstapelfeldt commented 3 years ago

The owner of getoldtwitter has a working codebase and has sent a video to Danhua. Tonight he will teach Danhua how to use it, and she will clarify how to integrate it into the project. The codebase is Python 3.

kstapelfeldt commented 3 years ago

Remaining encoding issue - a small formatting issue still to be dealt with. snscrape is the library we are using: https://github.com/JustAnotherArchivist/snscrape
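
A minimal snscrape sketch, assuming its Python module interface (`snscrape.modules.twitter.TwitterUserScraper`); the handle and output file are placeholders, and tweet attribute names vary across snscrape versions:

```python
# Sketch only: scrape one user's tweets with snscrape and write them to a UTF-8 CSV.
import csv

import snscrape.modules.twitter as sntwitter

with open("example_handle.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "date", "content", "url"])
    for tweet in sntwitter.TwitterUserScraper("example_handle").get_items():
        writer.writerow([tweet.id, tweet.date.isoformat(), tweet.content, tweet.url])
```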

kstapelfeldt commented 3 years ago

Remaining issue: the current crawler is getting IP blocked - somewhere between 4,000 and 10,000 requests will trigger the block. Nat's suggestions are: 1) develop a timing mechanism so that each request has a delay (for example, make one request and wait 30 seconds before making the second), and 2) limit requests per day. If this doesn't work, we will have to consider some kind of IP switching.
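
A sketch of the suggested timing mechanism: a fixed delay between requests plus a daily cap. The numbers and the `fetch_user()` helper are hypothetical:

```python
# Sketch only: throttle the crawl with a fixed delay and a daily request cap.
import time

REQUEST_DELAY_SECONDS = 30   # wait between requests (suggested value)
MAX_REQUESTS_PER_DAY = 4000  # stay below the observed 4,000-10,000 block range

def crawl(handles, fetch_user):
    requests_today = 0
    for handle in handles:
        if requests_today >= MAX_REQUESTS_PER_DAY:
            break  # stop before the IP block is likely to trigger
        fetch_user(handle)  # one scrape request (hypothetical helper)
        requests_today += 1
        time.sleep(REQUEST_DELAY_SECONDS)  # delay before the next request
```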

kstapelfeldt commented 3 years ago

Ongoing IP issues, so we moved back to TWINT. It doesn't use the API but can return all tweets from a given user, and the output file is quite clean. Please push the code to the twitter crawler repo and make this repo private. Then this ticket can be closed.

kstapelfeldt commented 3 years ago

Danhua will take the Twitter .csv outputs and keep them (for us to discuss with Alejandro re: the extra data like 'hit counts'). She will then concatenate the output CSVs and reshape them with the following columns:
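
The specific column list is not captured in this thread; a minimal pandas sketch of the concatenate-and-reshape step, with hypothetical column names, might look like this:

```python
# Sketch only: concatenate the per-handle CSV outputs and keep a subset of columns.
import glob

import pandas as pd

frames = [pd.read_csv(path) for path in glob.glob("output/*.csv")]
combined = pd.concat(frames, ignore_index=True)

# Reshape to the agreed columns (placeholder names shown here).
columns = ["username", "date", "tweet", "url"]
combined[columns].to_csv("combined.csv", index=False)
```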

jacqueline-chan commented 3 years ago
kstapelfeldt commented 3 years ago

Remaining to do

  1. Add columns for the rest of the data that we get back from TWINT.
  2. Add in some handling for errors and interruptions on the crawl.

- We need a mechanism to make sure we are notified of the crawl's status or if things are interrupted.
- How do we add more resilience or handle interruptions?
- The key errors to handle are exit errors - we don't want the crawler to cut off and leave the crawl interrupted.
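
One possible approach (a sketch, not the project's implementation): catch exceptions per handle so a single failure doesn't kill the whole crawl, and trap exit signals so interruptions are logged rather than silent. The `fetch_user()` helper is hypothetical:

```python
# Sketch only: per-handle error handling plus exit-signal logging for the crawl.
import logging
import signal
import sys

logging.basicConfig(filename="crawl.log", level=logging.INFO)

def handle_exit(signum, frame):
    # Record the interruption instead of dying silently.
    logging.error("Crawl interrupted by signal %s", signum)
    sys.exit(1)

signal.signal(signal.SIGINT, handle_exit)
signal.signal(signal.SIGTERM, handle_exit)

def crawl(handles, fetch_user):
    for handle in handles:
        try:
            fetch_user(handle)  # hypothetical per-handle scrape
            logging.info("Finished %s", handle)
        except Exception:
            # One bad handle or connection error should not stop the whole crawl.
            logging.exception("Failed on %s, continuing", handle)
```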

Screenshot of relevant error attached


jacqueline-chan commented 3 years ago

Here is the full output of the exit signal. Note: this does not happen every time, only sometimes. output.txt

jacqueline-chan commented 3 years ago

This is what a connection error looks like; it may stop the whole crawl if an exit signal is received. Screenshot attached.

kstapelfeldt commented 3 years ago
jacqueline-chan commented 3 years ago

Potential bug?

Screenshot attached.