MartinKBeck / TwitterScraper

Repository containing all files relevant to my basic and advanced tweet scraping articles.
197 stars, 117 forks

Scraping irrelevant tweets #7

Closed by RockinTheSuburbs 2 years ago

RockinTheSuburbs commented 3 years ago

Hi, for some reason I am getting a lot of irrelevant tweets when I run this code. I've set up the dataframe to show which keyword was used to scrape each tweet. I get a wall of relevant tweets from each user with the keyword listed, and then a whole bunch of irrelevant tweets for which the keyword column is blank. Can anybody tell me why?

# Imports

import snscrape.modules.twitter as sntwitter
import pandas as pd
# Query by text search
# Setting variables to be used below

maxTweets = 500

# Creating list to append tweet data to
tweets_list2 = []

# Creating lists from SearchWords and TwitterHandles txt files:

keywords_list = open("SearchWords.txt", mode='r', encoding='utf-8').read().splitlines()
users_list = open("TwitterHandles.txt", mode='r', encoding='utf-8').read().splitlines()

# Using TwitterSearchScraper to scrape data and append tweets to list

for user in users_list:
    for keyword in keywords_list:
        # Search this user's tweets for the keyword within the date range
        query = '{} from:{} since:2020-07-07 until:2021-07-07'.format(keyword, user)
        for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
            # Stop once maxTweets tweets have been collected for this query
            if i >= maxTweets:
                break
            tweets_list2.append([tweet.url, tweet.date, tweet.id, tweet.content, tweet.user.username, tweet.retweetedTweet, keyword])

# Creating a dataframe from the tweets list above
tweets_df2 = pd.DataFrame(tweets_list2, columns=['URL', 'Datetime', 'Tweet Id', 'Text', 'Username', 'Retweet', 'Keywords'])

# Display first 5 entries from dataframe
tweets_df2.head()

# Export dataframe into a CSV
tweets_df2.to_csv('text-query-tweets9.csv', sep=',', index=False)

RockinTheSuburbs commented 3 years ago

I realized what it's doing... scraping the relevant tweet by keyword, and then scraping every other tweet tweeted by the user during the dates I provide. Still not sure why it's doing that. I'm also now getting the error "snscrape.base.ScraperException: Unable to find guest token". Previously this error resolved itself; I don't know why or how.
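
A possible cause, not confirmed anywhere in this thread: if SearchWords.txt contains a blank line (for example an extra empty line at the end of the file), read().splitlines() yields an empty string as a keyword. The query then collapses to " from:<user> since:... until:...", which would match every tweet from that user in the date range and leave the keyword column blank. A minimal sketch of guarding against that, where read_nonempty_lines and build_queries are hypothetical helper names and not part of snscrape:

```python
def read_nonempty_lines(path):
    """Return the stripped, non-blank lines of a UTF-8 text file."""
    with open(path, mode='r', encoding='utf-8') as f:
        return [line.strip() for line in f if line.strip()]


def build_queries(users, keywords, since='2020-07-07', until='2021-07-07'):
    """Yield (user, keyword, query) for every user/keyword pair."""
    for user in users:
        for keyword in keywords:
            query = '{} from:{} since:{} until:{}'.format(keyword, user, since, until)
            yield user, keyword, query
```

With these helpers, keywords_list = read_nonempty_lines("SearchWords.txt") would never contain an empty keyword, so every query keeps its search term.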

MartinBeckUT commented 3 years ago

Yeah, that happens sometimes. snscrape's Python wrapper isn't the best. Looks like everything is fine now, though?
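
Since the "Unable to find guest token" error was reported above as having resolved itself before, one option is to retry the failing call after a short wait. This is only a sketch under the assumption that the error is transient, and it may not help if the cause is on Twitter's side. The retry helper and its parameters are hypothetical names, not part of snscrape:

```python
import time


def retry(func, exceptions=(Exception,), attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(); on a listed exception, wait and try again, doubling the delay."""
    for attempt in range(attempts):
        try:
            return func()
        except exceptions:
            # Give up and re-raise once the last attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

For example, the scraping call could be wrapped as retry(lambda: list(itertools.islice(scraper.get_items(), maxTweets)), exceptions=(snscrape.base.ScraperException,)), where scraper stands for the TwitterSearchScraper instance and itertools is imported. Note that each retry restarts that query from the beginning.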

MartinBeckUT commented 2 years ago

Issue looks resolved