cuilimeng / CoAID

Scripts to download/crawl the Twitter info #2

Open shaanchandra opened 4 years ago

shaanchandra commented 4 years ago

Hi, thank you for this important and timely dataset.

Could you share the scripts for downloading/crawling the relevant user information from Twitter (tweets/retweets, followers/following, etc.)?

I know you cannot distribute that data yourselves, but I assume you could provide scripts that let us crawl it ourselves.

ethan-w-roland commented 4 years ago

I found this script helpful:

import pandas as pd

fake = pd.read_csv("https://raw.githubusercontent.com/cuilimeng/CoAID/master/05-01-2020/ClaimFakeCOVID-19_tweets.csv")
real = pd.read_csv("https://raw.githubusercontent.com/cuilimeng/CoAID/master/05-01-2020/ClaimRealCOVID-19_tweets.csv")

fake["label"] = "fake"
real["label"] = "real"
df = pd.concat([fake, real], ignore_index=True)  # unique row labels, so df.at[i, ...] below updates exactly one row
df["text"] = "None"

import requests
from bs4 import BeautifulSoup

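for i, row in df.iterrows():
  tweet_id = str(row.tweet_id)  # avoid shadowing the built-in id()
  url = "https://mobile.twitter.com/Richx183/status/" + tweet_id
  try:
    response = requests.get(url, timeout=10)  # do not hang on a stalled request
  except requests.RequestException:
    continue  # leave text as "None"; the row is dropped after the loop
  soup = BeautifulSoup(response.content, "html.parser")
  for el in soup.find_all("div", attrs={"data-id": tweet_id}):
    text = ""
    for x in el.div.contents:
      x = str(x)
      if "class=" not in x:  # keep plain text; skip child elements that carry a class attribute
        text += x
    df.at[i, "text"] = text.strip()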

df = df.drop(df[df.text == "None"].index)  # drop unsuccessful queries

df.head()
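
The loop above sends one request per tweet with no pause and no retry, so a transient network error costs a row. One possible refinement, as a minimal sketch: wrap the per-tweet lookup in a helper that retries on errors and sleeps between calls. The names `fetch_texts` and `fetch_one` are hypothetical and not part of the CoAID repo. `fetch_one` stands for whatever function returns the text for a single tweet ID, for example the requests/BeautifulSoup logic above.

```python
import time
from typing import Callable, Dict, Iterable, Optional


def fetch_texts(
    tweet_ids: Iterable[int],
    fetch_one: Callable[[int], Optional[str]],
    delay: float = 1.0,
    retries: int = 2,
) -> Dict[int, Optional[str]]:
    """Call fetch_one for each tweet ID, retrying on errors and pausing between calls.

    fetch_one is a hypothetical callable supplied by the caller: it takes a
    tweet ID and returns the tweet text, or None if the tweet is unavailable.
    """
    results: Dict[int, Optional[str]] = {}
    for tweet_id in tweet_ids:
        text = None
        for attempt in range(retries + 1):
            try:
                text = fetch_one(tweet_id)
                break  # the call completed; do not retry, even if text is None
            except Exception:
                # Wait a little longer after each failed attempt.
                time.sleep(delay * (attempt + 1))
        results[tweet_id] = text  # stays None if every attempt raised
        time.sleep(delay)  # pause between tweets to keep the request rate low
    return results
```

Assuming the same `df` as above, the results could then be written back with `df["text"] = df["tweet_id"].map(results)`, followed by `df = df.dropna(subset=["text"])` to drop the unsuccessful queries.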