Altimis / Scweet

A simple and unlimited Twitter scraper: scrape tweets, likes, retweets, following, followers, user info, images...
MIT License

getting empty data #169

Open ihabpalamino opened 1 year ago

ihabpalamino commented 1 year ago

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from Scweet.scweet import scrape

# Specify the parameters for scraping
username = "2MInteractive"
since_date = "2023-07-01"
until_date = "2023-07-11"
headless = True

# Set up the ChromeDriver service
service = Service("C:/Users/HP Probook/Downloads/chromedriver.exe")  # Replace with the actual path to chromedriver

# Set up the ChromeOptions
options = webdriver.ChromeOptions()
options.headless = headless

# Create the WebDriver
driver = webdriver.Chrome(service=service, options=options)

# Scrape the tweets by username
data = scrape(from_account=username, since=since_date, until=until_date, headless=headless, driver=driver)

# Print the scraped data
print(data)

# Close the WebDriver
driver.quit()

getting empty data "C:\Users\HP Probook\PycharmProjects\firstproject\venv\Scripts\python.exe" "C:/Users/HP Probook/PycharmProjects/firstproject/TikTokScrap.py" looking for tweets between 2023-07-01 and 2023-07-06 ... path : https://twitter.com/search?q=(from%3A2MInteractive)%20until%3A2023-07-06%20since%3A2023-07-01%20&src=typed_query scroll 1 scroll 2 looking for tweets between 2023-07-06 and 2023-07-11 ... path : https://twitter.com/search?q=(from%3A2MInteractive)%20until%3A2023-07-11%20since%3A2023-07-06%20&src=typed_query scroll 1 scroll 2 Empty DataFrame Columns: [UserScreenName, UserName, Timestamp, Text, Embedded_text, Emojis, Comments, Likes, Retweets, Image link, Tweet URL] Index: []

Process finished with exit code 0
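A side note on the setup above: recent Selenium 4 releases deprecate the `options.headless` setter, so on current versions the equivalent setup (a sketch, assuming Chrome 109+, which introduced the new headless mode) would pass the flag explicitly:

options = webdriver.ChromeOptions()
# Replaces the deprecated `options.headless = True`; use plain "--headless"
# on Chrome builds older than 109
options.add_argument("--headless=new")
driver = webdriver.Chrome(service=service, options=options)

This doesn't by itself explain the empty DataFrame, but it avoids deprecation warnings on newer installs.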

baqachadil commented 1 year ago

I have the exact same issue. I can see Selenium searching through tweets for the specified period, but no data is returned. This is my code:

driver = init_driver(headless=False, show_images=False)

log_in(driver, env=".env")

data = scrape(words=['crypto', 'etheium', 'bitcoin'], hashtag='crypto',
              since="2023-02-01", until="2023-02-05", from_account=None,
              interval=1, headless=False, display_type=None, save_images=False,
              lang="en", resume=False, filter_replies=False, proximity=False,
              driver=driver)

print(data)

console:

looking for tweets between 2023-2-01 and 2023-02-02 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-02%20since%3A2023-2-01%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-02 and 2023-02-03 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-03%20since%3A2023-02-02%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8 scroll 9
looking for tweets between 2023-02-03 and 2023-02-04 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-04%20since%3A2023-02-03%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7
looking for tweets between 2023-02-04 and 2023-02-05 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-05%20since%3A2023-02-04%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7
looking for tweets between 2023-02-05 and 2023-02-06 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-06%20since%3A2023-02-05%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-06 and 2023-02-07 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-07%20since%3A2023-02-06%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7
looking for tweets between 2023-02-07 and 2023-02-08 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-08%20since%3A2023-02-07%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-08 and 2023-02-09 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-09%20since%3A2023-02-08%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-09 and 2023-02-10 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-10%20since%3A2023-02-09%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4
looking for tweets between 2023-02-10 and 2023-02-11 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-11%20since%3A2023-02-10%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-11 and 2023-02-12 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-12%20since%3A2023-02-11%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-12 and 2023-02-13 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-13%20since%3A2023-02-12%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-13 and 2023-02-14 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-14%20since%3A2023-02-13%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-14 and 2023-02-15 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-15%20since%3A2023-02-14%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-15 and 2023-02-16 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-16%20since%3A2023-02-15%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-16 and 2023-02-17 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-17%20since%3A2023-02-16%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-17 and 2023-02-18 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-18%20since%3A2023-02-17%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-18 and 2023-02-19 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-19%20since%3A2023-02-18%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-19 and 2023-02-20 ...
 path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-20%20since%3A2023-02-19%20lang%3Aen&src=typed_query
 scroll 1 scroll 2 scroll 3 scroll 4
Empty DataFrame
Columns: [UserScreenName, UserName, Timestamp, Text, Embedded_text, Emojis, Comments, Likes, Retweets, Image link, Tweet URL]

baqachadil commented 1 year ago

I managed to find the problem behind this issue. First, go to the get_data function in 'Scweet\utils.py' and change all instances of

find_element_by_xpath('...') to find_element('xpath', '...'), as the former is no longer supported in the latest versions of Selenium.

Second, check that every function that returns an element from the HTML actually returns something. It appears that if even one element is null, the whole tweet is treated as null (for example, if Selenium couldn't find the username of the tweet). To do this you have to verify that every XPath is correct. I will give one example, but you should check all of them.

try:
        text = card.find_element('xpath','.//div[2]/div[2]/div[1]').text
except:
        text = ""

should actually be:

try:
        text = card.find_element('xpath','.//div[2]/div[2]/div[2]/div').text
except:
        text = ""

In my case I didn't use all of the tweet metadata; I only kept the fields I needed and checked that their XPaths were correct. Here is the final code of the get_data() method:

def get_data(card, save_images=False, save_dir=None):
    """Extract data from tweet card"""
    image_links = []

    # user display name; if missing, the whole card is skipped
    try:
        username = card.find_element('xpath', './/span').text
    except:
        return

    # @handle
    try:
        handle = card.find_element('xpath', './/span[contains(text(), "@")]').text
    except:
        return

    # post date
    try:
        postdate = card.find_element('xpath', './/time').get_attribute('datetime')
    except:
        return

    # tweet text (optional: falls back to an empty string)
    try:
        text = card.find_element('xpath', './/div[2]/div[2]/div[2]/div').text
    except:
        text = ""

    # embedded text (optional)
    try:
        embedded = card.find_element('xpath', './/div[2]/div[2]/div[2]').text
    except:
        embedded = ""

    # tweet url
    try:
        element = card.find_element('xpath', './/div/div/div[2]/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/a')
        tweet_url = element.get_attribute('href')
    except:
        return

    tweet = (username, handle, postdate, text, embedded, tweet_url)
    return tweet
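
To cut down on the repeated try/except blocks, a small helper can centralize the fallback logic. This is only a sketch (safe_text is a hypothetical name, not part of Scweet):

def safe_text(card, xpath, default=""):
    """Return the text of the first element matching `xpath` on the card,
    or `default` when the element is missing, instead of failing the card."""
    try:
        return card.find_element('xpath', xpath).text
    except:
        return default

# usage: text = safe_text(card, './/div[2]/div[2]/div[2]/div')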
ihabpalamino commented 1 year ago

And does it work?

wdj1995 commented 1 year ago


Thanks for your great work! But when I used your code, I found a problem.

When the result is "reply to @XXXXX XXXXXXXXXX the embedded text XXXXXXX", I only get "reply to @XXXXX" and cannot get the real embedded text. Could you help me?