ihabpalamino opened 1 year ago
I have the exact same issue: I can see Selenium searching through tweets for the specified period, but no data is returned. This is my code:
driver = init_driver(headless=False, show_images=False)
log_in(driver, env=".env")
data = scrape(words=['crypto', 'etheium', 'bitcoin'], hashtag='crypto', since="2023-02-01", until="2023-02-05", from_account=None, interval=1, headless=False, display_type=None, save_images=False, lang="en", resume=False, filter_replies=False, proximity=False, driver=driver)
print(data)
console:

looking for tweets between 2023-2-01 and 2023-02-02 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-02%20since%3A2023-2-01%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-02 and 2023-02-03 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-03%20since%3A2023-02-02%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8 scroll 9
looking for tweets between 2023-02-03 and 2023-02-04 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-04%20since%3A2023-02-03%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7
looking for tweets between 2023-02-04 and 2023-02-05 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-05%20since%3A2023-02-04%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7
looking for tweets between 2023-02-05 and 2023-02-06 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-06%20since%3A2023-02-05%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-06 and 2023-02-07 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-07%20since%3A2023-02-06%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7
looking for tweets between 2023-02-07 and 2023-02-08 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-08%20since%3A2023-02-07%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-08 and 2023-02-09 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-09%20since%3A2023-02-08%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4 scroll 5 scroll 6 scroll 7 scroll 8
looking for tweets between 2023-02-09 and 2023-02-10 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-10%20since%3A2023-02-09%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4
looking for tweets between 2023-02-10 and 2023-02-11 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-11%20since%3A2023-02-10%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-11 and 2023-02-12 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-12%20since%3A2023-02-11%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-12 and 2023-02-13 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-13%20since%3A2023-02-12%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-13 and 2023-02-14 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-14%20since%3A2023-02-13%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-14 and 2023-02-15 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-15%20since%3A2023-02-14%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-15 and 2023-02-16 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-16%20since%3A2023-02-15%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-16 and 2023-02-17 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-17%20since%3A2023-02-16%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-17 and 2023-02-18 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-18%20since%3A2023-02-17%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-18 and 2023-02-19 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-19%20since%3A2023-02-18%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3
looking for tweets between 2023-02-19 and 2023-02-20 ...
path : https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-20%20since%3A2023-02-19%20lang%3Aen&src=typed_query
scroll 1 scroll 2 scroll 3 scroll 4
Empty DataFrame
Columns: [UserScreenName, UserName, Timestamp, Text, Embedded_text, Emojis, Comments, Likes, Retweets, Image link, Tweet URL]
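As a side note, the URLs in the log are just Twitter's advanced-search syntax, URL-encoded. A minimal stdlib-only sketch of how such a URL can be built (search_url is a hypothetical helper for illustration, not part of Scweet):

```python
from urllib.parse import quote

def search_url(query, since, until, lang=None):
    """Build a Twitter advanced-search URL shaped like the ones in the log."""
    q = f"({query}) until:{until} since:{since}"
    if lang:
        q += f" lang:{lang}"
    # Keep the parentheses literal; '#', ':' and spaces get percent-encoded
    # exactly as they appear in the logged URLs.
    return "https://twitter.com/search?q=" + quote(q, safe="()") + "&src=typed_query"

print(search_url("#crypto", "2023-02-02", "2023-02-03", lang="en"))
# https://twitter.com/search?q=(%23crypto)%20until%3A2023-02-03%20since%3A2023-02-02%20lang%3Aen&src=typed_query
```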
I managed to find the problem behind this issue. First, go to the function get_data in 'Scweet\utils.py' and change all instances of find_element_by_xpath('...') to find_element('xpath', '...'), since the old form is no longer supported in recent versions of Selenium.
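Since every occurrence has to change, the mechanical rewrite can be scripted; a rough stdlib-only sketch (migrate_selenium_calls is a hypothetical helper, and the regex assumes the simple single-string call shape used in Scweet):

```python
import re

def migrate_selenium_calls(source: str) -> str:
    """Rewrite Selenium 3 find_element_by_xpath(...) calls into the
    Selenium 4 find_element('xpath', ...) form."""
    return re.sub(
        r"find_element(s?)_by_xpath\(",   # \1 also catches find_elements_by_xpath
        r"find_element\1('xpath', ",
        source,
    )

old = "text = card.find_element_by_xpath('.//div[2]').text"
print(migrate_selenium_calls(old))
# text = card.find_element('xpath', './/div[2]').text
```

Running this over the contents of utils.py (and eyeballing the diff) saves editing each call by hand.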
Second, check that every function that pulls an element out of the HTML actually returns something: if even a single element lookup fails, the whole tweet is discarded (for example, when Selenium can't find the tweet's username). To fix this, verify that each XPath is still correct. I'll give one example, but you should check all of them.
try:
    text = card.find_element('xpath', './/div[2]/div[2]/div[1]').text
except:
    text = ""

should actually be:

try:
    text = card.find_element('xpath', './/div[2]/div[2]/div[2]/div').text
except:
    text = ""
In my case I didn't use all of the tweet metadata; I only kept the fields I needed and checked that their XPaths are correct. Here is the final code of the get_data() method:
def get_data(card, save_images=False, save_dir=None):
    """Extract data from tweet card"""
    image_links = []
    try:
        username = card.find_element('xpath', './/span').text
    except:
        return
    try:
        handle = card.find_element('xpath', './/span[contains(text(), "@")]').text
    except:
        return
    try:
        postdate = card.find_element('xpath', './/time').get_attribute('datetime')
    except:
        return
    try:
        text = card.find_element('xpath', './/div[2]/div[2]/div[2]/div').text
    except:
        text = ""
    try:
        embedded = card.find_element('xpath', './/div[2]/div[2]/div[2]').text
    except:
        embedded = ""
    # tweet url
    try:
        element = card.find_element('xpath', './/div/div/div[2]/div[2]/div[1]/div/div[1]/div/div/div[2]/div/div[3]/a')
        tweet_url = element.get_attribute('href')
    except:
        return
    tweet = (username, handle, postdate, text, embedded, tweet_url)
    return tweet
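The repeated try/except blocks above could be collapsed into one small helper; a sketch (find_text is hypothetical, demonstrated here with a minimal stub rather than a live Selenium element, whose lookup failures the broad Exception handler would catch the same way):

```python
def find_text(card, xpath, default=None):
    """Return the .text of the first XPath match, or `default` on failure."""
    try:
        return card.find_element('xpath', xpath).text
    except Exception:
        return default

# Minimal stub standing in for a Selenium WebElement, for demonstration only.
class StubCard:
    class _El:
        text = "ihabpalamino"
    def find_element(self, by, value):
        if value == './/span':
            return self._El()
        raise LookupError(value)

card = StubCard()
print(find_text(card, './/span'))      # found: "ihabpalamino"
print(find_text(card, './/time', '')) # missing: falls back to ""
```

With it, each field becomes one line, e.g. `username = find_text(card, './/span')`, and the caller can skip the tweet when a required field comes back None.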
And does it work?
Thanks for your great work! But when I use your code, I found a problem. When the result is "reply to @XXXXX XXXXXXXXXX the embedded text XXXXXXX", I only get "reply to @XXXXX" and not the real embedded text. Could you help me?
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from Scweet.scweet import scrape

# Specify the parameters for scraping
username = "2MInteractive"
since_date = "2023-07-01"
until_date = "2023-07-11"
headless = True

# Set up the ChromeDriver service
service = Service("C:/Users/HP Probook/Downloads/chromedriver.exe")  # Replace with the actual path to chromedriver

# Set up the ChromeOptions
options = webdriver.ChromeOptions()
options.headless = headless

# Create the WebDriver
driver = webdriver.Chrome(service=service, options=options)

# Scrape the tweets by username
data = scrape(from_account=username, since=since_date, until=until_date, headless=headless, driver=driver)

# Print the scraped data
print(data)

# Close the WebDriver
driver.quit()
Getting empty data:

"C:\Users\HP Probook\PycharmProjects\firstproject\venv\Scripts\python.exe" "C:/Users/HP Probook/PycharmProjects/firstproject/TikTokScrap.py"
looking for tweets between 2023-07-01 and 2023-07-06 ...
path : https://twitter.com/search?q=(from%3A2MInteractive)%20until%3A2023-07-06%20since%3A2023-07-01%20&src=typed_query
scroll 1 scroll 2
looking for tweets between 2023-07-06 and 2023-07-11 ...
path : https://twitter.com/search?q=(from%3A2MInteractive)%20until%3A2023-07-11%20since%3A2023-07-06%20&src=typed_query
scroll 1 scroll 2
Empty DataFrame
Columns: [UserScreenName, UserName, Timestamp, Text, Embedded_text, Emojis, Comments, Likes, Retweets, Image link, Tweet URL]
Index: []
Process finished with exit code 0
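For what it's worth, the log above shows the since/until range being cut into fixed-size windows (5 days here). A rough stdlib-only reconstruction of that splitting (date_windows is an assumption inferred from the log output, not Scweet's actual code; the first log in this thread also ran past the requested until date, so this is only approximate):

```python
from datetime import date, timedelta

def date_windows(since, until, interval):
    """Yield (start, end) pairs of ISO dates, `interval` days apart,
    mirroring the 'looking for tweets between ...' lines in the log."""
    start = date.fromisoformat(since)
    stop = date.fromisoformat(until)
    while start < stop:
        end = min(start + timedelta(days=interval), stop)
        yield start.isoformat(), end.isoformat()
        start = end

for window in date_windows("2023-07-01", "2023-07-11", 5):
    print("looking for tweets between %s and %s ..." % window)
# looking for tweets between 2023-07-01 and 2023-07-06 ...
# looking for tweets between 2023-07-06 and 2023-07-11 ...
```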