furas / python-examples

Python examples from my answers on Stackoverflow and other short scripts.
https://blog.furas.pl
MIT License

Tripadvisor reviewers scraping total num of reviews #9

Open karmyras opened 2 years ago

karmyras commented 2 years ago

Hi, can I ask how I can find the total number of reviews on each Tripadvisor profile? I have around 20k profiles, but I do not know how to get the total number of reviews for each of them.

e.g.
https://www.tripadvisor.com/Profile/davideL8413AD
https://www.tripadvisor.com/Profile/PhilB2846
https://www.tripadvisor.com/Profile/Spockiwocki
https://www.tripadvisor.com/Profile/Peterkel
https://www.tripadvisor.com/Profile/SMCP1992
https://www.tripadvisor.com/Profile/yes2luvtravel

result:
davideL8413AD: 12
PhilB2846: 33
Spockiwocki: 8

etc

Thank you in advance,
Kostas

furas commented 2 years ago

Hi,

The page uses JavaScript to add the reviews, so you need a real browser to see them.

Using Selenium, you can open the profile page with ?tab=reviews to see all the reviews.

But you may also need to click the Show more button, because the page shows only the first 20 reviews.

And at the start you need to click the I Accept button to accept cookies.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service
#from selenium.webdriver.chrome.service import Service
#from webdriver_manager.chrome import ChromeDriverManager
from webdriver_manager.firefox import GeckoDriverManager
import time

url = 'https://www.tripadvisor.com/Profile/yes2luvtravel?tab=reviews'

# newer Selenium versions take the driver path via `Service` instead of `executable_path`
#driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver = webdriver.Firefox(service=Service(GeckoDriverManager().install()))

driver.get(url)

time.sleep(3)

# accept cookies
buttons = driver.find_elements(By.XPATH, '//button[@id="onetrust-accept-btn-handler"]')
if buttons:
    print('click Accept')
    buttons[0].click()

# click `Show More`  (few times)
while True:
    time.sleep(3)
    buttons = driver.find_elements(By.XPATH, '//div[@id="content"]//button')
    if not buttons:
        break
    print('click Show More')
    buttons[0].click()

# count all reviews
all_items = driver.find_elements(By.XPATH, '//div[@id="content"]//div[contains(@class, "section")]')
print('len(all_items):', len(all_items))
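
For around 20k profiles, the same steps would have to run once per profile. Below is a minimal sketch of the loop around them. The names `reviews_url`, `profile_name`, `count_all` and `count_reviews` are made up for this example: `count_reviews` stands for the Selenium steps above wrapped in a function that takes a URL and returns the number of reviews. Here it is replaced with a dummy so the sketch runs without a browser.

```python
from urllib.parse import urlsplit, urlunsplit


def reviews_url(profile_url):
    """Return the profile URL with its query set to `tab=reviews`."""
    parts = urlsplit(profile_url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, 'tab=reviews', ''))


def profile_name(profile_url):
    """Return the last path segment, e.g. 'PhilB2846'."""
    return urlsplit(profile_url).path.rstrip('/').split('/')[-1]


def count_all(profile_urls, count_reviews):
    """Return a dict: profile name -> number of reviews (None on error).

    `count_reviews` is a hypothetical callable (URL -> int),
    e.g. the Selenium code above wrapped in a function.
    """
    results = {}
    for profile_url in profile_urls:
        name = profile_name(profile_url)
        try:
            results[name] = count_reviews(reviews_url(profile_url))
        except Exception as ex:
            # keep going when a single profile fails
            print('error:', name, ex)
            results[name] = None
    return results


if __name__ == '__main__':
    urls = [
        'https://www.tripadvisor.com/Profile/davideL8413AD',
        'https://www.tripadvisor.com/Profile/PhilB2846',
    ]
    # dummy counter (always 0) in place of the real Selenium function
    results = count_all(urls, lambda url: 0)
    for name, count in results.items():
        print(f'{name}: {count}')
```

With the real function it would probably be better to create the driver once and reuse it for all profiles, because starting a browser for every profile is slow.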