Proteusiq / TrustPilotReader

Unofficial TrustPilot Review Collector. Academic Use Only
MIT License

The scraping no longer works. #2

Open ravindersaluja opened 4 years ago

ravindersaluja commented 4 years ago

@Proteusiq Scraping reviews no longer works. Calling t.get_reviews() returns an empty defaultdict: defaultdict(list, {}).

Proteusiq commented 4 years ago

I will fix it tomorrow. Possibly they changed their API.

Proteusiq commented 4 years ago

I tested a non-API route and it worked (coding on my iPhone), so I think the API has changed.

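import json

from bs4 import BeautifulSoup
from requests import Session

URL = 'https://dk.trustpilot.com/review/www.if.dk'

# Fetch the public review page instead of calling the API.
session = Session()
r = session.get(URL)

# Parse the HTML (the 'html5lib' parser must be installed separately).
soup = BeautifulSoup(r.text, 'html5lib')

# Take the first JSON-LD block embedded in the page.
data = soup.find('script', {'type': 'application/ld+json'})

print(json.loads(data.getText(strip=True)))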
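
The snippet above reads only the first JSON-LD block, and it raises an AttributeError if the page has no such tag. Below is a rough sketch of a more defensive variant that collects every JSON-LD block and returns an empty list when none is found. It swaps BeautifulSoup for the standard library's html.parser so that it runs without extra installs. The names JsonLdExtractor and extract_json_ld are made up for illustration and are not part of TrustPilotReader.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collect the parsed contents of every JSON-LD <script> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_json_ld = False
        self._chunks = []
        self.payloads = []

    def handle_starttag(self, tag, attrs):
        # Start buffering only for <script type="application/ld+json">.
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_json_ld = True
            self._chunks = []

    def handle_data(self, data):
        # Script text may arrive in several pieces, so accumulate it.
        if self._in_json_ld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_json_ld:
            self._in_json_ld = False
            text = "".join(self._chunks).strip()
            if text:
                self.payloads.append(json.loads(text))


def extract_json_ld(html):
    """Return a list of decoded JSON-LD payloads; empty if the page has none."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.payloads
```

With a response from the session above, extract_json_ld(r.text) would return a list that can be checked for emptiness before anything is indexed.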

ravindersaluja commented 4 years ago

@Proteusiq I thought so too, and tried scraping with requests and BeautifulSoup. But when I looped to get all the reviews and then checked the website manually in the browser, I found that it had detected suspicious behavior from my IP, and I had to verify that "I am a human".

Proteusiq commented 4 years ago

I see that I can restore it using BeautifulSoup with a sleep function. I will hold off on fixing it until I find out about the legality, since scraping with BeautifulSoup could overload their servers if this project is misused.
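
For reference, the sleep idea could look roughly like the sketch below. It is not the project's code: fetch_politely is a hypothetical helper, and the delay and jitter values are arbitrary placeholders rather than limits published by the site. The fetch function is passed in as an argument, so the loop can be tried without touching the network.

```python
import random
import time


def fetch_politely(urls, fetch, delay=5.0, jitter=2.0, sleep=time.sleep):
    """Fetch each URL in turn, pausing between consecutive requests.

    `fetch` is any callable that takes a URL and returns a response,
    for example the `get` method of a requests Session.
    `delay` and `jitter` are in seconds and are illustrative assumptions.
    """
    responses = []
    for index, url in enumerate(urls):
        if index > 0:
            # Pause before every request except the first one.
            sleep(delay + random.uniform(0, jitter))
        responses.append(fetch(url))
    return responses
```

Something like fetch_politely(page_urls, session.get) would then space the requests out, where page_urls is a list of review page URLs built by the caller. Whether any particular delay avoids the "I am a human" check is not something this thread establishes.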

ravindersaluja commented 4 years ago

@Proteusiq Did you try anything further on this?