Open sunnyseera opened 1 month ago
Hello! Thanks for reporting this. Yeah, the 403 is definitely a result of PSA using Cloudflare. The requests are being blocked because they're missing some required cookies in the request headers. When I navigate to a pop report page (I'm using https://www.psacard.com/pop/baseball-cards/2018/topps-update/161401 locally as an example), I can see the GetSetItems XHR call in the Network tab of Chrome DevTools. I can right-click that call, copy it as cURL (which includes all the required cookie headers AND a legit UA), run the curl command in a terminal, and get back the pop report JSON, just like the Python program used to be able to do.
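For anyone who wants to try the same thing from Python instead of a terminal, here is a rough sketch of replaying that call with the cookie string and UA copied out of DevTools. It only builds the request and does not send it. The POST method, the form encoding, and the `example_field` name are assumptions for illustration; the real method and payload should be taken from the copied cURL command.

```python
import urllib.request
from urllib.parse import urlencode

# Endpoint as it appears in the 403 error message in this thread.
POP_URL = "https://www.psacard.com/Pop/GetSetItems"


def build_pop_request(cookie_header, user_agent, form_fields):
    """Build (but do not send) a request carrying browser-copied headers.

    Assumption: the XHR is a form-encoded POST. Check the copied cURL
    command and adjust the method and body if it differs.
    """
    body = urlencode(form_fields).encode("utf-8")
    return urllib.request.Request(
        POP_URL,
        data=body,
        headers={
            "Cookie": cookie_header,      # pasted from the browser request
            "User-Agent": user_agent,     # same UA the browser sent
        },
        method="POST",
    )


# Placeholder values; real ones come from the "Copy as cURL" output.
req = build_pop_request(
    cookie_header="name1=value1; name2=value2",
    user_agent="Mozilla/5.0 (placeholder)",
    form_fields={"example_field": "value"},  # hypothetical field name
)
# To actually send it: urllib.request.urlopen(req).read()
```

Copied cookies expire, so this is a way to confirm the diagnosis rather than a lasting fix.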
This project used to use Selenium and a WebDriver to scrape PSA data. If I went back to that approach, I'm fairly certain this would work again, since the web driver should send all the headers needed to get past Cloudflare.
This is what the project/code looked like with Selenium.
I might at some point try to get that working again, but probably won't get to it for a while. I don't have the drive or desire to keep up with the PSA website changes.
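On the Selenium idea, one possible variation (not the project's old code, and untested against PSA) is to let the browser pick up the cookies and then hand them to a plain HTTP request. Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys; the helper below, whose name is made up for this sketch, turns such a list into a `Cookie` header value. Whether Cloudflare accepts those cookies outside the browser is an open question.

```python
def cookie_header_from_driver(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# With Selenium installed, the input would come from a live browser session:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://www.psacard.com/pop/baseball-cards/2018/topps-update/161401")
#   cookies = driver.get_cookies()

# Placeholder data in the same shape get_cookies() returns.
sample_cookies = [
    {"name": "name1", "value": "value1", "domain": ".psacard.com"},
    {"name": "name2", "value": "value2", "domain": ".psacard.com"},
]
header_value = cookie_header_from_driver(sample_cookies)
print(header_value)
```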
I spent some time today getting the Selenium + WebDriver solution working again on the pop report pages. However, pagination does not work at all in ChromeDriver. I can see the pagination controls at the bottom of the page, and I can point Selenium at those elements, but the page will not load anything past the first page. Even when I load the page in ChromeDriver and manually try to click through to page 2, I can interact with the pagination elements, but nothing beyond page 1 loads.
I went to download PhantomJS, which I used years ago for headless scraping, but that project was archived in 2018 :(
I don't know whether the broken pagination is intentional (Cloudflare detecting that it's a webdriver) or just a bug.
I'm walking away from this for now.
Hey man, if I try to run this locally I get the following:
So when I try the URL (I changed it to a Pokemon one), I first get a 403 Forbidden and then there's a json_data error.
Is there any way to resolve this? I have been trying to fix it locally but got nowhere.
I am running it using Python3
I am running it on Ubuntu 22.04.3 LTS
I can resolve the json_data error by doing the following:
Fix 1 of 2: Changing this:
To this:
Fix 2 of 2: Changing this:
To this:
If I put those changes in, I am still left with the 403 Error:
Error pulling data for https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801, with error: 403 Client Error: Forbidden for url: https://www.psacard.com/Pop/GetSetItems
Sorry for the long message but I wanted to give as much context as possible!
Hopefully you can help resolve this!
I think the 403 might come from Cloudflare blocking the request.
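A quick way to check that guess is to look at the response headers on the 403. Responses that pass through Cloudflare normally carry a `CF-RAY` header and `Server: cloudflare`. The helper below is only a sketch with a made-up name; it takes a status code and a header mapping, so it can be fed from whatever HTTP library is in use.

```python
def looks_like_cloudflare_block(status_code, headers):
    """Return True for a 403 whose headers show it came via Cloudflare."""
    lowered = {k.lower(): v for k, v in headers.items()}
    via_cloudflare = (
        lowered.get("server", "").lower() == "cloudflare"
        or "cf-ray" in lowered
    )
    return status_code == 403 and via_cloudflare


# Placeholder header sets for illustration, not captured from PSA.
blocked = looks_like_cloudflare_block(403, {"Server": "cloudflare", "CF-RAY": "placeholder"})
other_403 = looks_like_cloudflare_block(403, {"Server": "nginx"})
ok_response = looks_like_cloudflare_block(200, {"Server": "cloudflare"})
print(blocked, other_403, ok_response)
```

A True result only shows the 403 travelled through Cloudflare. It suggests, but does not prove, that Cloudflare itself rejected the request rather than the site behind it.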