ChrisMuir / psa-scrape

PSA Cards Web Scraper

403 Forbidden and JSON Data Error #10

Open sunnyseera opened 1 month ago

sunnyseera commented 1 month ago

Hey man, if I try to run this locally I get the following:

collecting data for https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801

Error pulling data for https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801, with error: 403 Client Error: Forbidden for url: https://www.psacard.com/Pop/GetSetItems

Traceback (most recent call last):
  File "/home/USERNAME/psa-scrape/pop_report/original_to_github.py", line 112, in <module>
    ppr.scrape()
  File "/home/USERNAME/psa-scrape/pop_report/original_to_github.py", line 43, in scrape
    cards = json_data["data"]
UnboundLocalError: local variable 'json_data' referenced before assignment

So when I try the URL (I changed it to a Pokemon one), first I get the 403 Forbidden, and then there's a json_data error.

Is there any way to resolve this? I have been trying to fix it locally but have gotten nowhere.

I am running it with Python 3 on Ubuntu 22.04.3 LTS.

I can resolve the json_data error by doing the following:

Fix 1 of 2: Changing this:

    try:
        json_data = self.post_to_url(sess, form_data)
    except Exception as err:
        print("Error pulling data for {}, with error: {}".format(self.set_name, err))
    cards = json_data["data"]  # This line causes UnboundLocalError if the try block fails

To this:

    try:
        json_data = self.post_to_url(sess, form_data)
    except Exception as err:
        print("Error pulling data for {}, with error: {}".format(self.set_name, err))
        return  # Early exit if an error occurs

    # Ensure json_data is valid before proceeding
    if not json_data or "data" not in json_data:
        print("No valid data found for set: {}".format(self.set_name))
        return  # Exit if there's no data

    cards = json_data["data"]  # Now safe to access since we checked for validity

Fix 2 of 2: Changing this:

    json_data = self.post_to_url(sess, form_data)
    cards += json_data["data"]  # Assumes json_data is valid

To this:

    try:
        json_data = self.post_to_url(sess, form_data)
        if not json_data or "data" not in json_data:
            print("No valid data found for additional page: {}".format(curr_page))
            break  # Exit loop if there's no more data
        cards += json_data["data"]
    except Exception as err:
        print("Error pulling additional data for set {}, page {}: {}".format(self.set_name, curr_page, err))
        break  # Exit loop on error

If I put those changes in, I am still left with the 403 error:

Error pulling data for https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801, with error: 403 Client Error: Forbidden for url: https://www.psacard.com/Pop/GetSetItems

Sorry for the long message but I wanted to give as much context as possible!

Hopefully you can help resolve this!

I think the 403 might come from Cloudflare blocking the request.
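
For context, spoofing browser-like headers on the requests session is the kind of thing I tried with no luck. A minimal sketch (the header values are just examples copied from my browser, and form_data stands in for the payload the scraper already builds):

    import requests

    sess = requests.Session()
    # Example header values only; copy real ones from your own browser.
    sess.headers.update({
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Accept": "application/json, text/plain, */*",
        "Referer": "https://www.psacard.com/pop/tcg-cards/1999/pokemon-game/57801",
    })
    form_data = {}  # the same form payload the scraper already builds (omitted here)

    resp = sess.post("https://www.psacard.com/Pop/GetSetItems", data=form_data)
    print(resp.status_code)  # still 403 for me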

ChrisMuir commented 1 month ago

Hello! Thanks for reporting this. Yeah, the 403 is definitely a result of PSA using Cloudflare. The requests are being blocked because they're missing some required cookies in the request headers. When I navigate to a pop report page (I'm using https://www.psacard.com/pop/baseball-cards/2018/topps-update/161401 locally as an example), I can see the GetSetItems XHR call in the Network tab of Chrome's DevTools. I can right-click that call, copy it as cURL (which contains all the required cookie headers AND has a legit UA), run the curl command in a terminal, and get back the pop report JSON, just like the Python program used to be able to do.
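
Translated to Python, that replay looks roughly like this. Just a sketch: the UA string, cookie string, and form fields are placeholders you'd paste from your own copied cURL command, since the Cloudflare cookies are per-session and expire:

    import requests

    # Placeholders: paste the real values from your own "Copy as cURL" output.
    # The Cloudflare clearance cookies are tied to your browser session and
    # expire, so this only works while those cookies are still valid.
    headers = {
        "User-Agent": "<UA string from the copied curl command>",
        "Cookie": "<cookie header from the copied curl command>",
    }
    form_data = {}  # the form fields from the curl command's --data flag

    resp = requests.post(
        "https://www.psacard.com/Pop/GetSetItems",
        headers=headers,
        data=form_data,
    )
    resp.raise_for_status()
    print(len(resp.json()["data"]))  # number of pop report rows returned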

This project used to use Selenium and a WebDriver to scrape PSA data; if I were to go back to using that, I'm fairly certain this would work again. With that approach, the web driver should have all the required headers to get around Cloudflare.

This is what the project/code looked like with Selenium.
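
If anyone wants to experiment in the meantime, the Selenium route would look roughly like this. This is a sketch using the current Selenium 4 API rather than the exact code the project had, and the table selector is a guess at the pop report markup:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    # A real browser session picks up the Cloudflare cookies automatically,
    # which is why this route worked where plain requests now gets a 403.
    driver = webdriver.Chrome()
    driver.get("https://www.psacard.com/pop/baseball-cards/2018/topps-update/161401")

    # Wait for the pop report table to render, then read the rows straight
    # off the page instead of replaying the XHR call. The selector is a
    # guess; adjust it to the actual pop report table markup.
    WebDriverWait(driver, 15).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "table tbody tr"))
    )
    for row in driver.find_elements(By.CSS_SELECTOR, "table tbody tr"):
        print(row.text)

    driver.quit()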

I might at some point try to get that working again, but probably won't get to it for a while. I don't have the drive or desire to keep up with the PSA website changes.

ChrisMuir commented 1 month ago

I spent some time today getting the Selenium + WebDriver solution working again on the pop report pages. However, the pagination is completely broken in ChromeDriver. I can see the pagination at the bottom of the page, and I can point Selenium at those page elements, but the page will not load anything past the first page. Even when I load the page in ChromeDriver and manually try to click through to page 2, I can interact with the pagination elements, but nothing beyond page 1 will load.
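
For reference, this is roughly what I was attempting (a sketch; the By.LINK_TEXT locator and the table selector are guesses at the pager markup):

    import time

    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get("https://www.psacard.com/pop/baseball-cards/2018/topps-update/161401")
    time.sleep(5)  # crude wait for the first page of the table to render

    # Click the "2" link in the pager, then give the table time to reload.
    # The click registers, but the table never advances past page 1,
    # whether Selenium clicks it or I click manually in the window.
    driver.find_element(By.LINK_TEXT, "2").click()
    time.sleep(5)

    first_row = driver.find_element(By.CSS_SELECTOR, "table tbody tr")
    print(first_row.text)  # still shows a page-1 row for me

    driver.quit()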

I went to download PhantomJS, which I used years ago for headless driver scraping, but that project was archived in 2018 :(

I don't know if the broken pagination is intentional (Cloudflare detecting that it's a WebDriver) or just a bug.

I'm walking away from this for now.