Bunsly / HomeHarvest

Python package for scraping real estate property data
https://tryhomeharvest.com/
MIT License

Zillow: 403 Forbidden #15

Closed. ddxv closed this issue 1 year ago.

ddxv commented 1 year ago

Python version: 3.10.11. HomeHarvest version tested: 0.2.13.

What I tried to do:

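import pandas as pd
from homeharvest import scrape_property

# Scrape rental listings for ZIP code 85281 from all three sites
properties: pd.DataFrame = scrape_property(
    site_name=["zillow", "realtor.com", "redfin"],
    location="85281",
    listing_type="for_rent",  # other options: for_sale / sold
)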

Output: requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.zillow.com/homes/for_rent/85281_rb/

Testing this URL in browser works OK.

cullenwatson commented 1 year ago

Try a proxy with the proxy= parameter. You can also use tryhomeharvest.com.
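
For readers unfamiliar with routing requests through a proxy, here is a minimal sketch of the general idea using only the Python standard library. It is not HomeHarvest's proxy= parameter, whose accepted format is not shown in this thread. The function name build_proxy_opener and the proxy address are hypothetical.

```python
import urllib.request


def build_proxy_opener(proxy_url: str) -> urllib.request.OpenerDirector:
    """Return an opener that routes HTTP and HTTPS traffic through proxy_url.

    Hypothetical helper for illustration; proxy_url is a placeholder such as
    "http://127.0.0.1:8080" and must point at a proxy you actually control.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(handler)
```

Calling opener.open(url) on the returned object would then send the request from the proxy's IP address instead of your own, which is why a proxy can help when a block is IP-based.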

ddxv commented 1 year ago

I see, I don't have a proxy on hand at the moment, but just curious, do you think changing the UA that homeharvest is using would help? What UA is it currently set to? I have only used the library once a couple days ago, so would be surprised if the blocking rule is based on IP only.

cullenwatson commented 1 year ago

The headers are located here at the bottom: https://github.com/ZacharyHampton/HomeHarvest/blob/master/homeharvest/core/scrapers/zillow/__init__.py

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36

It may be the cookie as well. When I made it, the requests wouldn't work if I didn't have the cookie. Any insights, @ZacharyHampton?
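
To make the header discussion concrete, here is a minimal sketch of attaching a browser-like User-Agent and an optional Cookie header to a request. It uses the standard library rather than the requests package seen in the traceback, and it is not the code in HomeHarvest's scraper. The name build_request and the cookie value are hypothetical. The User-Agent string is the one quoted above.

```python
import urllib.request

# User-Agent string quoted in this thread
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
)


def build_request(url, cookie=None):
    """Build (but do not send) a request with browser-like headers.

    Hypothetical helper: `cookie` is a raw Cookie header value, for example
    one copied from a browser session.
    """
    headers = {"User-Agent": BROWSER_UA}
    if cookie:
        headers["Cookie"] = cookie
    return urllib.request.Request(url, headers=headers)
```

A hardcoded cookie like this is exactly what can go stale, so the sketch is mainly useful for testing whether the User-Agent or the cookie is what triggers the 403.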

cullenwatson commented 1 year ago

It's definitely the cookie. I replaced it with a fresh one from my browser and now every request succeeds. The old cookie only worked on certain IPs. It may be time-based though, so I'm not sure this is a long-term solution.

cullenwatson commented 1 year ago

I believe we can fix this by fetching the cookies on an initial request and dynamically setting a fresh cookie every time for the backend endpoints.
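
The idea can be sketched roughly as follows, using a cookie jar from the standard library. This is an illustration under stated assumptions, not the fix that was actually merged: the function names are hypothetical, the URLs are supplied by the caller, and HomeHarvest itself appears to use the requests package rather than urllib.

```python
import urllib.request
from http.cookiejar import CookieJar


def build_cookie_opener(user_agent):
    """Return (opener, jar); the opener stores and resends cookies via the jar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar


def fetch_with_fresh_cookies(opener, landing_url, api_url, timeout=10):
    """Hit landing_url first to collect cookies, then call api_url with them.

    Both URLs are placeholders chosen by the caller. `opener` is any object
    with an open(url, timeout=...) method that returns a context manager.
    """
    # Initial request: any Set-Cookie headers in the response land in the jar
    with opener.open(landing_url, timeout=timeout) as resp:
        resp.read()
    # Backend request: the same opener sends the freshly stored cookies back
    with opener.open(api_url, timeout=timeout) as resp:
        return resp.read()
```

Because the cookies are collected at run time, nothing has to be hardcoded, which avoids the stale-cookie problem described above. Whether the site issues usable cookies to a non-browser client is something only a live test can confirm.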

ddxv commented 1 year ago

> I believe we can fix this by fetching the cookies on an initial request and dynamically setting a fresh cookie every time for the backend endpoints.

Ah, perfect, that is the only thing I would have suggested as well. Thanks also for the link above to __init__.py. I will check the PR too, to see where the fixes were made. Cheers

cullenwatson commented 1 year ago

It should be working now. You can upgrade with pip (pip install --upgrade homeharvest) to get the latest changes. Let me know.

ddxv commented 1 year ago

Yes, I pulled the newest changes and checked that last commit. I'm far from an expert here, but it looks good to me. I also tested it and it's working again. Thanks for the help!