ChrisMuir / Zillow

Zillow Scraper for Python using Selenium

CAPTCHA does not go away #9

Closed GiulioGiorcelli closed 6 years ago

GiulioGiorcelli commented 6 years ago

Hi there!

I'm using your software for a personal project, and when Zillow throws up a CAPTCHA it takes a really long time and dozens of iterations to get rid of it. I complete the CAPTCHA and the page just loads a new one. This goes on for about 10 to 15 minutes no matter how many times I do it. Do you know why this is happening? Is there a workaround for this issue?

Thanks, Giulio

wwetzel commented 6 years ago

Hi Giulio,

I was using this code to access Zillow for a while and would run into a similar issue. As ChrisMuir points out - scraping is against Zillow's ToS, so they are throwing a CAPTCHA to prevent bots like this one from scraping content. I haven't tried to defeat a CAPTCHA yet - the whole point is to not be beatable by bots.

You could try using multiple computers, for example by spinning up a bunch of Linux virtual machines. Basically, you look suspicious because of how much searching you're doing and the way the bot interacts with the web page, which is not very human. I don't know how Zillow tracks this, but some googling would give you an idea.

Easy solutions:

  1. You can try manually monitoring the machine and interceding when a CAPTCHA appears - manually click around for a while and the site will figure out you are a person. You'd probably have to add code to track how far the bot got in its search before getting stuck.

  2. Use multiple computers and/or IP addresses to try to fool Zillow.
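For the progress tracking mentioned in option 1, here is a minimal sketch, not code from this repo. It assumes the search is driven by a list of terms (zip codes, for example) and that a caller-supplied function does the actual work for one term. The names `run_searches`, `scrape_one`, and `progress.json` are all hypothetical.

```python
import json
import os
import tempfile


def load_checkpoint(path):
    """Return the set of search terms already completed (empty if no file)."""
    if not os.path.exists(path):
        return set()
    with open(path, "r", encoding="utf-8") as f:
        return set(json.load(f))


def save_checkpoint(path, done):
    """Write the completed terms to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(sorted(done), f)
    os.replace(tmp, path)  # atomic rename, so a crash cannot leave a half-written file


def run_searches(terms, scrape_one, path="progress.json"):
    """Call scrape_one(term) for every term not yet recorded in the checkpoint.

    scrape_one is a hypothetical callable supplied by the caller; if it raises
    (for example because the bot got stuck), terms finished so far stay saved.
    """
    done = load_checkpoint(path)
    for term in terms:
        if term in done:
            continue  # finished in an earlier run
        scrape_one(term)
        done.add(term)
        save_checkpoint(path, done)
    return done
```

Re-running `run_searches` with the same term list after an interruption skips everything already recorded and picks up at the first unfinished term.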

ChrisMuir commented 6 years ago

Hi @GiulioGiorcelli, I don't have any good answers for you on this. I honestly haven't had much interest in this project/repo for a while now, so when I added the CAPTCHA code I didn't test it much. I think I recall what you described happening to me once, and I didn't investigate it at the time. For me, almost all of the instances of CAPTCHA were easy to handle manually (code pauses, I beat the CAPTCHA once, it goes away, code resumes).

The short answer is that once the CAPTCHA appears, it's out of my hands. I have no interest in developing the CAPTCHA code beyond what it currently is, which is simply to pause code execution indefinitely until the CAPTCHA has been manually handled.
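As a rough illustration of that pause-until-handled behavior (a sketch, not the repo's actual implementation): the loop below blocks for as long as a caller-supplied check reports that a CAPTCHA is on the page. The function name `wait_for_manual_captcha` and the `captcha_present` callable are hypothetical; how the check inspects the browser is left to the caller.

```python
import time


def wait_for_manual_captcha(captcha_present, poll_seconds=5.0, sleep=time.sleep):
    """Block until captcha_present() returns False.

    captcha_present is a hypothetical zero-argument callable that returns True
    while a CAPTCHA is on the page. There is no timeout: the loop waits
    indefinitely for a human to clear it. Returns the number of polls that
    found the CAPTCHA still present.
    """
    polls = 0
    while captcha_present():
        polls += 1
        sleep(poll_seconds)  # give the human time before checking again
    return polls
```

The `sleep` parameter exists only so the wait can be swapped out when testing; in normal use the default `time.sleep` applies.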

Hi @wwetzel, thanks for jumping in with your input and info!