cubiclesoft / ultimate-web-scraper

A PHP library/toolkit designed to handle all of your web scraping needs under an MIT or LGPL license. Also has web server and WebSocket server classes for building custom servers.

How to interact with non-form elements #2

Closed emersonthis closed 9 years ago

emersonthis commented 9 years ago

I'm trying to scrape a page that has a reCAPTCHA: http://montreal.en.craigslist.ca/reply/mon/apa/5093291921

There's no image to identify, but the "checkbox" is not a real HTML checkbox input. Rather, it's a div with a callback attached to it. Is there a way to "click" that div with the scraper to reveal the subsequent information?

cubiclesoft commented 9 years ago

Oddly, the link supplied did not have reCAPTCHA on it as far as I can tell from one location but at a different location/browser I do see reCAPTCHA. To me, this means that there are rules in place behind the scenes that decide that reCAPTCHA should be shown to the user. Such things can generally be exploited.
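To compare what different locations or sessions receive, it helps to check a saved response for reCAPTCHA markup. The following is a minimal sketch in Python using only the standard library, not part of the toolkit. It assumes the page embeds the widget in the common way: a container with the `g-recaptcha` class, or a script or iframe loaded from a `/recaptcha/` path. Whether this particular page uses that markup is an assumption, and the names `RecaptchaDetector` and `has_recaptcha` are made up for illustration.

```python
from html.parser import HTMLParser


class RecaptchaDetector(HTMLParser):
    """Flags markup that typically accompanies an embedded reCAPTCHA widget."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: the widget container carries the "g-recaptcha" class.
        classes = (attrs.get("class") or "").split()
        # Assumption: the loader script or frame is served from a /recaptcha/ path.
        src = attrs.get("src") or ""
        if "g-recaptcha" in classes:
            self.found = True
        elif tag in ("script", "iframe") and "/recaptcha/" in src:
            self.found = True


def has_recaptcha(html: str) -> bool:
    """Return True if the HTML appears to embed a reCAPTCHA widget."""
    parser = RecaptchaDetector()
    parser.feed(html)
    parser.close()
    return parser.found
```

Running `has_recaptcha()` on the body returned for the same URL under different conditions (signed in or not, different network) shows which conditions trigger the CAPTCHA. It only inspects static HTML, so a widget injected later by script would not be detected.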

reCAPTCHA itself is HARD to get around. Even if you manage to click the "checkbox", it doesn't mean that reCAPTCHA won't do the whole CAPTCHA thing. The "I'm not a robot" checkbox is connected to Google Accounts - if you aren't signed into your Google Account, then you have to solve a CAPTCHA. Even if you are signed in, the checkbox is (probably) rate limited.

A different approach could be to find out the mechanism by which you get to see the content without reCAPTCHA showing in the first place. Maybe that involves signing into Craigslist or using a different IP address.

The Ultimate Web Scraper Toolkit covers about 95-98% of scraping needs. When heavy JavaScript enters the picture, however, things get hard fast. If you need full JavaScript and DOM rendering, then PhantomJS is your best bet. At least until someone gets the stupid idea to write a complete JavaScript parser and DOM manipulator in pure PHP. Well, that's not entirely accurate: someone did write a parser. However, it is large, slow, buggy, and uses regular expressions, which is not the correct way to write a language parser. Beating reCAPTCHA will also require OCR software and the success rate will be rather low.

Craigslist put reCAPTCHA there for a reason: To keep people from scraping those pages. However, based on my experience, there is a way that apparently bypasses it altogether. Figure out what that is and you can use the toolkit to get the data you are after.

emersonthis commented 9 years ago

Thanks for the thoughtful response!

(Email reply to CubicleSoft's comment above, dated Jul 1, 2015, 1:07 AM.)