Closed emersonthis closed 9 years ago
Oddly, the link supplied did not have reCAPTCHA on it as far as I can tell from one location but at a different location/browser I do see reCAPTCHA. To me, this means that there are rules in place behind the scenes that decide that reCAPTCHA should be shown to the user. Such things can generally be exploited.
reCAPTCHA itself is HARD to get around. Even if you manage to click the "checkbox", it doesn't mean that reCAPTCHA won't do the whole CAPTCHA thing. The "I'm not a robot" checkbox is connected to Google Accounts - if you aren't signed into your Google Account, then you have to solve a CAPTCHA. Even if you are signed in, the checkbox is (probably) rate limited.
A different approach could be to find out the mechanism by which you get to see the content without reCAPTCHA showing in the first place. Maybe that involves signing into Craigslist or using a different IP address.
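To work out which conditions trigger the reCAPTCHA, it helps to check each fetched page for the widget automatically. Below is a minimal sketch, not part of the toolkit: `has_recaptcha` is a hypothetical helper, and it assumes the page uses the standard reCAPTCHA embed (a `g-recaptcha` element or a script loaded from `google.com/recaptcha/`). A custom integration could slip past it.

```python
# Substrings that normally appear in pages embedding the standard
# reCAPTCHA widget. Assumption: the site uses the stock embed code.
RECAPTCHA_MARKERS = ("g-recaptcha", "google.com/recaptcha/")


def has_recaptcha(html: str) -> bool:
    """Return True if the HTML appears to embed a reCAPTCHA widget."""
    lowered = html.lower()
    return any(marker in lowered for marker in RECAPTCHA_MARKERS)
```

Run this over the same URL fetched under different conditions (signed in or not, different IP address, different User-Agent) and note which responses come back without the widget.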
The Ultimate Web Scraper Toolkit covers about 95-98% of scraping needs. When heavy JavaScript enters the picture, however, things get hard fast. If you need full JavaScript and DOM rendering, then PhantomJS is your best bet, at least until someone gets the stupid idea to write a complete JavaScript parser and DOM manipulator in pure PHP. Well, that's not entirely accurate: someone did write a parser. However, it is large, slow, buggy, and uses regular expressions - not the correct way to write a language parser. Beating reCAPTCHA will also require OCR software and the success rate will be rather low.
Craigslist put reCAPTCHA there for a reason: to keep people from scraping those pages. However, based on my experience, there appears to be a way to bypass it altogether. Figure out what that is and you can use the toolkit to get the data you are after.
Thanks for the thoughtful response!
(In reply to CubicleSoft's comment of Jul 1, 2015, 1:07 AM.)
I'm trying to scrape a page that has a reCAPTCHA: http://montreal.en.craigslist.ca/reply/mon/apa/5093291921
There's no image to identify, but the "checkbox" is not a real HTML checkbox input. Rather, it's a div with a callback attached to it. Is there a way to "click" that div with the scraper to reveal the subsequent information?