adamlwgriffiths / amazon_scraper

Provides content not accessible through the standard Amazon API
Other
235 stars 60 forks source link

Add captcha detection #26

Open adamlwgriffiths opened 8 years ago

adamlwgriffiths commented 8 years ago

Detect when amazon shoots a captcha at us and raise an appropriate error instead of letting our soup code fail with None dereferences.

See this issue for more information on the format of the captcha and result of it being sent: https://github.com/adamlwgriffiths/amazon_scraper/issues/25

benperove commented 8 years ago

I've been running into the same problem with another app that I'm building, which is kind of big problem.

I've added captcha detection, and instead of just throwing an error, I've added an interface to deathbycaptcha for getting the captcha solved in a relatively short amount of time (15-20 seconds). All of this seems to work pretty good in my initial testing.

Hopefully I'll be able to make a pull request within the next few days.

benperove commented 8 years ago

This has been tested a fair amount with success and is ready to go.

@adamlwgriffiths Can I have your permission to create a new branch?

adamlwgriffiths commented 8 years ago

Awesome! I'll try and get some time to take a look. Hopefully within 24 hours.

Why a new branch? If it works we can just merge to master and up the major or minor version - depending on if there's any function changes / new exceptions.

On 01/07/2016, at 10:21 PM, Benjamin notifications@github.com wrote:

This has been tested a fair amount with success and is ready to go.

@adamlwgriffiths Can I have your permission to create a new branch?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

adamlwgriffiths commented 8 years ago

Ah sorry I assumed code was already included. I'm currently mobile so not really able to interact fully.

I guess it would be a major version since its new dependencies which could break unsuspecting users.

Re: new branch. Sure. Once you're happy with it well pull it to master and package it up for pypi. I'm busy atm and not using this lib currently so I can't provide much / any testing.

On 01/07/2016, at 10:21 PM, Benjamin notifications@github.com wrote:

This has been tested a fair amount with success and is ready to go.

@adamlwgriffiths Can I have your permission to create a new branch?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

adamlwgriffiths commented 8 years ago

Hmm, deathbycaptcha isn't a free service.

If you want to do this, then it should be optional. I would add a 'captcha handler' as part of the API. If the API init function takes a new parameter of captcha_handler=None. When a captcha is detected, the default None handler would throw a 'CaptchaRequiredException' or something similar. You could then include a 'DeathByCaptcha' handler which could be passed to the API init function to handle this your way.

I don't want to enforce a paid-for service in the API itself. Being a non-free service also implies there are alternatives, which means flexibility here is a good thing.

benperove commented 8 years ago

Agreed - captcha handling should be optional, as other alternatives may exist for dealing with them, or simply throwing an exception may be adequate.

The code that I was testing was filtering request responses in the same script, but this doesn't make sense for the codebase overall.

I'll take your comment into consideration, Adam, as I start examining the API more closely.

adamlwgriffiths commented 8 years ago

The codebase isn't exactly anything to write home about. I wouldn't go overboard engineering a great solution, just something simple, logical, and flexible.

I think if each of the soup @property methods was changed it would be easier. At the moment they call to a global get function which does rate limiting. I think the entirety of the soup functions should be promoted to the core __init__ file. After the BeautifulSoup code is loaded, a check can be put in place for a captcha page. If detected, the API.captcha_handler or what-have-you is then called to see if it can be handled. appriately. If handled, the function can have another go at downloading the HTML and reparsing it.

This way, all classes can seemlessly get the captcha detection/handling.

benperove commented 8 years ago

I've made some good progress within the last day. All requests run through the API and are filtered (via both regex and the bs4 interface). I've also added a hook for whatever action/service people may wish to use in order to deal with captchas whenever detected, otherwise defaulting to writing a log message.

One tiny bug that I've encountered - I'm seeing loop behavior in logic where I would not expect. Should be quick to get that ironed out though.

Can I have permission to create a branch in order to save my work?

adamlwgriffiths commented 8 years ago

Ahhh, if you just fork the repo you can commit all you want.

benperove commented 8 years ago

I was finally was able to make a commit on the captcha plugin.

https://github.com/benperove/amazon_scraper/commit/df2752df48bdc165c20163cbad83b251f5b3a251

adamlwgriffiths commented 8 years ago

Hey Ben, I added some comments inline in that commit. I think the coupling to DBC is too tight. As I said above, there should just be a basic captcha handler which just detects the captcha and throws an exception. The user should pass the DBC handler in at API construction time (if not passed in, it should fall back to the default mentioned above). The amazon core API should have no knowledge of DBC, only that there are potential captchas and that there are methods to handle them.

If you want I can try and find some time to add a basic framework for this that you can then plug your DBC into. That said, I'm pretty busy at the moment.