adamlwgriffiths / amazon_scraper

Provides content not accessible through the standard Amazon API
Other
234 stars 60 forks source link

Fix up tests #8

Closed hahnicity closed 9 years ago

hahnicity commented 9 years ago

I'm going to try to take on fixing up the tests and getting them to work properly. Right now there are basic issues just getting the tests to run and we need to turn off the MaxQPS property in order to get them to go. Also there are three test failures and several errors that can be fixed. Hell maybe I'll even figure out how to throw travis on here; but that can be a separate ticket.

adamlwgriffiths commented 9 years ago

I've got Travis setup on Pyrr. https://github.com/adamlwgriffiths/Pyrr/blob/master/.travis.yml Can use this config as a base.

What does MaxQPS break? It should prevent breakages due to Amazon throttling. You will end up with HTTP requests returning errors if you exceed it.

I found that if you use the Amazon API to get data, Amazon also apply limits to HTTP requests. I believe they monitor IPs. If you don't use the API, you can hammer their HTTP pages for a loooong time before you get any throttling.

hahnicity commented 9 years ago

It says MaxQPS cannot be handed down to the AmazonAPI object

153 class AmazonScraper(object):
154     def __init__(self, access_key, secret_key, associate_tag, *args, **kwargs):
155         self.api = AmazonAPI(access_key, secret_key, associate_tag, *args, **kwargs)

Maybe I'm using an incorrect version of the amazon api pkg?

python-amazon-simple-product-api==1.5.0

Also I tracked down the errors to a problem with specific products on the web. When the product comes back from amazon it says they have no ratings. This is fine, I assume some products will not have ratings in general. But then when we are doing product.to_dict() we are trying to coerce product.ratings into the dict accidentally because we have modified the __getattr__ method to call self.product instead of self. Do we want this functionality. I have a few ways of working around it, none are really pretty. The real fix removing the special __getattr__ handling, but does that break anything in a terrible way?

 12 class Product(object):
 13     def __init__(self, product):
 14         self.product = product
 15         self._soup = None
 16         
 17     def __getattr__(self, name):
 18         # allow direct access to the product object
 19         return getattr(self.product, name)
hahnicity commented 9 years ago

I'm also having some difficulty with an asin validation for another amazon review. This is the review: http://www.amazon.com/review/R1P0CKXI82A1LT

And this is the main product review page: http://www.amazon.com/product-reviews/B0051QVF7A/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending

The asins are mixed up for review and reviews. We're able to get the correct asin for the reviews of

B0051QVF7A

But the single review seems to differ by saying

B007TKK5SG

I think the single review was migrated from the other product into a new products page. Now that we know that amazon does this I'm going to just remove the single review asin test because its not reliable.

adamlwgriffiths commented 9 years ago

Yeah I've come across this. Where reviews are for related products (hard cover, paper back, ebook, etc), Amazon will provide the same reviews. The review ASIN can point to a different parent product media format than what you originally searched for. It's a pain in the ass.

The different formats also do this too (product formats being paper back, hard cover, kindle, etc), quite often one format will have a different list of alternatives than others. It's quite difficult to deal with all the peculiarities.

adamlwgriffiths commented 9 years ago

| It says MaxQPS cannot be handed down to the AmazonAPI object Does this cause an error?

The product api and bottlenose both have a MaxQPS property. https://github.com/yoavaviram/python-amazon-simple-product-api/blob/master/amazon/api.py https://github.com/lionheart/bottlenose/blob/master/bottlenose/api.py

The __getattr__ is to provide seamless wrapping of the product api lib. Any property / method that doesn't exist in the wrapped class, will be redirected to the product api class.

To avoid getting an existing property from the product api class, add the name to the blacklist in to_dict.

    def to_dict(self):
        <snip>
        # add the python properties
        d.update({
            k:getattr(self.product, k)
            for k in dir(self.product)
            if dict_acceptable(self.product, k, blacklist=['browse_nodes', 'api']) # put 'ratings' here
        })

Otherwise, each access of the above could be wrapped in a function with a try/except block. This would ensure that no strange products break the class. I think doing it this way is better, we can then leave the blacklist for backend functions only, as it is now.

hahnicity commented 9 years ago

OK so I used your suggestion with the ratings blacklist and things are passing.

As for MaxQPS python-amazon-simple-product-api is 1.5.0 on Pypi but it is 2.0 on github :-1: The MaxQPS kwarg is not allowed on 1.5.0 but is on 2.0. So you'd have to build the pkg from github to actually get the MaxQPS setting. https://github.com/yoavaviram/python-amazon-simple-product-api/issues/55

I can probably add the github dependency in a requirements.txt file but im not sure how nice that plays with setuptools. I'll do some digging and get some fixes in.

adamlwgriffiths commented 9 years ago

I guess I was using a pip install from git and forgot.

You can provide pip with enough information to handle clashes and versioning.

pip install git+https://github.com/username/repo.git@MyTag#egg=ProjectName

As for the actual setup.py ....not so sure on that one =/

Maybe just comment out the MaxQPS parameters in the tests for now until 2.0 is released.

Do whatever you feel is best / works out of the box.

Edit: Just realised you had already done the requirements.txt. So please ignore that.