adamlwgriffiths / amazon_scraper

Provides content not accessible through the standard Amazon API

Can't get ASIN for reviews #1

Closed: patrickbeeson closed this issue 9 years ago

patrickbeeson commented 9 years ago

I'm running into an odd error when trying to get the ASIN for reviews. I get a Reviews object, but oddly can't get at the ASIN, despite the fact that the attribute is there on the span. Any help?

...
RS value: <amazon_scraper.reviews.Reviews object at 0x10c0c9890>
...

Here's my traceback:

Traceback (most recent call last):
  File "amazon_reviews.py", line 112, in <module>
    scrapeAmazonReviews(filepath, title, amzn, url)
  File "amazon_reviews.py", line 83, in scrapeAmazonReviews
    print "RS ASID: %s" % rs.asin
  File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
    return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'

adamlwgriffiths commented 9 years ago

Could you provide the review id (and preferably the ASIN too, if you have the product object)?

print rs.url
print rs.id

I'll take a look after I catch some sleep =)

adamlwgriffiths commented 9 years ago

I had a quick look; the asin property works for both the reviews and review objects, albeit only for the products I'm testing against.

Product page HTML varies drastically (and continually evolves), but I've never seen any variation in the review HTML format, so I'm curious which product you're getting reviews from.

patrickbeeson commented 9 years ago

Here's a log from my interpreter (I get the same result from any product I try):

>>> amzn = AmazonScraper('access', 'secret', 'associate_tag')
>>> item = amzn.lookup(ItemId='B00008MOQA')
>>> item.title
'Swiffer WetJet Spray, Mop Floor Cleaner Starter Kit (Packaging May Vary)'
>>> item.url
'http://www.amazon.com/dp/B00008MOQA'
>>> item.reviews_url
'http://www.amazon.com/product-reviews/B00008MOQA/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending'
>>> item_reviews = amzn.lookup(ItemId='B00008MOQA')
>>> item_reviews = amzn.reviews(URL=item.reviews_url)
>>> item_reviews.asin
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
    return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'
>>> item_reviews.ids
[]

patrickbeeson commented 9 years ago

I can't get item_reviews.id or item_reviews.url without getting the same ASIN error. The ASIN for this product should be B00008MOQA.

adamlwgriffiths commented 9 years ago

Ok I can reproduce that. I'll get to work on it now.

adamlwgriffiths commented 9 years ago

Ok, the logic was fine; it's just that the Python HTML parsers are a bit schizophrenic. Using bsoup4 with 'html.parser' errored only on that one page (in my tests, anyway). I've changed bsoup to use the default parser for reviews and left the others as-is. It still passes all tests. I've also added some other fixes. I'll push this release up in a tick.
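The failure mode can be sketched in a few lines (a minimal illustration, not the actual amazon_scraper code; the markup here is made up): when the chosen parser repairs a page differently than expected, find() returns None, and subscripting None raises the TypeError seen in the traceback above. Guarding the lookup avoids the crash:

```python
from bs4 import BeautifulSoup

# Hypothetical review markup; real pages differ, but the shape is similar.
page = "<div id='reviews'><span name='B00008MOQA'>reviews</span></div>"

soup = BeautifulSoup(page, "html.parser")
span = soup.find("span", attrs={"name": True})

# find() returns None when the parser's repaired tree lacks the tag;
# check before subscripting instead of letting span['name'] raise.
asin = span["name"] if span is not None else None
print(asin)  # B00008MOQA
```

The fix described above took the other route, switching the reviews code to a parser that produces the expected tree, but the None guard is worth having in scraping code regardless.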

adamlwgriffiths commented 9 years ago

I've pushed version 0.1.17 to PyPI. This should resolve the issue; if not, please re-open.

patrickbeeson commented 9 years ago

Great! I'll check it out this morning.

patrickbeeson commented 9 years ago

Just confirmed things are working as expected with the reviews I'm seeking to pull. Thanks!

igor555 commented 9 years ago

Can anyone help a noob (me) set up a scraping program for warehousedeals.com?

adamlwgriffiths commented 9 years ago

You can use libraries like https://github.com/scrapy/scrapy to trawl sites. It has some easy-to-use scraping routines based on XPath, which are nice and concise, but I found the documentation lacking in the areas I needed (controlling how it spiders websites).

Learning BeautifulSoup4 is a good start: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ But it is heavily reliant on which parser you choose ('html.parser', 'html5lib', 'lxml', etc.), which is what caused this issue (the value was in the HTML, but the HTML parsed differently for this product because the Python HTML parsers are all deficient in different ways).
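That parser-dependence is easy to demonstrate. A sketch (lxml and html5lib are optional installs, so the loop skips whichever are missing):

```python
from bs4 import BeautifulSoup, FeatureNotFound

# A classic malformed fragment: each parser repairs the stray </p>
# differently, so the same find() can succeed under one parser and
# return None under another.
markup = "<a></p>"
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        print(parser, "->", BeautifulSoup(markup, parser))
    except FeatureNotFound:
        print(parser, "-> not installed")
```

With 'html.parser' the stray </p> is simply dropped, while html5lib rebuilds a full document and even inserts an empty <p> inside the <a>, so a query written against one parser's tree can silently miss under another.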

XPath is also very handy because it's concise, and a path that doesn't exist simply results in None, no matter where in the query it failed. But XPath is quite hard to read; like regexps or Perl, you won't remember what a line of code does a month later.
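As a sketch of that "missing path is just None" behaviour without pulling in lxml, Python's standard-library ElementTree supports a small XPath subset; a path that doesn't match comes back as None (or an empty list) rather than raising part-way through:

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<html><body><div class='main'><span>B00008MOQA</span></div></body></html>"
)

# A path that exists resolves to the text content.
print(doc.findtext("./body/div/span"))      # B00008MOQA

# A path that doesn't exist yields None (find) or [] (findall),
# no matter which step of the path failed to match.
print(doc.find("./body/div/article/span"))  # None
print(doc.findall("./body/nav"))            # []
```

Note that ElementTree is an XML parser, so this only works on well-formed markup; it's shown here purely to illustrate the XPath failure semantics.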

BeautifulSoup4, on the other hand, is long-winded and prone to errors, e.g.:

tag = soup.find('div', class_='main')
span = tag.find('span')

If the div doesn't exist, tag will be None and the span = line will throw an AttributeError.
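A guarded version of the same lookup (a sketch, against markup that deliberately has no matching div) checks each step for None instead of letting the second find() raise:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='other'><span>x</span></div>", "html.parser")

# Check each step for None so a missing div yields None
# instead of an AttributeError on tag.find().
tag = soup.find("div", class_="main")
span = tag.find("span") if tag is not None else None
print(span)  # None: there is no div with class 'main'
```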

But there are no good standalone XPath libs for Python (lxml has one, but I don't like lxml; using an XML parser for HTML is a bad idea).

In short, try scrapy; if that fails, try BeautifulSoup4 and lxml's XPath.