Closed patrickbeeson closed 9 years ago
Could you provide the review id (and preferably the ASIN too, if you have the product object)?
print rs.url
print rs.id
I'll take a look after I catch some sleep =)
I had a quick look: the asin property works for both the reviews and review objects, at least for the products I'm testing against.
The HTML of product pages changes drastically (and continually evolves), but I've never seen any variation in the review HTML format. So I'm curious to see what product you're getting reviews from.
Here's a log from my interpreter (I get the same result from any product I try):
>>> amzn = AmazonScraper ('access', 'secret', 'associate_tag')
>>> item = amzn.lookup(ItemId='B00008MOQA')
>>> item.title
'Swiffer WetJet Spray, Mop Floor Cleaner Starter Kit (Packaging May Vary)'
>>> item.url
'http://www.amazon.com/dp/B00008MOQA'
>>> item.reviews_url
'http://www.amazon.com/product-reviews/B00008MOQA/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending'
>>> item_reviews = amzn.lookup(ItemId='B00008MOQA')
>>> item_reviews = amzn.reviews(URL=item.reviews_url)
>>> item_reviews.asin
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/pbeeson/.virtualenvs/custom_data_pulls/lib/python2.7/site-packages/amazon_scraper/reviews.py", line 41, in asin
return unicode(span['name'])
TypeError: 'NoneType' object has no attribute '__getitem__'
>>> item_reviews.ids
[]
I can't get item_reviews.id or item_reviews.url without getting the same ASIN error. The ASIN for this product should be B00008MOQA.
Ok I can reproduce that. I'll get to work on it now.
OK, the logic was fine; it's just that the Python HTML parsers are a bit inconsistent. Using BeautifulSoup 4 with 'html.parser' errored only on that one page (in my tests, anyway). I've changed BeautifulSoup to use the default parser for reviews and left the others as they were. All tests still pass. I've also added some other fixes. I'll push this release up in a tick.
I've pushed version 0.1.17 to PyPI. This should resolve the issue; if not, please re-open.
Great! I'll check it out this morning.
Just confirmed things are working as expected with the reviews I'm seeking to pull. Thanks!
Can anyone help a noob (me) set up a scraping program for warehousedeals.com?
You can use libraries like https://github.com/scrapy/scrapy to trawl sites. It has some easy-to-use scraping routines based on XPath that are nice and concise, but I found the documentation lacking in the areas I needed (controlling the way it spiders websites).
Learning BeautifulSoup4 is a good start: http://www.crummy.com/software/BeautifulSoup/bs4/doc/ But it relies heavily on which parser you choose ('html.parser', 'html5lib', 'lxml', etc.), which is what caused this issue (the value was in the HTML, but the HTML parsed differently for this product because the Python HTML parsers are all deficient in different ways).
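To see how much the parser choice matters, here is a minimal sketch that feeds the same malformed snippet to each parser and prints the tree each one builds. The snippet and the helper name parse_with_each are made up for illustration; they are not from amazon_scraper. Parsers that aren't installed are skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Deliberately malformed markup (hypothetical, not taken from Amazon).
BROKEN_HTML = "<a></p>"


def parse_with_each(html, parsers=("html.parser", "html5lib", "lxml")):
    """Return {parser_name: serialized tree}, or None for a missing parser."""
    results = {}
    for name in parsers:
        try:
            results[name] = str(BeautifulSoup(html, name))
        except FeatureNotFound:
            # bs4 raises this when the requested parser library is not installed.
            results[name] = None
    return results


if __name__ == "__main__":
    for name, tree in parse_with_each(BROKEN_HTML).items():
        print(name, "->", tree)
```

If the printed trees differ, any find() call that depends on the exact nesting can succeed under one parser and return None under another.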
XPath is also very handy because it's concise, and a path that doesn't exist results in None (or an empty result, depending on the library), no matter where in the query it failed. But XPath is quite hard to understand; it's like regexp or Perl. You won't remember what a line of code does a month later.
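As a rough illustration of the "missing path gives None" behaviour, here is a sketch using the limited XPath subset in the standard library's xml.etree.ElementTree. This is an assumption made to keep the example dependency-free: ElementTree is an XML parser, so it only works on well-formed markup, and the sample document and the helper span_name are hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed sample markup (hypothetical). ElementTree is an XML parser
# and will reject the malformed HTML found on many real pages.
DOC = ET.fromstring(
    "<html><body>"
    "<div class='main'><span name='B00008MOQA'>review</span></div>"
    "</body></html>"
)


def span_name(root, div_class):
    """Return the span's name attribute, or None if any path step is missing."""
    # One query covers both steps: find() returns None whether the div
    # or the span is the part that does not exist.
    span = root.find(f".//div[@class='{div_class}']/span")
    return None if span is None else span.get("name")


if __name__ == "__main__":
    print(span_name(DOC, "main"))
    print(span_name(DOC, "missing"))
```

The whole lookup is one expression, so there is only one place to check for None.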
BeautifulSoup4, on the other hand, is long-winded and prone to errors, e.g.
tag = soup.find('div', class_='main')
span = tag.find('span')
If the div doesn't exist, tag is None and the span = line will raise an AttributeError.
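One way to avoid that exception is to check each lookup for None before chaining the next one. This is only a sketch: the helper name first_span_text and the sample markup in the usage lines are made up, and 'html.parser' is chosen just because it ships with Python.

```python
from bs4 import BeautifulSoup


def first_span_text(html, div_class="main"):
    """Return the text of the first span inside the div, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    tag = soup.find("div", class_=div_class)
    if tag is None:
        # The div is missing, so there is nothing to search inside.
        return None
    span = tag.find("span")
    return None if span is None else span.get_text()


if __name__ == "__main__":
    print(first_span_text('<div class="main"><span>hi</span></div>'))
    print(first_span_text("<p>no div here</p>"))
```

It is more verbose than the two-line version, which is the trade-off being described here.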
But there are no good XPath libraries for Python (lxml has one, but I don't like lxml; using an XML parser for HTML is a bad idea).
In short, try Scrapy. If that fails, try BeautifulSoup 4 and lxml's XPath.
I'm running into an odd error when trying to get the ASIN for reviews. I get an RS object, but oddly can't get at the ASIN despite the fact it has that attribute on the span. Any help?
Here's my traceback: