adamlwgriffiths / amazon_scraper

Provides content not accessible through the standard Amazon API

Random stopping on multiples of 10 #29

Closed mattrocklage closed 8 years ago

mattrocklage commented 8 years ago

The scraper stops at seemingly random points when I try to scrape all the reviews for a product (using 'reviews'). I don't receive an error, and the 'soup' output for the last review it scraped isn't informative (i.e., it looks like the reviews before it). This isn't tied to particular products, nor does it happen after the same review each time: if I re-scrape the same product, it sometimes starts or stops at a different review, and sometimes all of the reviews are scraped. If I am scraping multiple products one after the other, the scraper just continues on to the next product, even though it didn't scrape all the reviews for the first one. Finally, the scraper tends to stop on multiples of 10 (e.g., 80, 110, etc.). This makes me believe it has something to do with continuing on to the next page.

Here is the code I'm using (along with a product ID where the scraper randomly stopped):

p = amzn.lookup(ItemId='B008LX6OC6') #also ItemID='B000F8EUFI'
rs = p.reviews()
for review in rs:
    print review.asin
    print review.url
    print review.soup
adamlwgriffiths commented 8 years ago

I'm quite busy so can't directly check. You are correct about the multiples: there are N preview reviews per page. If you're always getting stopped on a multiple of N, that suggests the next-page code is failing. It's very possible there are issues due to the nature of scraping and Amazon constantly changing its HTML layouts.

The way to check this is to keep the rs object when iteration fails on a multiple of N, then print out the soup. Dump it here or in a Gist and we can look further into it.

The 'next page' code always tries to find the URL from an anchor tag for the next page along. I would hazard a guess that the next-page BeautifulSoup calls aren't working with a new form of HTML layout. It's also possible you're scraping a category I haven't tried, as different parts of the Amazon website use different HTML layouts. It's a bit of a nightmare.

As an aside, are you doing a lot of scraping before this happens? Is it possible it's a robot / captcha check kicking in? See this issue for some information. https://github.com/adamlwgriffiths/amazon_scraper/issues/25

If you've done a lot in the past it can take a while to 'cool down'. If it were captcha, I would expect errors in other areas though, not just review iteration.

mattrocklage commented 8 years ago

I'd be happy to check the rs.soup. How can I get the soup for each new page it obtains?

print rs.soup gives me the very first page, but how do I get the subsequent soups for each next page?

adamlwgriffiths commented 8 years ago

Ah, good point. Instead of iterating over rs, check the length of rs.brief_reviews. It's a generator, so you'll need to turn it into a list first: len(list(rs.brief_reviews)). The next page can be retrieved manually with rs = Reviews(api, URL=rs.next_page_url).

mattrocklage commented 8 years ago

I'm sorry, that's not clear to me. I get the error "NameError: name 'api' is not defined" with the following code:

p = amzn.lookup(ItemId='B008LX6OC6')
rs = p.reviews(api, URL=rs.next_page_url)
print len(list(rs.brief_reviews))
adamlwgriffiths commented 8 years ago

api will be the amzn object you instantiated initially.

mattrocklage commented 8 years ago

I get the error "NameError: name 'rs' is not defined" with the following code:

from amazon_scraper import AmazonScraper
from amazon_scraper import reviews

amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
p = amzn.lookup(ItemId='B008LX6OC6')
rs = p.reviews(api=amzn, URL=rs.next_page_url)
print len(list(rs.brief_reviews))

Will this code work correctly even if rs is defined? I'm not sure I understand how this iterates across pages and displays the soup.

adamlwgriffiths commented 8 years ago

You cannot pass rs to itself; it's not defined yet =P If you manually create a Reviews object (amazon_scraper.Reviews), then you need to pass in the api object (amzn). If you call reviews() from the amzn object itself, it will pass itself in for you.

This should do it.

from amazon_scraper import AmazonScraper
from amazon_scraper import reviews

amzn = AmazonScraper("XXXX", "XXXX", "XXXX")
# get the product
p = amzn.lookup(ItemId='B008LX6OC6')
# get the reviews page
rs = p.reviews()
# begin scanning review page
while rs.next_page_url:
    # get next review page
    rs = amzn.reviews(URL=rs.next_page_url)

# this review page has no next-page link, so this is where iteration stopped
print(rs.url)
print(rs.next_page_url)
print(rs.soup)
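
The loop above can be combined with the earlier len(list(rs.brief_reviews)) suggestion to record how many brief reviews each page yields. This is only a sketch: walk_review_pages, fetch_page and max_pages are hypothetical names, not part of amazon_scraper, and the function assumes nothing beyond the url, brief_reviews and next_page_url attributes mentioned in this thread.

```python
def walk_review_pages(first_page, fetch_page, max_pages=50):
    """Follow next_page_url links and record (url, brief review count) per page.

    first_page: an object with .url, .brief_reviews and .next_page_url
    fetch_page: a callable that takes a URL and returns the next page object
    max_pages:  safety cap so a bad link cannot loop forever (assumed value)
    """
    visited = []
    page = first_page
    for _ in range(max_pages):
        # brief_reviews is a generator, so it has to be consumed to be counted
        visited.append((page.url, len(list(page.brief_reviews))))
        if not page.next_page_url:
            break
        page = fetch_page(page.next_page_url)
    # the last page is returned so its soup can be inspected afterwards
    return visited, page
```

With the objects above this would presumably be called as visited, last = walk_review_pages(p.reviews(), lambda url: amzn.reviews(URL=url)), after which last.soup is the page where the walk ended.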
mattrocklage commented 8 years ago

Yep, it's a CAPTCHA once again:

<!DOCTYPE html>

<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]>    <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]>    <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
<meta charset="utf-8">
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
<title dir="ltr">Robot Check</title>
<meta content="width=device-width" name="viewport">
<link href="http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-3c39b52ef832b0823a6dc102407707c29d14c9a1.min._V1_.css" rel="stylesheet">
<script>

if (true === true) {
    var ue_t0 = (+ new Date()),
        ue_csm = window,
        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
        ue_furl = "fls-na.amazon.com",
        ue_mid = "ATVPDKIKX0DER",
        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
        ue_sn = "opfcaptcha.amazon.com",
        ue_id = '0V5H1SRSJ8XH9MV6959W';
}
</script>
</link></meta></meta></meta></meta></head>
<body>
<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
<!--
Correios.DoNotSend
-->
<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">
<div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">
<div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
<div class="a-box a-alert a-alert-info a-spacing-base">
<div class="a-box-inner">
<i class="a-icon a-icon-alert"></i>
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
<div class="a-section">
<div class="a-box a-color-offset-background">
<div class="a-box-inner a-padding-extra-large">
<form action="/errors/validateCaptcha" method="get" name="">
<input name="amzn" type="hidden" value="/hwieKbRvn0ObRRJqdNi+g=="/><input name="amzn-r" type="hidden" value="/Dirt-Devil-Dynamite-Bagless-M084650RED/product-reviews/B000F8EUFI/ref=cm_cr_arp_d_paging_btm_9?ie=UTF8&amp;pageNumber=9&amp;sortBy=bySubmissionDateDescending"/><input name="amzn-pt" type="hidden" value="NoPageType"/>
<div class="a-row a-spacing-large">
<div class="a-box">
<div class="a-box-inner">
<h4>Type the characters you see in this image:</h4>
<div class="a-row a-text-center">
<img src="http://ecx.images-amazon.com/captcha/qujzzelu/Captcha_xsukjijfmx.jpg">
</img></div>
<div class="a-row a-spacing-base">
<div class="a-row">
<div class="a-column a-span6">
</div>
<div class="a-column a-span6 a-span-last a-text-right">
<a onclick="window.location.reload()">Try different image</a>
</div>
</div>
<input autocapitalize="off" autocomplete="off" autocorrect="off" class="a-span12" id="captchacharacters" name="field-keywords" placeholder="Type characters" spellcheck="false" type="text">
</input></div>
</div>
</div>
</div>
<div class="a-section a-spacing-extra-large">
<div class="a-row">
<span class="a-button a-button-primary a-span12">
<span class="a-button-inner">
<button class="a-button-text" type="submit">Continue shopping</button>
</span>
</span>
</div>
</div>
</form>
</div>
</div>
</div>
</div>
<div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>
<div class="a-text-center a-spacing-small a-size-mini">
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&amp;nodeId=508088">Conditions of Use</a>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&amp;nodeId=468496">Privacy Policy</a>
</div>
<div class="a-text-center a-size-mini a-color-secondary">
          © 1996-2014, Amazon.com, Inc. or its affiliates
          <script>
           if (true === true) {
             document.write('<img src="http://fls-na.amaz'+'on.com/'+'1/oc-csi/1/OP/requestId=0V5H1SRSJ8XH9MV6959W&js=1" />');
           };
          </script>
<noscript>
<img src="http://fls-na.amazon.com/1/oc-csi/1/OP/requestId=0V5H1SRSJ8XH9MV6959W&amp;js=0"/>
</noscript>
</div>
</div>
<script>
    if (true === true) {
        var elem = document.createElement("script");
        elem.src = "https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";
        document.getElementsByTagName('head')[0].appendChild(elem);
    }
    </script>
</body></html>
adamlwgriffiths commented 8 years ago

Yeah I need to add a check for the captcha page. I just don't have time to play around with this library at the moment.

If you find the issue isn't captcha related, please re-open.
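
A minimal sketch of the kind of CAPTCHA check mentioned above, not code from amazon_scraper. It assumes the robot-check page can be recognised by two markers visible in the dump earlier in this thread: the page title "Robot Check" and a form posting to /errors/validateCaptcha. The names looks_like_captcha and _CaptchaSniffer are hypothetical, and the sketch is written for Python 3 using only the standard library's html.parser. Amazon may change either marker at any time.

```python
from html.parser import HTMLParser


class _CaptchaSniffer(HTMLParser):
    """Collects the page title and the action attribute of every form."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""
        self.form_actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "form":
            # attrs is a list of (name, value) pairs; value may be None
            self.form_actions.append(dict(attrs).get("action") or "")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def looks_like_captcha(html):
    """Return True if the HTML carries either assumed robot-check marker."""
    sniffer = _CaptchaSniffer()
    sniffer.feed(html)
    has_title = sniffer.title.strip() == "Robot Check"
    has_form = any("validateCaptcha" in a for a in sniffer.form_actions)
    return has_title or has_form
```

With the objects from this thread, calling looks_like_captcha(str(rs.soup)) after each page fetch would be one way to raise an error instead of silently ending iteration.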