adamlwgriffiths / amazon_scraper

Provides content not accessible through the standard Amazon API

Page sometimes not loading? #25

Closed mattrocklage closed 8 years ago

mattrocklage commented 8 years ago

I'm having sporadic trouble when extracting the ASIN using reviews/full_review:

rs = amzn.reviews(ItemId='006001203X')
for r in rs:
    fr = r.full_review()
    myfile.write("%s," % (fr.asin))

I'm sometimes getting the error:

asin = unicode(tag.string)
AttributeError: 'NoneType' object has no attribute 'string'

My guess is that I'm not getting the content of the page when this error is occurring because the individual review's URL is passed on correctly (fr.url) and I can see that the content exists in my browser, but I am getting "None" when asking for the text of the review (fr.text). Furthermore, sometimes the scraper errors on a specific review and sometimes it doesn't, again making me think this is a loading issue.

In case it helps, I'm using the scraper in conjunction with Tor and PySocks (maybe not necessary?). What would lead to pages sometimes not loading? Any solutions to this issue?

UPDATE:

Here is some output when just printing out the reviews (rather than writing them). The format is the review URL followed by the text. What you'll notice is that "None" just seems to appear randomly and when you visit the actual page, there is writing there.

http://www.amazon.com/review/R1GLFST9IJDL3Z
None
http://www.amazon.com/review/R3O5KSEJ5BONJ7
Written by Dr. Atkins, this book is definitely a good way to get started on the diet. My only reservation is that he spends an awful long time convincing the reader to start the diet. But a good resource for a low/no carb diet.
http://www.amazon.com/review/R353I88IYNVGZJ
Thank you it is what I was looking for
http://www.amazon.com/review/R22GIPYTEYX7IK
None

Also, I have seen this happen both with and without using Tor/PySocks.

adamlwgriffiths commented 8 years ago

It's difficult to rely on the HTML you see in your browser. The user agent string makes a big difference to what Amazon will send you. The best way to check is to print out obj.soup (or save it to a file). You'll notice it can be quite different from what your browser gets. Often it's a totally different layout based on the user agent.

Sometimes it's the soup parser trying to "fix" bad HTML. Often it does a good job, but sometimes it breaks the HTML by prematurely closing tags. Changing the HTML parser can sometimes fix things, but I've found the built-in HTML parser far superior to even the html5lib parser, which is claimed to be better. And it's far, far better than lxml, which seems to sulk in a corner if HTML tags aren't closed.
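A minimal sketch of saving the parsed page for inspection, so it can be compared with what the browser shows. `dump_soup` is a hypothetical helper, not part of amazon_scraper; it assumes only that the object exposes a `soup` attribute, as described above.

```python
def dump_soup(obj, path):
    """Write obj.soup to `path` as UTF-8 and return the number of characters written.

    Hypothetical helper: assumes only that `obj` has a `soup` attribute
    whose str() form is the parsed HTML.
    """
    soup = getattr(obj, "soup", None)
    if soup is None:
        raise ValueError("object has no soup to dump")
    html = str(soup)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(html)
    return len(html)
```

Calling something like `dump_soup(fr, "failed_review.html")` right where the `AttributeError` is caught should leave a file that can be opened in a browser or attached to a Gist.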

If it were failing to download, the exception would be a requests/httplib exception. It should also retry the download if it fails.

I have noticed errors start to increase if you scrape rapidly. Amazon begin to return error codes, but again this should trigger an http exception. Plus I have the retry code in there to help with this.
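For illustration only, here is a generic retry-with-backoff sketch. It is not the retry code that ships with the library; `fetch_with_backoff` and its parameters are made-up names, and catching every `Exception` is a simplification that real code would narrow to HTTP errors.

```python
import time


def fetch_with_backoff(fetch, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the pause after each failure.

    `fetch` is any zero-argument callable that raises on failure.
    The last exception is re-raised once `attempts` is exhausted.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception:
            if attempt == attempts:
                raise
            sleep(delay)  # wait before the next attempt
            delay *= 2    # back off further each time
```

The `sleep` argument exists only so the pauses can be replaced in tests; in normal use the default `time.sleep` applies.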

Quite often what you find is Amazon A/B testing / throwing off scrapers by sending different HTML randomly. They also change HTML templates frequently and move content inside iframes. It looks like either the asin extraction logic isn't robust enough for some pages, or they've got some different HTML templates that the code doesn't handle at all.

Can you tell me, does a review consistently return the error? Can you also provide the soup (if there is one) in a Gist (it'll be huge) for when the error occurs? That way I can load it into the parser and try it out.

mattrocklage commented 8 years ago

The following occurred after 203 reviews. It looks like a CAPTCHA:

<!DOCTYPE html>

<!--[if lt IE 7]> <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8 a-lt-ie7"> <![endif]-->
<!--[if IE 7]>    <html lang="en-us" class="a-no-js a-lt-ie9 a-lt-ie8"> <![endif]-->
<!--[if IE 8]>    <html lang="en-us" class="a-no-js a-lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="a-no-js" lang="en-us"><!--<![endif]--><head>
<meta content="text/html; charset=utf-8" http-equiv="content-type">
<meta charset="utf-8">
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
<title dir="ltr">Robot Check</title>
<meta content="width=device-width" name="viewport">
<link href="http://z-ecx.images-amazon.com/images/G/01/AUIClients/AmazonUI-3c39b52ef832b0823a6dc102407707c29d14c9a1.min._V1_.css" rel="stylesheet">
<script>

if (true === true) {
    var ue_t0 = (+ new Date()),
        ue_csm = window,
        ue = { t0: ue_t0, d: function() { return (+new Date() - ue_t0); } },
        ue_furl = "fls-na.amazon.com",
        ue_mid = "ATVPDKIKX0DER",
        ue_sid = (document.cookie.match(/session-id=([0-9-]+)/) || [])[1],
        ue_sn = "opfcaptcha.amazon.com",
        ue_id = '1GTREZTECASBY2JZ975J';
}
</script>
</link></meta></meta></meta></meta></head>
<body>
<!--
        To discuss automated access to Amazon data please contact api-services-support@amazon.com.
        For information about migrating to our APIs refer to our Marketplace APIs at https://developer.amazonservices.com/ref=rm_c_sv, or our Product Advertising API at https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html/ref=rm_c_ac for advertising use cases.
-->
<!--
Correios.DoNotSend
-->
<div class="a-container a-padding-double-large" style="min-width:350px;padding:44px 0 !important">
<div class="a-row a-spacing-double-large" style="width: 350px; margin: 0 auto">
<div class="a-row a-spacing-medium a-text-center"><i class="a-icon a-logo"></i></div>
<div class="a-box a-alert a-alert-info a-spacing-base">
<div class="a-box-inner">
<i class="a-icon a-icon-alert"></i>
<h4>Enter the characters you see below</h4>
<p class="a-last">Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.</p>
</div>
</div>
<div class="a-section">
<div class="a-box a-color-offset-background">
<div class="a-box-inner a-padding-extra-large">
<form action="/errors/validateCaptcha" method="get" name="">
<input name="amzn" type="hidden" value="Nrt5O147Du+w8h8rG4cJ+g=="/><input name="amzn-r" type="hidden" value="/review/R6YR5THLNZBF"/><input name="amzn-pt" type="hidden" value="CustomerReviews"/>
<div class="a-row a-spacing-large">
<div class="a-box">
<div class="a-box-inner">
<h4>Type the characters you see in this image:</h4>
<div class="a-row a-text-center">
<img src="http://ecx.images-amazon.com/captcha/docvmtpr/Captcha_uhoyoghlbo.jpg">
</img></div>
<div class="a-row a-spacing-base">
<div class="a-row">
<div class="a-column a-span6">
</div>
<div class="a-column a-span6 a-span-last a-text-right">
<a onclick="window.location.reload()">Try different image</a>
</div>
</div>
<input autocapitalize="off" autocomplete="off" autocorrect="off" class="a-span12" id="captchacharacters" name="field-keywords" placeholder="Type characters" spellcheck="false" type="text">
</input></div>
</div>
</div>
</div>
<div class="a-section a-spacing-extra-large">
<div class="a-row">
<span class="a-button a-button-primary a-span12">
<span class="a-button-inner">
<button class="a-button-text" type="submit">Continue shopping</button>
</span>
</span>
</div>
</div>
</form>
</div>
</div>
</div>
</div>
<div class="a-divider a-divider-section"><div class="a-divider-inner"></div></div>
<div class="a-text-center a-spacing-small a-size-mini">
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_cou?ie=UTF8&amp;nodeId=508088">Conditions of Use</a>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<span class="a-letter-space"></span>
<a href="http://www.amazon.com/gp/help/customer/display.html/ref=footer_privacy?ie=UTF8&amp;nodeId=468496">Privacy Policy</a>
</div>
<div class="a-text-center a-size-mini a-color-secondary">
          © 1996-2014, Amazon.com, Inc. or its affiliates
          <script>
           if (true === true) {
             document.write('<img src="http://fls-na.amaz'+'on.com/'+'1/oc-csi/1/OP/requestId=1GTREZTECASBY2JZ975J&js=1" />');
           };
          </script>
<noscript>
<img src="http://fls-na.amazon.com/1/oc-csi/1/OP/requestId=1GTREZTECASBY2JZ975J&amp;js=0"/>
</noscript>
</div>
</div>
<script>
    if (true === true) {
        var elem = document.createElement("script");
        elem.src = "https://images-na.ssl-images-amazon.com/images/G/01/csminstrumentation/csm-captcha-instrumentation.min._V" + (+ new Date()) + "_.js";
        document.getElementsByTagName('head')[0].appendChild(elem);
    }
    </script>
</body></html>

adamlwgriffiths commented 8 years ago

I thought that might be what was occurring - I've had similar experience with mass scraping Amazon, but I've never seen the HTML first hand.

There's not much I can do about it. There is retry code in there with a pause between attempts. But once Amazon start blacklisting you as a scraper, the delay needs to be quite large to get removed from the list.

Perhaps adding a method to detect the captcha would be good. This could provide feedback to the user.

These look like tell-tale signs of a captcha page.

<title dir="ltr">Robot Check</title>
<form action="/errors/validateCaptcha" method="get" name="">
<img src="http://ecx.images-amazon.com/captcha/docvmtpr/Captcha_uhoyoghlbo.jpg">
<input autocapitalize="off" autocomplete="off" autocorrect="off" class="a-span12" id="captchacharacters" name="field-keywords" placeholder="Type characters" spellcheck="false" type="text"> </input>
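A rough sketch of what such a check might look like, using plain substring tests on the raw HTML. `looks_like_captcha` and `CAPTCHA_MARKERS` are hypothetical names, and the markers are taken only from the single captcha page posted in this thread, so Amazon may well serve variants this misses.

```python
# Markers copied from the captcha page shown earlier in this thread.
CAPTCHA_MARKERS = (
    "Robot Check",               # page <title>
    "/errors/validateCaptcha",   # form action
    'id="captchacharacters"',    # text input for the answer
    "/captcha/",                 # path segment of the captcha image URL
)


def looks_like_captcha(html, min_hits=2):
    """Return True if at least `min_hits` known captcha markers appear in `html`."""
    hits = sum(1 for marker in CAPTCHA_MARKERS if marker in html)
    return hits >= min_hits
```

Requiring two markers rather than one is a guess at avoiding false positives, for example a genuine review that happens to contain the words "Robot Check".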

mattrocklage commented 8 years ago

Ah, I see. Thanks for your help. Two quick questions: 1) when you've done mass scraping before, how have you gone about getting around that problem? 2) Is it possible to use the "reviews" (not the "review") method from the script to get the same information? Here mass scraping would be minimized by only asking for the larger pages and taking individual reviews from the larger page.

adamlwgriffiths commented 8 years ago

Any reviews/products that fail, mark and scrape later. Just keep the ID of the resource. If it fails too many times, then it could be a bad template and you may want to drop it or manually check.
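The mark-and-retry-later approach could be sketched roughly as below. `scrape_with_deferred_retries`, `scrape`, and `max_failures` are illustrative names rather than anything in amazon_scraper, and a real run would also pause between passes to avoid hitting the captcha again.

```python
from collections import deque


def scrape_with_deferred_retries(ids, scrape, max_failures=3):
    """Scrape each ID, re-queueing failures until they hit `max_failures`.

    Returns (results, dropped): a dict of ID -> scraped value, and the
    list of IDs that failed too often and need a manual check.
    """
    pending = deque(ids)
    failures = {}
    results = {}
    dropped = []
    while pending:
        item_id = pending.popleft()
        try:
            results[item_id] = scrape(item_id)
        except Exception:
            failures[item_id] = failures.get(item_id, 0) + 1
            if failures[item_id] >= max_failures:
                dropped.append(item_id)   # possibly a bad template
            else:
                pending.append(item_id)   # try again after the others
    return results, dropped
```

Only the IDs are kept between attempts, which matches the advice above: a failed review costs nothing to remember and can be re-fetched from its ID later.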

Reviews scrapes the paginated review list, e.g. http://www.amazon.com/Logitech-920-002232-Gaming-Keyboard-G110/product-reviews/B002RRLQIO/ref=cm_cr_dp_see_all_summary?ie=UTF8&showViewpoints=1&sortBy=helpful

You should be able to get equivalent review content from the paginated version. The main reason to scrape a specific review page would be to rescrape specific reviews later.

mattrocklage commented 8 years ago

As a follow up, I'm looking through the code and trying to see if you've already written code to extract the reviews from the paginated version. It looks like maybe not?

adamlwgriffiths commented 8 years ago

https://github.com/adamlwgriffiths/amazon_scraper/blob/master/amazon_scraper/reviews.py

reviews.py is related to the paginated review content. Reviews handles the pagination and provides methods to get each 'sub review'. SubReview is a thin wrapper that takes the soup from the page and extracts review data.

review.py is individual, single-page reviews.

adamlwgriffiths commented 8 years ago

Iterating over the reviews page will give you the sub reviews. The following example from the README demonstrates this.

>>> p = amzn.lookup(ItemId='B0051QVF7A')
>>> rs = p.reviews()
>>> rs.asin
B0051QVF7A
>>> # print the reviews on this first page
>>> rs.ids
['R3MF0NIRI3BT1E', 'R3N2XPJT4I1XTI', 'RWG7OQ5NMGUMW', 'R1FKKJWTJC4EAP', 'RR8NWZ0IXWX7K', 'R32AU655LW6HPU', 'R33XK7OO7TO68E', 'R3NJRC6XH88RBR', 'R21JS32BNNQ82O', 'R2C9KPSEH78IF7']
>>> rs.url
http://www.amazon.com/product-reviews/B0051QVF7A/ref=cm_cr_pr_top_sort_recent?&sortBy=bySubmissionDateDescending
>>> # iterate over reviews on this page only
>>> for r in rs.brief_reviews:
...     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...
>>> # iterate over all brief reviews on all pages
>>> for r in rs:
...     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'

Versus detailed reviews:

>>> rs = amzn.reviews(ItemId='B0051QVF7A')
>>> # this will iterate over all reviews on all pages
>>> # each review will require a download as it is on a separate page
>>> for r in rs.full_reviews():
...     print(r.id)
'R3MF0NIRI3BT1E'
'R3N2XPJT4I1XTI'
'RWG7OQ5NMGUMW'
...

I appreciate it can be annoying to go through the long README, but I don't have the time to make proper documentation at the moment.

Again, if you have any questions, please feel free to open an issue. There's no chat functionality on GitHub, so I think it's fine to open an issue instead. My libraries are not popular enough to show up on Stack Overflow.

mattrocklage commented 8 years ago

Sorry about that. When I asked it to print the URL alongside each review's text, it printed the individual review's URL, so I assumed it was still fetching each review page separately. It sounds like that's not the case for code like this, for instance:

    p = amzn.lookup(ItemId='1582348251')
    rs = p.reviews()
    for review in rs:
        print review.text
        print review.url

Thanks again for all of your help!

adamlwgriffiths commented 8 years ago

Correct, that's the URL you can use to get the individual review. I guess from the point of view of serialisation it could be confusing, as the de-serialised version would result in the actual review page rather than the paginated version that was used to create it. But the paginated version will be in flux: as reviews are added or upvoted, they will shift position.

But I digress =P

Glad to be of help. I'll close this and create a new issue for the captcha detection.