DavidRoldan523 / amazon_reviews_allpages

Script to scrape all reviews on all Amazon pages

Only prints IN PROCESS FOR #5

Open ryanjkirkland opened 4 years ago

ryanjkirkland commented 4 years ago

When I run the script, it prints "IN PROCESS FOR: B00JD242MS" and then exits immediately. Is this expected behavior?

StephenHnilica commented 4 years ago

Having the same issue.

Penlo commented 4 years ago

Looks like Amazon changed the path to the total-review-count element as well as how the text data is formatted. This causes the number_reviews variable in get_header to come back empty, so the script thinks the ASIN has no reviews and stops.

Here is how I fixed it.

Just replace your whole get_header function with this one. If you want to see exactly what changed, run the two versions through an online diff tool or your text editor's built-in diff plugin.

Let me know if this helps you.

# These imports may already be at the top of the original script;
# they are included here so the function is self-contained.
import re
import urllib3
from lxml import html
from requests import get

def get_header(asin):
    try:
        ratings_dict = {}
        amazon_url = 'https://www.amazon.com/product-reviews/' + asin + '/ref=cm_cr_arp_d_paging_btm_next_1?pageNumber=1'
        urllib3.disable_warnings()
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}
        response = get(amazon_url, headers=headers, verify=False, timeout=60)

        # Remove null bytes from the response before parsing.
        cleaned_response = response.text.replace('\x00', '')
        parser_to_html = html.fromstring(cleaned_response)

        # The review count now lives under the data-hook="total-review-count" node,
        # and its text contains words and separators, so strip everything that is not a digit.
        number_reviews = ''.join(parser_to_html.xpath('.//*[@data-hook="total-review-count"]//text()'))
        number_reviews_cleaned = re.sub('[^0-9]', '', number_reviews)

        product_price = ''.join(parser_to_html.xpath('.//span[contains(@class,"a-color-price arp-price")]//text()')).strip()
        product_name = ''.join(parser_to_html.xpath('.//a[@data-hook="product-link"]//text()')).strip()

        # Build the star-rating histogram, e.g. {'5 star': '70%', ...}.
        total_ratings = parser_to_html.xpath('//table[@id="histogramTable"]//tr')
        print(number_reviews_cleaned)
        for ratings in total_ratings:
            extracted_rating = ratings.xpath('./td//a//text()')
            # Only rows with both a label and a value are usable.
            if len(extracted_rating) >= 2:
                rating_key = extracted_rating[0]
                rating_value = extracted_rating[1]
                if rating_key:
                    ratings_dict.update({rating_key: rating_value})

        # Estimate how many review pages there are (10 reviews per page),
        # padded by one or two pages so the last page is not skipped.
        number_page_reviews = int(int(number_reviews_cleaned) / 10)

        if number_page_reviews % 2 == 0:
            number_page_reviews += 1
        else:
            number_page_reviews += 2

        return product_price, product_name, number_reviews_cleaned, ratings_dict, number_page_reviews
    except Exception as e:
        return {"url": amazon_url, "error": e}
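As a quick sanity check after swapping the function in (this call is just an illustration, not part of the script), you can run it against the ASIN from the original report and print what comes back; on failure the function returns a dict with the error instead of the header tuple:

result = get_header('B00JD242MS')
if isinstance(result, dict) and 'error' in result:
    # The request or parsing failed; the dict carries the URL and the exception.
    print('Request failed:', result['error'])
else:
    price, name, review_count, ratings, pages = result
    print(name, price, review_count, pages, ratings)
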
tommasoferri1 commented 4 years ago

Hi @Penlo , I'm new to both web scraping and Python. I'm still getting the IN PROCESS FOR [ASIN] message and no file output despite your fix. Can you help me fix this issue? I'm trying these ASINs: B00YCP71VK, B000TGDGLU and B07JMNYK7X

Thanks, Tom

Penlo commented 4 years ago

@tommasoferri1 I found that the actual problem is the response. Amazon now has bot protection on these pages. If you print the raw response you will see that it's a notification page letting you know that they are sorry, but they cannot verify you are human.

The only way I found to get around it was to use the Selenium library with Chromium; unfortunately, this means a completely new workflow. A rough sketch is below.
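
For anyone who wants to try it, something like the following is the general idea. This is only a sketch, not the code from this repo, and it assumes Chromium/Chrome plus a matching chromedriver are installed and on your PATH. The parsed tree it returns can be fed to the same lxml XPath code used in get_header.

from lxml import html
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def fetch_review_page(asin, page_number=1):
    # Drive a real browser so Amazon serves the actual review page
    # instead of the robot-check notification page.
    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        url = ('https://www.amazon.com/product-reviews/' + asin +
               '/ref=cm_cr_arp_d_paging_btm_next_1?pageNumber=' + str(page_number))
        driver.get(url)
        # page_source is the rendered HTML of the loaded page.
        return html.fromstring(driver.page_source)
    finally:
        driver.quit()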

If you need help or have any questions please feel free to PM me directly.

tommasoferri1 commented 4 years ago

Thanks!

I've found a few things using Selenium but nothing usable; do you have a GitHub repo to suggest?

For the moment ScrapeHero, a Chrome extension, is the best thing yet, but I cannot automate it to download reviews for multiple products.

Thanks for your help!

Bye


T
