drawrowfly / amazon-product-api

Amazon Scraper. Scrape products from the amazon search result or reviews from the specific product

Image Url's Malformed #67

Open aintHuman opened 2 years ago

aintHuman commented 2 years ago

Some book image URLs are not being extracted correctly by the scraper.

For instance:

https://www.amazon.com/dp/022640613X

The main image url is:

https://images-na.ssl-images-amazon.com/images/I/51q62HOcxZL._SX347_BO1,204,203,200_.jpg

However, the output from the scrape is as follows:

https://images-na.ssl-images-amazon.com/images/I/images-na.jpg

I believe the issue is in this block (lines 1441 - 1457 of Amazon.js):

/**
 * If for example book item does have only one image
 * then {imageGalleryData} won't exist and we will use different way of extracting required data
 * Product types: books
 */
if (!images.length) {
    const imageData = $('#imgBlkFront')[0] || $('#ebooksImgBlkFront')[0];
    if (imageData) {
        const data = imageData.attribs['data-a-dynamic-image'];
        const json = JSON.parse(data);      // << --------------------------- PROBLEM HERE
        const keys = Object.keys(json);
        const imageIdregex = /\/([\w-+]{9,13})\./.exec(keys[0]);
        if (imageIdregex) {
            images.push(`https://images-na.ssl-images-amazon.com/images/I/${imageIdregex[1]}.jpg`);
        }
    }
}

Specifically, in reference to the above example, the following is the 'data-a-dynamic-image' attribute:

"{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}"

I believe it is not being parsed correctly because the string is surrounded by double quotes (") and the keys inside it also use double quotes.

Here is the output when I try to parse it with Node.js in a terminal:

> JSON.parse("{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}")
JSON.parse("{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}")
           ^^^

Uncaught SyntaxError: missing ) after argument list
> 

I am not sure what the best approach would be. Perhaps a pre-processing step could convert the first and last double quote (if present) to single quotes, since the JSON standard mandates double quotes for the keys inside.

The following parses as expected; the only difference is that the first and last double quotes have been converted to single quotes:

JSON.parse('{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}')
{
  'https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg': [ 333, 499 ],
  'https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg': [ 231, 346 ]
}

I have scraped about 650 books and this has occurred about 20 times, so roughly 3% of the time, assuming my rather small sample is representative of the entire Amazon store.

benjaminvanrenterghem commented 2 years ago

The HTML for the ASIN in the README has this:

data-a-dynamic-image="{"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX522_.jpg":[522,522],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX342_.jpg":[342,342],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX679_.jpg":[679,679],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX425_.jpg":[425,425],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX466_.jpg":[466,466],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX569_.jpg":[569,569],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX385_.jpg":[385,385]}"

Your ASIN has this in HTML:

data-a-dynamic-image="{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}"

I fail to see how that's any different, other than that yours has commas in the image name and a different hostname at the start of the URL. The actual problem is that the regex matches images-na in the hostname because of the different URL, so the check that follows passes and pushes the wrong URL.

I made a fiddle: https://jsfiddle.net/ndmt0f1z/

Regex output for your asin:

["/images-na.", "images-na"]

Regex output for asin with media-amazon domain:

["/71UItVa0VmL.", "71UItVa0VmL"]
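For anyone who cannot open the fiddle, the two outputs above can be reproduced in Node.js with a short sketch. The regex is copied from the block quoted earlier in this thread; the variable names are only illustrative.

```javascript
// Regex copied from the block quoted earlier in this thread.
const imageIdRegex = /\/([\w-+]{9,13})\./;

const imagesNaUrl = 'https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg';
const mediaAmazonUrl = 'https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX522_.jpg';

// The hostname label "images-na" is 9 characters long, sits right after the
// second "/" of "https://" and is followed by ".", so it is the first match.
const imagesNaMatch = imageIdRegex.exec(imagesNaUrl);
console.log(imagesNaMatch[0], imagesNaMatch[1]); // /images-na. images-na

// The hostname label "m" is too short for {9,13}, so the first match
// is the image id in the path.
const mediaAmazonMatch = imageIdRegex.exec(mediaAmazonUrl);
console.log(mediaAmazonMatch[0], mediaAmazonMatch[1]); // /71UItVa0VmL. 71UItVa0VmL
```

exec() returns the leftmost match, which is why the hostname wins over the image id whenever its first label happens to be 9 to 13 characters long.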

So in conclusion, your hunch about that block was right, but the cause is not the quotes; it is the different image domain, which makes the regex pick the wrong part of the URL.
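One possible way to avoid the hostname match is to anchor the regex on the /images/I/ path segment. This is only a sketch, not a patch taken from the repository: extractImageId is a hypothetical helper, and it assumes that image URLs keep the .../images/I/<id>.<modifiers>.jpg layout seen in the two examples in this thread.

```javascript
// Hypothetical helper, not part of amazon-product-api.
// Assumes the ".../images/I/<id>.<modifiers>.jpg" URL layout seen in this thread.
function extractImageId(dynamicImageAttribute) {
    // The attribute value arrives as a plain string, so JSON.parse handles it as-is.
    const json = JSON.parse(dynamicImageAttribute);
    const keys = Object.keys(json);
    if (!keys.length) {
        return null;
    }
    // Anchoring on "/images/I/" means a hostname label can never be captured.
    const match = /\/images\/I\/([\w+-]{9,13})\./.exec(keys[0]);
    return match ? match[1] : null;
}

// Attribute values shortened from the two examples in this thread.
const imagesNaAttribute =
    '{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499]}';
const mediaAmazonAttribute =
    '{"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX522_.jpg":[522,522]}';

console.log(extractImageId(imagesNaAttribute)); // 51jxtcapGSL
console.log(extractImageId(mediaAmazonAttribute)); // 71UItVa0VmL
```

Both domains now yield the image id. If Amazon serves images under a different path, the helper returns null instead of a wrong id, which is easier to detect downstream.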