Open aintHuman opened 2 years ago
The HTML for the asin in the readme has this:
data-a-dynamic-image="{"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX522_.jpg":[522,522],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX342_.jpg":[342,342],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX679_.jpg":[679,679],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX425_.jpg":[425,425],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX466_.jpg":[466,466],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX569_.jpg":[569,569],"https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX385_.jpg":[385,385]}"
Your ASIN has this in HTML:
data-a-dynamic-image="{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}"
I don't see how that is any different, other than that yours has commas in the image name and a different host at the start of the URL. The actual problem is that the regex returns images-na because of the different URL, so the next check gives you the wrong URL.
I made a fiddle: https://jsfiddle.net/ndmt0f1z/
Regex output for your asin:
["/images-na.", "images-na"]
Regex output for asin with media-amazon domain:
["/71UItVa0VmL.", "71UItVa0VmL"]
So in conclusion, your hunch was right, though the cause is not the quotes but the different image domain, with the regex picking the wrong part of the URL.
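One way to avoid matching the host name is to anchor the pattern on the `/images/I/` path segment that both URLs above share. This is a minimal sketch, assuming that segment is always present; `extractImageId` is a hypothetical helper and not the regex currently used in Amazon.js:

```javascript
// Hypothetical helper, not the library's code: pull the image id out of an
// Amazon image URL by anchoring on the "/images/I/" path segment, so the
// host name (images-na... or m.media-amazon...) can never be matched.
function extractImageId(url) {
    const match = /\/images\/I\/([^./]+)\./.exec(url);
    return match ? match[1] : null;
}

const oldDomain = 'https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg';
const newDomain = 'https://m.media-amazon.com/images/I/71UItVa0VmL._AC_SX522_.jpg';

console.log(extractImageId(oldDomain)); // 51jxtcapGSL
console.log(extractImageId(newDomain)); // 71UItVa0VmL
```

Returning `null` when nothing matches lets the caller skip the image instead of building a URL from a bad id.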
Some book image URLs are being scraped but fail to extract correctly
For instance:
https://www.amazon.com/dp/022640613X
The main image url is:
https://images-na.ssl-images-amazon.com/images/I/51q62HOcxZL._SX347_BO1,204,203,200_.jpg
However, the output from the scrape is as follows:
https://images-na.ssl-images-amazon.com/images/I/images-na.jpg
Now I believe the issue is to do with this block
(lines 1441 - 1457 of Amazon.js)
Specifically, in reference to the above example, the following is the 'data-a-dynamic-image' attribute:
"{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}"
I believe it is not being parsed correctly since the string is surrounded by rabbit ears (i.e. "), and the keys inside also use rabbit ears.
Here is the output if I try to parse using NodeJS in a terminal:
I am not sure what the best approach would be. Perhaps a pre-processing step that converts the first and last double quotes (if present) to single quotes, given that the JSON standard mandates double quotes:
This parses as expected; the only difference is that the first and last double quotes are converted to single quotes.
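For reference, a minimal sketch of that check in Node.js, assuming the attribute value has already been read out of the HTML as a plain string (the variable names are illustrative). In JavaScript source the outer single quotes are only string delimiters, so `JSON.parse` sees valid JSON whose keys are the image URLs:

```javascript
// Attribute value from the example above, written as a single-quoted
// JavaScript string literal so the inner double quotes need no escaping.
const raw = '{"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SX331_BO1,204,203,200_.jpg":[333,499],"https://images-na.ssl-images-amazon.com/images/I/51jxtcapGSL._SY344_BO1,204,203,200_.jpg":[231,346]}';

// Keys are image URLs; the values appear to be [width, height] pairs
// (an assumption based on the numbers in the example).
const images = JSON.parse(raw);
const urls = Object.keys(images);

console.log(urls.length); // 2
console.log(urls[0]);
```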
I have scraped about 650 books and this has occurred about 20 times, so roughly 3% of the time, assuming my rather small sample reflects the entire Amazon store.