charlotte-ruby / image_scraper

simple utility that pulls image URLs from a web page
MIT License

Image with a blank space #2

Closed jonathansimmons closed 12 years ago

jonathansimmons commented 12 years ago

This looks like a basic HTML issue, but it's causing a bad URI error:

url

http://www.amazon.com/Planet-Two-Disc-Digital-Combo-Blu-ray/dp/B004LWZW4W/ref=sr_1_1?s=movies-tv&ie=UTF8&qid=1324771542&sr=1-1

error

 (bad URI(is not URI?): %20http://g-ecx.images-amazon.com/images/G/01/SIMON/IsaacsonWalter._V164348457_.jpg):

faulty html sample

 <img height="300" src=" http://g-ecx.images-amazon.com/images/G/01/SIMON/IsaacsonWalter._V164348457_.jpg" style="float: right;" width="450"> 

It looks like the scraper throws the error when an image has a space as the first character of its src, which is apparently a common mistake on amazon.com.
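A minimal sketch of how the failure can be reproduced with Ruby's standard `uri` library alone. The `bad_uri?` helper is hypothetical and is not part of the gem; the exact error text may also differ from the trace above, which shows the space already escaped as `%20`.

```ruby
require "uri"

# The src value from the img tag above, including its leading space.
src = " http://g-ecx.images-amazon.com/images/G/01/SIMON/IsaacsonWalter._V164348457_.jpg"

# Hypothetical helper: true when URI.parse rejects the string.
def bad_uri?(value)
  URI.parse(value)
  false
rescue URI::InvalidURIError
  true
end

puts bad_uri?(src)        # leading space: the parse is rejected
puts bad_uri?(src.strip)  # same URL without the space: it parses
```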

I can give you the full trace if that would help, but it seems pretty straightforward. I played with a few lines, trying to get it to just ignore the image if the first character was blank, to no avail (I'm still new to Rails). Let me know what other information I can get you to help out.

johnmcaliley commented 12 years ago

Sorry for the delay. I didn't get the issue notification for some reason. I just pushed a new version of the gem that strips the whitespace from the img src. I added a test that includes the Amazon img tag you provided here, and it is passing, so this should work for you. I am trying to figure out whether I should just ignore all bad URLs that are collected and issue a warning message. What do you think?
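The whitespace fix described here could look roughly like the sketch below. This is an assumption for illustration, not the gem's actual code: `image_srcs` is a hypothetical helper, and it uses a simple regex where a real scraper would more likely use an HTML parser.

```ruby
require "uri"

# Hypothetical helper, not the gem's actual code: pull src values out of
# img tags with a simple regex and strip surrounding whitespace.
def image_srcs(html)
  html.scan(/<img\b[^>]*?\bsrc\s*=\s*"([^"]*)"/i).flatten.map(&:strip)
end

# The img tag reported in this issue, leading space included.
html = '<img height="300" src=" http://g-ecx.images-amazon.com/images/G/01/SIMON/IsaacsonWalter._V164348457_.jpg" style="float: right;" width="450">'

srcs = image_srcs(html)
uri = URI.parse(srcs.first) # parses once the leading space is stripped
puts uri.host
```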

jonathansimmons commented 12 years ago

Hey John, sorry for the delay. I updated the gem and it appears to be looking great.

To answer your question: in my case, if the image is bad (no URL, a stray space, or any number of other errors), I am fine to just ignore it. From what I've seen, bad images tend to be secondary images on the page and not the image of a product or something I'm really trying to capture.

I think skipping bad images and giving a warning would be a great idea. From a programming standpoint, it could be a crazy mess to keep fixing bugs like the last two on a one-off basis, given the infinite number of ways HTML devs could screw up an img tag.

My biggest thought, and this may be obvious, is that I would want it to either (a) just skip them altogether or (b) skip them and provide a warning, as long as our apps can move on past the warning. Maybe we could even develop against the warnings (e.g. if images.bad_urls, show "Sorry, some images couldn't be used").
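One way the skip-and-warn idea might be sketched, using only the standard library. `ScrapeResult`, `partition_image_urls`, and the `bad_urls` accessor are hypothetical names taken from the suggestion above, not an API the gem is known to provide, and the example.com URLs are placeholders.

```ruby
require "uri"

# Hypothetical result object: usable URLs plus the ones that were skipped.
ScrapeResult = Struct.new(:urls, :bad_urls)

# Split candidate src values into absolute http(s) URLs and rejects,
# warning about each reject instead of raising.
def partition_image_urls(candidates)
  result = ScrapeResult.new([], [])
  candidates.each do |raw|
    src = raw.to_s.strip
    begin
      uri = URI.parse(src)
      # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes.
      raise URI::InvalidURIError, "not an absolute http(s) URL" unless uri.is_a?(URI::HTTP)
      result.urls << src
    rescue URI::InvalidURIError => e
      warn "skipping bad image URL #{raw.inspect}: #{e.message}"
      result.bad_urls << raw
    end
  end
  result
end

images = partition_image_urls([
  " http://example.com/a.jpg",  # leading space, recoverable by stripping
  "http://exa mple.com/b.jpg",  # space inside the host, not recoverable
  "",                           # empty src
  nil                           # missing src
])
puts "Sorry, some images couldn't be used" unless images.bad_urls.empty?
```

Because the warnings go to standard error and the rejects are returned rather than raised, the calling app can carry on and decide for itself whether to show a message.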

Hopefully that all makes sense. Please do let me know if you'd like any other input. Our app is moving along nicely and we hope to have it done in the next 3-4 weeks. I'll keep you posted.


