fishy / url2epub

Create ePub files from URLs
BSD 3-Clause "New" or "Revised" License

Images not showing on sanjosespotlight.com #2

Closed: fishy closed this issue 3 years ago

fishy commented 3 years ago

Example URL: https://sanjosespotlight.com/san-jose-legends-rod-diridon-launched-the-citys-light-rail-but-got-into-transit-by-accident/

This is actually an interesting case. The site doesn't use AMP, and their images look like this:

<img loading="lazy"   width="1024" height="683" alt="" data-src="https://sanjosespotlight.s3.us-east-2.amazonaws.com/wp-content/uploads/2020/12/26233502/Rod-Diridon-3.jpeg-1024x683.jpg" class="size-large wp-image-66602 lazyload" src="data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="><noscript><img loading="lazy" src="https://sanjosespotlight.s3.us-east-2.amazonaws.com/wp-content/uploads/2020/12/26233502/Rod-Diridon-3.jpeg-1024x683.jpg" class="size-large wp-image-66602" width="1024" height="683" alt=""></noscript>

So basically they put an inline placeholder image in src and lazy-load the actual image (the URL in data-src) asynchronously, with a noscript block as the fallback.

I think a potential solution is to look for a noscript -> img tag inside the outer img tag and use it if it's there.

fishy commented 3 years ago

OK, since they don't close the outer img tag, Go's HTML parser doesn't treat the noscript tag as a child of the outer img tag, so this approach doesn't work. Closing as won't fix.
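
For reference, here is a different approach from the noscript one discussed above, and not something url2epub is known to implement: since the outer img tag in the example already carries the real URL in data-src, a converter could prefer data-src whenever src is missing or is an inline data: placeholder. The sketch below is a minimal illustration under that assumption. The Attr type and the pickImageSrc function are hypothetical names made up for this example (Attr only mimics the key/value shape of a parsed HTML attribute), and other sites may use different attribute names.

```go
package main

import (
	"fmt"
	"strings"
)

// Attr is a hypothetical stand-in for a parsed HTML attribute
// (a key/value pair). It is not a type from any real library.
type Attr struct {
	Key string
	Val string
}

// pickImageSrc is a hypothetical helper. It returns data-src when that
// attribute is set and src is either empty or an inline data: URI
// (the placeholder pattern from the example above). Otherwise it
// returns src unchanged.
func pickImageSrc(attrs []Attr) string {
	var src, dataSrc string
	for _, a := range attrs {
		switch strings.ToLower(a.Key) {
		case "src":
			src = a.Val
		case "data-src":
			dataSrc = a.Val
		}
	}
	if dataSrc != "" && (src == "" || strings.HasPrefix(src, "data:")) {
		return dataSrc
	}
	return src
}

func main() {
	// Attributes taken from the outer img tag in the example above.
	attrs := []Attr{
		{Key: "loading", Val: "lazy"},
		{Key: "data-src", Val: "https://sanjosespotlight.s3.us-east-2.amazonaws.com/wp-content/uploads/2020/12/26233502/Rod-Diridon-3.jpeg-1024x683.jpg"},
		{Key: "src", Val: "data:image/gif;base64,R0lGODlhAQABAAAAACH5BAEKAAEALAAAAAABAAEAAAICTAEAOw=="},
	}
	fmt.Println(pickImageSrc(attrs))
}
```

Images that already have a regular src are left untouched, so the check only changes behavior for the placeholder pattern.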