danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
https://danburzo.ro/projects/percollate/
MIT License
4.25k stars, 165 forks

Use largest available size for images in Wikipedia articles #42

Open danburzo opened 5 years ago

danburzo commented 5 years ago

The idea of the imagesAtFullSize enhancement is to get the largest available image from blogs using Blogspot, WordPress, and the like:

https://github.com/danburzo/percollate/blob/3506b370fc1d54b9039a1f104c20defda7859eb8/src/enhancements.js#L1-L20

However, Wikipedia images are an exception:

<a href="/wiki/File:Perkulator.jpg" class="image">
  <img alt="" src="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/250px-Perkulator.jpg" class="thumbimage" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/375px-Perkulator.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/3/3a/Perkulator.jpg/500px-Perkulator.jpg 2x" data-file-width="1944" data-file-height="2592" width="250" height="333">
</a>

They link to what looks like an image file, but is in fact an HTML page for that image. How can we handle this situation gracefully?

bekicot commented 5 years ago

Here we go: https://upload.wikimedia.org/wikipedia/commons/3/3a/Perkulator.jpg Remove the thumb segment (and the trailing size-prefixed file name) from the URL :)

May I help with it?
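
For illustration, the URL rewrite suggested in this comment could look roughly like the sketch below. The function name `wikimediaOriginalUrl` is hypothetical and not part of percollate, and the code assumes thumbnails always follow the `.../thumb/<path>/<File>/<width>px-<File>` layout seen in the markup quoted earlier in the thread. It is a heuristic, not a documented guarantee.

```javascript
// Hypothetical helper (not percollate's code): derive the original file URL
// from a Wikimedia thumbnail URL by dropping the "/thumb/" segment and the
// trailing "<width>px-<File>" segment.
function wikimediaOriginalUrl(src) {
  const marker = '/thumb/';
  const markerAt = src.indexOf(marker);

  // Leave anything that does not look like a Wikimedia thumbnail untouched.
  if (!src.includes('upload.wikimedia.org/') || markerAt === -1) {
    return src;
  }

  // The size-prefixed file name is the last path segment; it must come
  // after the "/thumb/" marker for the rewrite to make sense.
  const lastSlash = src.lastIndexOf('/');
  if (lastSlash < markerAt + marker.length) {
    return src;
  }

  const withoutSize = src.slice(0, lastSlash);
  return (
    withoutSize.slice(0, markerAt) +
    '/' +
    withoutSize.slice(markerAt + marker.length)
  );
}
```

Applied to the `src` from the example markup, this yields the same path as the URL posted above, with the protocol-relative `//` prefix preserved.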

danburzo commented 5 years ago

@bekicot sure thing! I looked into it a bit and apparently the "canonical" way to get the image's original URL is to make a query to the Wikipedia API:

https://en.wikipedia.org/w/api.php?action=query&titles=File:Albert_Einstein_(Nobel).png&prop=imageinfo&iiprop=url&format=json
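
A minimal sketch of how that query might be built and its response read. The helper names `imageInfoUrl` and `originalUrlFromResponse` are hypothetical, and the response shape (`query.pages.<id>.imageinfo[0].url`) is assumed from the API's default JSON output for `prop=imageinfo&iiprop=url`.

```javascript
// Hypothetical helper: build the imageinfo query URL for a "File:..." title.
// The host defaults to the English Wikipedia, as in the example URL above.
function imageInfoUrl(fileTitle, host = 'en.wikipedia.org') {
  const params = new URLSearchParams({
    action: 'query',
    titles: fileTitle,
    prop: 'imageinfo',
    iiprop: 'url',
    format: 'json'
  });
  return `https://${host}/w/api.php?${params}`;
}

// Hypothetical helper: pull the original image URL out of the parsed JSON.
// Returns null when the response has no usable imageinfo entry.
function originalUrlFromResponse(json) {
  const pages = (json && json.query && json.query.pages) || {};
  for (const page of Object.values(pages)) {
    const info = page.imageinfo && page.imageinfo[0];
    if (info && info.url) {
      return info.url;
    }
  }
  return null;
}
```

The two pieces are kept separate so the parsing can be exercised without network access; in practice the URL would be fetched with whatever HTTP client the tool already uses and the parsed body passed to `originalUrlFromResponse`.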

Maybe a good first step is just making imagesAtFullSize ignore wiki image files?
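
One way such a check might look, as a sketch only: the predicate name `isWikiFilePage` is hypothetical, and it assumes that file description pages live under a `/wiki/File:` path, as in the markup quoted at the top of the thread. Localized namespace names are not handled.

```javascript
// Hypothetical predicate: does this link point to a wiki file description
// page (HTML) rather than to an actual image file?
function isWikiFilePage(href) {
  let pathname;
  try {
    // The base URL is a placeholder, used only to resolve relative links
    // such as "/wiki/File:Perkulator.jpg".
    pathname = decodeURIComponent(new URL(href, 'https://example.org').pathname);
  } catch (err) {
    return false; // unparseable link: treat as not a wiki file page
  }
  return /^\/wiki\/File:/i.test(pathname);
}
```

An enhancement that swaps an image's `src` for its parent link's `href` could then skip any link for which this returns `true`.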

danburzo commented 5 years ago

> Maybe a good first step is just making imagesAtFullSize ignore wiki image files?

I added this in the commit above.