danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
https://danburzo.ro/projects/percollate/
MIT License
4.26k stars, 166 forks

Missing images in Wikipedia articles #141

Open WolfgangDpunkt opened 1 year ago

WolfgangDpunkt commented 1 year ago

Environment

Description

When I convert Wikipedia articles to epubs with this otherwise great and very useful tool, some of the images get lost. An adblocker is not used in this environment.

Here is my command line:

percollate epub --individual --output /home/Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug

And here is the resulting epub. I had to zip it because GitHub does not accept epub files: -Canada.epub.zip

And here is a direct comparison: in the "British North America" section, the web version has two images and the epub version has none.

(Screenshot, 2022-10-13 at 10:14:06)

The epub does contain some images, so percollate is not ignoring all of them, only most. What could be the reason? Thanks a lot!
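For anyone who wants to check which images actually ended up in the file: an EPUB is a ZIP container, so its entries can be listed directly. The following is only a minimal sketch. The function name `list_epub_images` and the set of file suffixes are illustrative assumptions, not part of percollate, and the internal folder layout of the EPUB is not assumed at all.

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: these suffixes cover the image types of interest here.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp"}


def list_epub_images(epub):
    """Return the image entries stored in an EPUB, sorted by name.

    `epub` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(epub) as archive:
        return sorted(
            name
            for name in archive.namelist()
            # Compare suffixes case-insensitively so ".JPG" also matches.
            if PurePosixPath(name).suffix.lower() in IMAGE_SUFFIXES
        )
```

Calling `list_epub_images("/home/_Perco-Epubs/-Canada.epub")` would return the bundled image entries, which can then be compared by eye against the images visible in the web version.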

Here comes the debug log:

~# percollate epub --individual --output /home/_Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug
{
  command: 'epub',
  operands: [ 'https://en.wikipedia.org/wiki/Canada' ],
  opts: {
    individual: true,
    output: '/home/_Perco-Epubs/',
    debug: true
  }
}
Fetching: https://en.wikipedia.org/wiki/Canada ✓
Enhancing web page: https://en.wikipedia.org/wiki/Canada ✓
Saving EPUB...
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/125px-Flag_of_Canada_%28Pantone%29.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/en/thumb/4/4f/Coat_of_arms_of_Canada.svg/85px-Coat_of_arms_of_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/CAN_orthographic.svg/220px-CAN_orthographic.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/Decrease_Positive.svg/11px-Decrease_Positive.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Nouvelle-France_map-en.svg/260px-Nouvelle-France_map-en.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg/135px-Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Canada_WWI_Victory_Bonds2.jpg/136px-Canada_WWI_Victory_Bonds2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Canada_topo.jpg/260px-Canada_topo.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Canada_K%C3%B6ppen.svg/260px-Canada_K%C3%B6ppen.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Toronto_from_above_at_night.jpg/240px-Toronto_from_above_at_night.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/FTAs_with_Canada.svg/260px-FTAs_with_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/2d/STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg/220px-STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Censusdivisions-ethnic.png/240px-Censusdivisions-ethnic.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Statue_outside_Union_Station.jpg/170px-Statue_outside_Union_Station.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/CBC_Radio_Canada_Chevrolet_Express_02.jpg/220px-CBC_Radio_Canada_Chevrolet_Express_02.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/O-Canada-1908.pdf/page1-170px-O-Canada-1908.pdf.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Canada2010WinterOlympicsOTcelebration.jpg/220px-Canada2010WinterOlympicsOTcelebration.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Sound-icon.svg/45px-Sound-icon.svg.png
1141364 total bytes, archive closed
Saved EPUB: /home/_Perco-Epubs/-Canada.epub
danburzo commented 1 year ago

Thanks @WolfgangDpunkt for the report. The issue should be fixed in version 2.2.1.

WolfgangDpunkt commented 1 year ago

Thank you very much! I have completed the update and progress is noticeable. Indeed, it now works with the example article "Canada" from the English Wikipedia. But, I'm afraid, the problem is not yet completely solved.

If you can find the patience to work on this problem further, I would be happy. Since there are hardly any other reliable tools to convert wiki articles to epub books via command line, I think the bug has a high relevance.

In this article, for example, almost all the pictures are missing: https://de.wikipedia.org/wiki/Wien

However, this does not seem to be a problem specific to the non-English editions of Wikipedia. In the English article "Vienna", many pictures are included in the epub, but not all of them: https://en.wikipedia.org/wiki/Vienna#Culinary_specialities

The photo "Sachertorte" is missing in the epub, for example:

(Screenshot, 2022-10-17 at 09:23:52)

In fact, the debug log does not mention the file name of this photo either. For whatever reason, the photo is skipped during the download (https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG).
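A check like this can be repeated for other photos by saving the `--debug` output to a text file and searching its "Fetching:" lines. This is a rough sketch only. It assumes the log keeps the `Fetching: <url>` line format shown in the first comment, and the helper names `fetched_urls` and `was_fetched` are made up for illustration.

```python
from urllib.parse import unquote, urlsplit


def fetched_urls(debug_log):
    """Collect the URLs from lines of the form 'Fetching: <url>' in a debug log."""
    urls = []
    for line in debug_log.splitlines():
        line = line.strip()
        if line.startswith("Fetching:"):
            # Keep only the first token so a trailing check mark is dropped.
            parts = line[len("Fetching:"):].split()
            if parts:
                urls.append(parts[0])
    return urls


def was_fetched(debug_log, file_name):
    """True if the last path segment of any fetched URL contains `file_name`.

    The match is case-insensitive and works on the percent-decoded URL path,
    so thumbnail names such as '220px-<file_name>' also count as a hit.
    """
    needle = file_name.lower()
    for url in fetched_urls(debug_log):
        last_segment = unquote(urlsplit(url).path).rsplit("/", 1)[-1]
        if needle in last_segment.lower():
            return True
    return False
```

With the log text read into a string, `was_fetched(log_text, "Sachertorte_DSC03027.JPG")` returning `False` would confirm that the photo was never requested.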

danburzo commented 1 year ago

Thanks for pointing out the broken pages; they will help with debugging. This is mostly Readability removing the images. I will investigate how to prevent that from happening.
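To narrow down which images disappear during processing, one option is to compare the image file names in the original page HTML with those in the processed HTML (for example, the output of percollate's HTML conversion, assuming it keeps the original file names). The sketch below uses only the Python standard library and does not call Readability itself. The names `image_names` and `dropped_images` are hypothetical, and a missing name only shows that the image was lost somewhere in the pipeline, not at which step.

```python
from html.parser import HTMLParser
from urllib.parse import unquote, urlsplit


class _ImageCollector(HTMLParser):
    """Records the src attribute of every <img> element it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_names(html):
    """Return the set of image file names (last URL path segment) in `html`."""
    parser = _ImageCollector()
    parser.feed(html)
    parser.close()
    return {
        unquote(urlsplit(src).path).rsplit("/", 1)[-1]
        for src in parser.sources
    }


def dropped_images(original_html, processed_html):
    """List image file names present in the original but not in the processed HTML."""
    return sorted(image_names(original_html) - image_names(processed_html))
```

Comparing by file name rather than by full URL is a deliberate simplification, since the same image may appear with a protocol-relative URL in one document and an absolute URL in the other.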

danburzo commented 1 year ago

It seems that the HTML markup for images in Wikipedia is going to change soon: https://diff.wikimedia.org/2022/11/28/tech-news-2022-48/ (via @simevidas), so that may make handling them a bit easier.

danburzo commented 1 year ago

It turns out that there was more than one issue at play preventing one image or the other from being properly fetched/bundled:

There may be additional issues with Readability as mentioned in earlier comments, but I'm confident upgrading to percollate@4.0.2 will fix a lot of Wikipedia images.

WolfgangDpunkt commented 1 year ago

Dear @danburzo and @vongrad,

I am very grateful for your work and your attention to my questions. This will help me a lot. I will test the new version as soon as possible. I appreciate your dedication very much. Kudos for how patiently you troubleshot this issue. Thank you very much!