Missing images in Wikipedia articles

WolfgangDpunkt commented 1 year ago

Environment

Operating System: debian (aarch64)
node --version: v17.9.0
npm --version: 8.18.0
yarn --version, if using Yarn:
percollate --version: v2.2.0

Description

When I convert Wikipedia articles to epubs with this otherwise great and very useful tool, some of the images get lost. An adblocker is not used in this environment.

Here is my command line: percollate epub --individual --output /home/_Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug

And here is the resulting epub. I had to zip it, as GitHub does not accept epub files: -Canada.epub.zip

And here's the direct comparison: in the "British North America" section, the web version has two images and the epub version has none.

The epub does contain some images, so percollate does not ignore all of them, only most. What could be the reason? Thanks a lot!

Here comes the debug log:

~# percollate epub --individual --output /home/_Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug
{
  command: 'epub',
  operands: [ 'https://en.wikipedia.org/wiki/Canada' ],
  opts: {
    individual: true,
    output: '/home/_Perco-Epubs/',
    debug: true
  }
}
Fetching: https://en.wikipedia.org/wiki/Canada ✓
Enhancing web page: https://en.wikipedia.org/wiki/Canada ✓
Saving EPUB...
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d9/Flag_of_Canada_%28Pantone%29.svg/125px-Flag_of_Canada_%28Pantone%29.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/en/thumb/4/4f/Coat_of_arms_of_Canada.svg/85px-Coat_of_arms_of_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/6/67/CAN_orthographic.svg/220px-CAN_orthographic.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Increase2.svg/11px-Increase2.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/9/92/Decrease_Positive.svg/11px-Decrease_Positive.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/b0/Nouvelle-France_map-en.svg/260px-Nouvelle-France_map-en.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/3/31/Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg/135px-Canada_WWI_l%27Emprunt_de_la_Victoire2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bf/Canada_WWI_Victory_Bonds2.jpg/136px-Canada_WWI_Victory_Bonds2.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Canada_topo.jpg/260px-Canada_topo.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/1/10/Canada_K%C3%B6ppen.svg/260px-Canada_K%C3%B6ppen.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/b/bd/Toronto_from_above_at_night.jpg/240px-Toronto_from_above_at_night.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/43/FTAs_with_Canada.svg/260px-FTAs_with_Canada.svg.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/2d/STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg/220px-STS-116_-_P5_Truss_hand-off_to_ISS_%28NASA_S116-E-05765%29.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/7/7d/Censusdivisions-ethnic.png/240px-Censusdivisions-ethnic.png
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e0/Statue_outside_Union_Station.jpg/170px-Statue_outside_Union_Station.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/e/e4/CBC_Radio_Canada_Chevrolet_Express_02.jpg/220px-CBC_Radio_Canada_Chevrolet_Express_02.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/8/86/O-Canada-1908.pdf/page1-170px-O-Canada-1908.pdf.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/2/26/Canada2010WinterOlympicsOTcelebration.jpg/220px-Canada2010WinterOlympicsOTcelebration.jpg
Fetching: https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/Sound-icon.svg/45px-Sound-icon.svg.png
1141364 total bytes, archive closed
Saved EPUB: /home/_Perco-Epubs/-Canada.epub

danburzo commented 1 year ago

Thanks @WolfgangDpunkt for the report. The issue should be fixed in version 2.2.1.

WolfgangDpunkt commented 1 year ago

Thank you very much! I have completed the update and progress is noticeable. Indeed, it now works with the example article "Canada" from the English Wikipedia. But, I'm afraid, the problem is not yet completely solved.

If you can find the patience to work on this problem further, I would be happy. Since there are hardly any other reliable tools to convert wiki articles to epub books via command line, I think the bug has a high relevance.

In this article, for example, almost all the pictures are missing: https://de.wikipedia.org/wiki/Wien

However, the problem does not seem to be specific to non-English versions of Wikipedia, because the epub of the English article "Vienna" includes many pictures but not all of them: https://en.wikipedia.org/wiki/Vienna#Culinary_specialities

The photo "Sachertorte" is missing in the epub, for example.

In fact, the debug log does not mention the filename of this photo either. For whatever reason, this photo is skipped during the download (https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG).

danburzo commented 1 year ago

Thanks for pointing out the broken pages; it will help with debugging. This is mostly Readability removing the images. I will investigate how to prevent that from happening.

danburzo commented 1 year ago

Seems that the HTML markup for images in Wikipedia is going to change soon: https://diff.wikimedia.org/2022/11/28/tech-news-2022-48/ (via @simevidas), so that may make handling them a bit easier.

danburzo commented 1 year ago

It turns out that there was more than one issue at play preventing one image or the other from being properly fetched/bundled:

on non-English Wikipedia pages, URLs pointing to what look like images but are in fact HTML pages were not excluded, due to the assumption that they'd match wiki/File:. In fact, the File: part of the URL is localized, so you could have Fișier: or Datei:. Thanks @vongrad for investigating and submitting a patch!
additionally, regexes for matching image URLs were scattered across the codebase, and one of them was unintentionally case-sensitive, meaning it didn't match filenames with uppercase extensions such as Sachertorte_DSC03027.JPG.
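
The two problems above can be illustrated with a minimal sketch. This is not percollate's actual code: the helper names (hasImageExtension, isWikiFilePage, isFetchableImage), the extension list, and the Datei:Beispiel.jpg URL are assumptions made up for the example. It shows one shared, case-insensitive extension pattern and a file-page check that does not hardcode the English File: prefix.

```javascript
// Hypothetical helpers for illustration only; not percollate's real API.

// A single shared pattern for image extensions. The `i` flag makes the
// match case-insensitive, so ".JPG" is accepted as well as ".jpg".
const IMAGE_EXTENSION = /\.(?:jpe?g|png|gif|svg|webp)$/i;

// True when the URL's path ends in a known image extension.
function hasImageExtension(url) {
  return IMAGE_EXTENSION.test(new URL(url).pathname);
}

// True when the URL looks like a wiki file-description page (an HTML page).
// Assumption: any "/wiki/<Namespace>:<name>" path ending in an image
// extension is such a page, whatever the localized namespace is
// (File:, Datei:, Fișier:, ...).
function isWikiFilePage(url) {
  const { pathname } = new URL(url);
  return /^\/wiki\/[^/:]+:/.test(pathname) && IMAGE_EXTENSION.test(pathname);
}

// An image worth fetching has an image extension and is not a file page.
function isFetchableImage(url) {
  return hasImageExtension(url) && !isWikiFilePage(url);
}

// The uppercase ".JPG" upload from this thread is accepted:
console.log(isFetchableImage(
  'https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG'
)); // true

// A localized file page (made-up example URL) is rejected:
console.log(isFetchableImage('https://de.wikipedia.org/wiki/Datei:Beispiel.jpg')); // false
```

Keeping the extension pattern in one place means a flag such as `i` only has to be right once, which is the kind of drift the scattered regexes allowed.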

There may be additional issues with Readability as mentioned in earlier comments, but I'm confident upgrading to percollate@4.0.2 will fix a lot of Wikipedia images.

WolfgangDpunkt commented 1 year ago

Dear @danburzo and @vongrad ,

I am very grateful for your work and your attention to my questions. This will help me a lot. I will test the new version as soon as possible. I appreciate your dedication very much. Kudos for how patiently you troubleshot this issue. Thank you very much!

danburzo / percollate

Missing images in Wikipedia articles #141
