Open WolfgangDpunkt opened 1 year ago
Thanks @WolfgangDpunkt for the report, the issue should be fixed in version 2.2.1
Thank you very much! I have completed the update and progress is noticeable. Indeed, it now works with the example article "Canada" from the English Wikipedia. But, I'm afraid, the problem is not yet completely solved.
If you can find the patience to work on this problem further, I would be happy. Since there are hardly any other reliable tools to convert wiki articles to epub books via command line, I think the bug has a high relevance.
In this article, for example, almost all the pictures are missing: https://de.wikipedia.org/wiki/Wien
However, the problem does not seem to be specific to non-English versions of Wikipedia: in the English article "Vienna", a lot of pictures are included in the epub, but not all of them: https://en.wikipedia.org/wiki/Vienna#Culinary_specialities
The photo "Sachertorte" is missing in the epub, for example:
In fact, the debug log does not mention the filename of this photo either; for whatever reason, this photo is ignored during the download (https://upload.wikimedia.org/wikipedia/commons/b/b8/Sachertorte_DSC03027.JPG).
Thanks for pointing out the broken pages, it will help out with debugging. This is mostly Readability removing the images, I will investigate how to prevent that from happening.
Seems that the HTML markup for images in Wikipedia is going to change soon: https://diff.wikimedia.org/2022/11/28/tech-news-2022-48/ (via @simevidas), so that may make handling them a bit easier.
It turns out that there was more than one issue at play preventing one image or the other from being properly fetched/bundled:
wiki/File:. In fact, the File: part of the URL is localized, so you could have Fișier: or Datei:. Thanks @vongrad for investigating and submitting a patch!
Sachertorte_DSC03027.JPG.
There may be additional issues with Readability as mentioned in earlier comments, but I'm confident upgrading to percollate@4.0.2 will fix a lot of Wikipedia images.
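To illustrate the localized-namespace point, here is a minimal sketch of how a link to a Wikipedia file page could be recognized without hard-coding the File: prefix. This is not percollate's actual code: the helper name isWikiFileLink, the extension list, and the placeholder base URL are all assumptions made for the example. It also compares file extensions case-insensitively, which is a guess prompted by the uppercase .JPG in the file name above rather than a confirmed cause.

```javascript
// Hypothetical helper for illustration only; not taken from percollate.
// The extension list is an assumption and is not exhaustive.
const IMAGE_EXTENSIONS = new Set(['jpg', 'jpeg', 'png', 'gif', 'svg', 'webp']);

function isWikiFileLink(href) {
  let pathname;
  try {
    // Resolve relative hrefs against a placeholder base so URL parsing works,
    // then decode percent-escapes such as %C8%99 (ș).
    pathname = decodeURIComponent(new URL(href, 'https://example.org').pathname);
  } catch {
    // Malformed URL or malformed percent-encoding.
    return false;
  }
  // Accept /wiki/<any namespace>:<name>.<ext>, so File:, Datei:, Fișier:
  // and other localized prefixes are treated alike.
  const match = pathname.match(/^\/wiki\/[^/:]+:.+\.([a-z0-9]+)$/i);
  return match !== null && IMAGE_EXTENSIONS.has(match[1].toLowerCase());
}
```

For example, isWikiFileLink('/wiki/Datei:Wien.jpg') and isWikiFileLink('/wiki/File:Sachertorte_DSC03027.JPG') both return true, while isWikiFileLink('/wiki/Vienna') returns false. Checking the extension instead of the namespace name avoids maintaining a list of translations, at the cost of also accepting any other namespace whose page title happens to end in an image extension.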
Dear @danburzo and @vongrad ,
I am very grateful for your work and your attention to my questions. This will help me a lot. I will test the new version as soon as possible. I appreciate your dedication very much. Kudos for how patiently you troubleshot this issue. Thank you very much!
Environment
node --version: v17.9.0
npm --version: 8.18.0
yarn --version, if using Yarn:
percollate --version: v2.2.0
Description
When I convert Wikipedia articles to epubs with this otherwise great and very useful tool, some of the images get lost. An adblocker is not used in this environment.
Here is my command line
percollate epub --individual --output /home/Perco-Epubs/ https://en.wikipedia.org/wiki/Canada --debug
And here is the resulting epub. I had to zip it, as GitHub does not accept epub files: Canada.epub.zip
And here's the direct comparison: in the "British North America" section, the web version has two images, the epub version zero.
There are indeed images in the epub, percollate does not ignore all images, but most of them. What could be the reason? Thanks a lot!
Here comes the debug log: