danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
MIT License
4.2k stars 166 forks

Web pages cannot correctly identify and download image links. #174

Open jinshuqishi2019 opened 1 month ago

jinshuqishi2019 commented 1 month ago



Thank you for answering my question in "Images requiring a Referer header are not fetched". Another minor issue mentioned in that question has not yet been resolved.

chromium --headless --incognito --dump-dom https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate html --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ - --inline
chromium --headless --incognito --dump-dom https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate epub --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ -

With these commands, the EPUB and HTML outputs do not download the images. Even with the --inline option, the HTML output still points to the remote image URLs, whereas the PDF output displays the images correctly.


jinshuqishi2019 commented 1 month ago

I am currently using a workaround to get the images downloaded: I use sed to rewrite the image links in the HTML so that they end in a recognizable image extension (such as .png).

I tried asking ChatGPT how to solve this problem, and it suggested installing the Cheerio library to extract the image links. For images whose URL has no file extension, it also recommended using the mime-types library to map the MIME type from the response headers to a file extension.
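
The extension lookup suggested above can be sketched without extra dependencies. This is only an illustration of the idea, not percollate's implementation: the function names `extensionFromContentType` and `extensionForImage` and the small lookup tables are hypothetical, and in practice the mime-types package mentioned above would replace the hand-written MIME table. The sketch tries the URL path first, then the `wx_fmt` query parameter seen in the WeChat image URLs, and finally the `Content-Type` response header.

```javascript
// Hypothetical lookup tables; the mime-types package offers a fuller mapping.
const MIME_EXTENSIONS = {
  'image/png': '.png',
  'image/jpeg': '.jpg',
  'image/gif': '.gif',
  'image/webp': '.webp',
  'image/svg+xml': '.svg'
};
const WX_FMT_EXTENSIONS = { png: '.png', jpeg: '.jpg', jpg: '.jpg', gif: '.gif' };

// Map a Content-Type header value such as "image/jpeg; charset=binary"
// to a file extension, or null when the type is missing or unknown.
function extensionFromContentType(header) {
  if (!header) return null;
  const type = header.split(';')[0].trim().toLowerCase();
  return MIME_EXTENSIONS[type] || null;
}

// Pick an extension for an image URL: URL path first, then the wx_fmt
// query parameter, then the Content-Type header of the response.
function extensionForImage(imageUrl, contentType) {
  const url = new URL(imageUrl);
  const fromPath = url.pathname.match(/\.(png|jpe?g|gif|webp|svg)$/i);
  if (fromPath) {
    return '.' + fromPath[1].toLowerCase().replace('jpeg', 'jpg');
  }
  const fmt = (url.searchParams.get('wx_fmt') || '').toLowerCase();
  if (WX_FMT_EXTENSIONS[fmt]) {
    return WX_FMT_EXTENSIONS[fmt];
  }
  return extensionFromContentType(contentType);
}
```

For example, `extensionForImage('https://example.com/img/640?wx_fmt=jpeg')` resolves through the query parameter, while a URL with neither a suffix nor `wx_fmt` falls back to whatever `Content-Type` the server returned.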

danburzo commented 1 month ago

Thanks for the report. It seems that something is not hooked up correctly when the HTML content comes via the standard input. Will investigate!

jinshuqishi2019 commented 1 month ago

I have a question: how can I modify the following command to bundle multiple web pages into one EPUB? Thanks!

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw|sed -e 's/\(\?wx_fmt=png\)[^"]*/.png/gI' -e 's/\(\?wx_fmt=jpe\?g\)[^"]*/.jpg/gI' -e 's/\(\?wx_fmt=gif\)[^"]*/.gif/gI'| percollate epub https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw -
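
One possible approach, sketched here as a small Node script rather than a confirmed answer: render each page to a temporary HTML file, apply the same link rewriting, and hand all the files to a single percollate invocation. This assumes that percollate accepts local file paths as operands and bundles several operands into one EPUB by default; the names `rewriteWxFmt` and `bundle`, the temporary file layout, and the `bundle.epub` output name are hypothetical. Because the pages are read from files instead of stdin, no base URL is supplied, so this relies on the image URLs in the pages being absolute.

```javascript
const { execFileSync } = require('node:child_process');
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// JavaScript equivalent of the three sed expressions in the command above:
// replace "?wx_fmt=<format>..." up to the closing quote with a file extension.
function rewriteWxFmt(html) {
  return html
    .replace(/\?wx_fmt=png[^"]*/gi, '.png')
    .replace(/\?wx_fmt=jpe?g[^"]*/gi, '.jpg')
    .replace(/\?wx_fmt=gif[^"]*/gi, '.gif');
}

// Render every URL with headless Chromium, save the rewritten HTML to a
// temporary file, then pass all files to one percollate call.
function bundle(urls, output) {
  const dir = fs.mkdtempSync(path.join(os.tmpdir(), 'pages-'));
  const files = urls.map((url, index) => {
    const html = execFileSync(
      'chromium',
      ['--headless', '--incognito', '--dump-dom', url],
      { encoding: 'utf8', maxBuffer: 64 * 1024 * 1024 }
    );
    const file = path.join(dir, `page-${index}.html`);
    fs.writeFileSync(file, rewriteWxFmt(html));
    return file;
  });
  // Assumption: multiple operands end up in a single EPUB.
  execFileSync('percollate', ['epub', '--output', output, ...files], {
    stdio: 'inherit'
  });
}

// Only run when URLs are given on the command line.
const urls = process.argv.slice(2);
if (urls.length > 0) {
  bundle(urls, 'bundle.epub');
}
```

Saved under a hypothetical name such as `bundle.js`, it would be run as `node bundle.js <url1> <url2>`, with the pages appearing in the EPUB in the order given.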