danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
https://danburzo.ro/projects/percollate/
MIT License
4.2k stars, 166 forks

Web pages cannot correctly identify and download image links. #174

Open jinshuqishi2019 opened 1 month ago

jinshuqishi2019 commented 1 month ago

Description

Thank you for answering my earlier question, "Images requiring a Referer header are not fetched". Another minor issue mentioned in that question has not yet been resolved.

chromium --headless --incognito --dump-dom https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate html --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ - --inline
chromium --headless --incognito --dump-dom https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate epub --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ -

With these commands, the EPUB and HTML outputs do not contain the downloaded images. Even with the --inline option, the HTML output still points at the remote image URLs, whereas the PDF output displays the images correctly.

Thanks.

jinshuqishi2019 commented 1 month ago

As a workaround, I am currently using the sed command to rewrite the image links in the HTML so that they end in an image file extension (.png, .jpg, or .gif). With that change, the images are downloaded.
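For illustration, here is a minimal sketch of that rewrite, run on a single sample line instead of a real page. It assumes the image URLs carry a `?wx_fmt=...` query string. The function name `rewrite_wx_images` and the sample URL are made up for this example, and the patterns use extended regular expressions (`sed -E`), so they are case-sensitive.

```shell
#!/bin/sh
# Hypothetical helper: replace "?wx_fmt=<type>..." (up to the closing quote)
# with a plain file extension. Reads HTML on stdin, writes HTML to stdout.
rewrite_wx_images() {
  sed -E \
    -e 's/\?wx_fmt=png[^"]*/.png/g' \
    -e 's/\?wx_fmt=jpe?g[^"]*/.jpg/g' \
    -e 's/\?wx_fmt=gif[^"]*/.gif/g'
}

# Sample input line (made-up URL, for demonstration only).
printf '%s\n' '<img src="https://mmbiz.qpic.cn/x/640?wx_fmt=png&from=appmsg">' \
  | rewrite_wx_images
```

In a real pipeline the function would sit between the chromium and percollate commands, in place of the printf line.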

I also asked ChatGPT how to solve this problem, and it suggested installing the Cheerio library to extract the image links. For image URLs without a file extension, it recommended using the mime-types library to read the MIME type from the response headers and derive the file extension from it.
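The MIME-type idea can be sketched without Node.js as well. The snippet below is only a rough shell illustration of mapping a Content-Type value to a file extension. It does not use the mime-types library, and the function name `ext_for_mime` and its small lookup table are assumptions made for this example, not part of percollate.

```shell
#!/bin/sh
# Hypothetical helper: print a file extension for a Content-Type value.
# Returns a non-zero status for types it does not know.
ext_for_mime() {
  # Drop parameters such as "; charset=...", lowercase, strip whitespace.
  mime=$(printf '%s' "${1%%;*}" | tr '[:upper:]' '[:lower:]' | tr -d '[:space:]')
  case "$mime" in
    image/png)     echo png ;;
    image/jpeg)    echo jpg ;;
    image/gif)     echo gif ;;
    image/webp)    echo webp ;;
    image/svg+xml) echo svg ;;
    *)             return 1 ;;
  esac
}

ext_for_mime 'image/png'
ext_for_mime 'image/jpeg; charset=binary'
```

The Content-Type value itself would have to come from the image response, for example from the headers printed by `curl -sI`.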

danburzo commented 1 month ago

Thanks for the report. It seems that something is not hooked up correctly when the HTML content comes in via standard input. Will investigate!

jinshuqishi2019 commented 1 month ago

I have a question: how can I modify the following command to bundle multiple web pages into one EPUB? Thanks!

chromium --headless --incognito --dump-dom https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw | sed -e 's/\(\?wx_fmt=png\)[^"]*/.png/gI' -e 's/\(\?wx_fmt=jpe\?g\)[^"]*/.jpg/gI' -e 's/\(\?wx_fmt=gif\)[^"]*/.gif/gI' | percollate epub https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw -