Web pages cannot correctly identify and download image links.

jinshuqishi2019 commented 1 month ago

Environment

Operating System: Debian 10
node --version: v20.12.2
npm --version: 10.7.0
yarn --version, if using Yarn:
percollate --version: v4.2.1

Description

Thank you for answering my earlier question, "Images requiring a Referer header are not fetched".

Another minor issue mentioned in that question has not been resolved yet.

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate html --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ - --inline

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ | percollate epub --url https://mp.weixin.qq.com/s/Yr9naXOZSPHlR3XDHVJ8tQ -

The EPUB and HTML outputs do not include the downloaded images. Even with the --inline option, the HTML output still references the remote image URLs, whereas the PDF output displays the images correctly.

Thanks.

jinshuqishi2019 commented 1 month ago

As a workaround, I am currently using the sed command to rewrite the image links in the HTML so that they end in an image file extension (such as .png).
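A minimal sketch of that kind of rewrite, for readers who want to see its effect on a single tag. The function name `rewrite_wx_fmt` and the example URL are hypothetical. This variant uses GNU sed with extended regular expressions (`-E`) and the case-insensitive `I` flag, so it is not a character-for-character copy of any command in this thread.

```shell
# Hypothetical helper: turn "?wx_fmt=<type>..." query strings into a plain
# file extension. Requires GNU sed (-E and the I flag are used here).
rewrite_wx_fmt() {
  sed -E \
    -e 's/\?wx_fmt=png[^"]*/.png/gI' \
    -e 's/\?wx_fmt=jpe?g[^"]*/.jpg/gI' \
    -e 's/\?wx_fmt=gif[^"]*/.gif/gI'
}

# Example with a made-up image URL:
printf '%s\n' '<img src="https://example.com/img/640?wx_fmt=jpeg&amp;from=appmsg">' | rewrite_wx_fmt
# prints: <img src="https://example.com/img/640.jpg">
```

Everything from `?wx_fmt=` up to the closing double quote is dropped, so any other query parameters after the format are lost as well. Whether the rewritten URL still resolves depends on the image server.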

I asked ChatGPT how to solve this problem, and it suggested installing the Cheerio library to extract the image links. For image URLs without a file extension, it also recommended using the mime-types library to read the MIME type from the response headers and derive the file extension from it.
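The idea of deriving an extension from the Content-Type header can be sketched in shell as well. This is only an illustration of the technique, not the mime-types library and not how percollate works internally. The function name `ext_for_content_type` and the small lookup table are assumptions made for this example.

```shell
# Hypothetical helper: map a Content-Type header value to a file extension.
# The table covers only a few common image types (an assumption for this sketch).
ext_for_content_type() {
  # Drop parameters such as "; charset=binary", strip whitespace, lowercase.
  mime=$(printf '%s' "$1" | cut -d';' -f1 | tr -d '[:space:]' | tr '[:upper:]' '[:lower:]')
  case "$mime" in
    image/png)            echo .png ;;
    image/jpeg|image/jpg) echo .jpg ;;   # image/jpg is non-standard but seen in practice
    image/gif)            echo .gif ;;
    image/webp)           echo .webp ;;
    image/svg+xml)        echo .svg ;;
    *)                    return 1 ;;    # unknown type: let the caller decide
  esac
}

ext_for_content_type 'image/jpeg; charset=binary'
# prints: .jpg
```

The header value itself could be obtained with something like `curl -sIL -o /dev/null -w '%{content_type}' "$url"`, where `$url` is a placeholder for the image address.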

danburzo commented 1 month ago

Thanks for the report. It seems that something is not hooked up correctly when the HTML content comes in via standard input. Will investigate!

jinshuqishi2019 commented 1 month ago

I have a question: how can I modify the following command to bundle multiple web pages into one EPUB? Thanks!

chromium --headless --incognito --dump-dom  https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw|sed -e 's/\(\?wx_fmt=png\)[^"]*/.png/gI' -e 's/\(\?wx_fmt=jpe\?g\)[^"]*/.jpg/gI' -e 's/\(\?wx_fmt=gif\)[^"]*/.gif/gI'| percollate epub https://mp.weixin.qq.com/s/B_-_5k0JMiDoWWTAv7aSbw -

danburzo / percollate

Web pages cannot correctly identify and download image links. #174
