danburzo / percollate

A command-line tool to turn web pages into readable PDF, EPUB, HTML, or Markdown docs.
https://danburzo.ro/projects/percollate/
MIT License

request to [url] failed, reason: connect ETIMEDOUT #136

Closed: AlexYuan closed this issue 2 years ago

AlexYuan commented 2 years ago

Environment

Description

When I run this command:

percollate pdf --output some1.pdf https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html

(or with the URL https://www.washingtonpost.com/politics/2022/02/28/russia-ukraine-logistics-invasion/), I get this error:

Fetching: https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html
https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html: request to https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html failed, reason: connect ETIMEDOUT 198.27.124.186:443
FetchError: request to https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html failed, reason: connect ETIMEDOUT 198.27.124.186:443
    at ClientRequest.<anonymous> (/usr/local/lib/node_modules/percollate/node_modules/node-fetch/lib/index.js:1491:11)
    at ClientRequest.emit (events.js:400:28)
    at TLSSocket.socketErrorListener (_http_client.js:475:9)
    at TLSSocket.emit (events.js:400:28)
    at emitErrorNT (internal/streams/destroy.js:106:8)
    at emitErrorCloseNT (internal/streams/destroy.js:74:3)
    at processTicksAndRejections (internal/process/task_queues.js:82:21) {
  type: 'system',
  errno: 'ETIMEDOUT',
  code: 'ETIMEDOUT'
}
Ignoring item
Saved PDF: some1.pdf

However, Puppeteer can load these URLs and output a PDF.

danburzo commented 2 years ago

Hi @AlexYuan, thanks for the report. Fetching URLs may occasionally fail for one reason or another. For the NYT article, I can't personally reproduce the issue, while the WaPo article is redirecting in an infinite loop. Adding the --debug flag to the command may offer additional clues to what's going on.

In any case, when percollate is unable to fetch a URL, you can fetch it externally and pass it to percollate on STDIN, like the example below:

curl https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html | percollate pdf --output=some1.pdf - --url=https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html

Notice that when we use STDIN (via the - operand), we provide the web page's original URL with the --url option.

AlexYuan commented 2 years ago

Thanks for your reply. I can open the NYT article in my Chrome (Edge) browser through a VPN, so maybe percollate needs a command-line argument like '--proxy', or one like '--timeout' (>30*1000ms)?

danburzo commented 2 years ago

The request for proxy options has come up before. To keep percollate's focus narrow, any URL that needs a custom fetch setup is currently best fetched outside percollate. (See https://github.com/danburzo/percollate/issues/23)
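
For readers who need a proxy or a longer timeout, here is a minimal sketch of the "fetch outside percollate" approach, building on the curl example earlier in the thread. The helper name fetch_to_pdf, the PROXY_URL variable, the default proxy address, and the 60-second limit are all assumptions made for illustration; --proxy, --max-time, --location and --fail are standard curl options, and the percollate invocation mirrors the one shown above.

```shell
#!/bin/sh
# Hypothetical helper: fetch a page with curl through a proxy, then hand the
# HTML to percollate on STDIN. Not part of percollate itself.

fetch_to_pdf() {
    url="$1"      # page to convert
    output="$2"   # PDF file to write

    # Assumed placeholder: replace with your own proxy, or set PROXY_URL.
    proxy="${PROXY_URL:-http://127.0.0.1:8080}"

    # --location follows redirects; --max-time caps the whole transfer (seconds);
    # --fail stops curl from emitting the body of an HTTP error page.
    curl --silent --show-error --fail --location \
         --proxy "$proxy" --max-time 60 "$url" |
        percollate pdf --output="$output" - --url="$url"
}

# Usage (commented out so that sourcing this file has no side effects):
# fetch_to_pdf "https://www.nytimes.com/2022/04/27/us/politics/ukraine-war-expansion.html" some1.pdf
```

Because the fetch happens in curl, any other network setup (custom headers, cookies, retries) can be added there without changes to percollate.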