ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

[Errno 1] Operation not permitted #449

Open makew0rld opened 4 years ago

makew0rld commented 4 years ago

What I wanted: An available HTTP resource to be downloaded, as can be done in the browser.

What I expect: The resource will be downloaded.

What happened: The following shows up repeatedly in the logs, for many different URLs. This hasn't been an issue for the rest of this archive, so I'm confused about what's making it happen now.

2020-04-02 17:59:38,684 - wpull.processor.web - INFO - Fetching ‘https://cdn.mos.cms.futurecdn.net/KhPft6889LsFNgFMAF2Hj3-650-80.jpg’.
2020-04-02 17:59:39,820 - wpull.processor.base - ERROR - Fetching ‘https://cdn.mos.cms.futurecdn.net/KhPft6889LsFNgFMAF2Hj3-650-80.jpg’ encountered an error: [Errno 1] Operation not permitted

This URL, and others that caused the error, can easily be accessed in the browser.

The command or website that causes the problem: wpull was run with these options:

--warc-file freevintageillustrations --warc-cdx --warc-append --warc-max-size 2147483648 \
--no-check-certificate \
--no-robots --user-agent "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36" \
--wait 1 --random-wait --waitretry 600 \
--page-requisites --recursive --level inf --sitemaps \
--span-hosts-allow linked-pages,page-requisites \
--escaped-fragment --strip-session-id \
--tries 3 --retry-connrefused --retry-dns-error \
--timeout 60 \
--database database.db \
-o wpull.log -nv

Operating system: Debian Linux

Python version: 3.5.9

Wpull version: 2.0.3 - from the tag, not master.

Log/Output: I provided some example log output above. Here's a link to all the Operation not permitted errors I've had so far, across a variety of domains.
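For anyone wanting to build a similar list from their own log, here is a minimal sketch that counts these failures per host. It assumes the log lines look exactly like the two quoted above (including the curly quotes around the URL); `failed_hosts` and `summarize` are hypothetical helper names, not part of wpull.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Matches ERROR lines in the format quoted above; the URL sits between
# curly quotes (‘ and ’). The pattern is an assumption based on those samples.
ERROR_RE = re.compile(
    r"ERROR - Fetching ‘(?P<url>[^’]+)’ encountered an error: "
    r"\[Errno 1\] Operation not permitted"
)


def failed_hosts(log_lines):
    """Count 'Operation not permitted' failures per host name."""
    counts = Counter()
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            counts[urlsplit(match.group("url")).hostname] += 1
    return counts


def summarize(log_path):
    """Read a log file and return (host, count) pairs, most frequent first."""
    with open(log_path, encoding="utf-8") as log:
        return failed_hosts(log).most_common()
```

Calling `summarize("wpull.log")` on the file named by the `-o` option above would give the affected hosts, which makes it easier to see whether the failures cluster on a few CDNs.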

JustAnotherArchivist commented 4 years ago

This is a TLS issue. Specifically, the site uses weak cipher suites which are no longer permitted by Debian's default OpenSSL config file. Cf. https://github.com/ArchiveTeam/ArchiveBot/issues/424.

wpull should override those defaults from the config file, at least if --no-strong-crypto is used. Until that's implemented, you can use the workaround in https://github.com/ArchiveTeam/ArchiveBot/pull/428 of specifying an alternative, less secure OpenSSL config file through an environment variable.
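The workaround can be sketched roughly as follows. This is an illustration under assumptions, not the contents of the linked pull request: it assumes the environment variable is OpenSSL's standard `OPENSSL_CONF`, that the restrictive settings are the `MinProtocol` and `CipherString` entries in Debian's system default section, and that lowering them to TLSv1 and security level 1 is enough for the affected hosts. The file name and helper functions are hypothetical.

```python
import os
import subprocess

# A relaxed OpenSSL config. The section layout mirrors Debian's default
# openssl.cnf; the two values at the bottom are assumptions and lower
# security for every TLS connection the process makes.
RELAXED_OPENSSL_CONF = """\
openssl_conf = default_conf

[default_conf]
ssl_conf = ssl_sect

[ssl_sect]
system_default = system_default_sect

[system_default_sect]
MinProtocol = TLSv1
CipherString = DEFAULT@SECLEVEL=1
"""


def write_relaxed_conf(directory):
    """Write the relaxed config into `directory` and return its path."""
    path = os.path.join(directory, "openssl-relaxed.cnf")
    with open(path, "w", encoding="ascii") as conf:
        conf.write(RELAXED_OPENSSL_CONF)
    return path


def env_with_conf(conf_path, base_env=None):
    """Return a copy of the environment with OPENSSL_CONF pointing at conf_path."""
    env = dict(os.environ if base_env is None else base_env)
    env["OPENSSL_CONF"] = conf_path
    return env


def run_wpull(wpull_args, conf_path):
    """Start wpull as a child process so OpenSSL reads the config at startup."""
    return subprocess.run(["wpull", *wpull_args], env=env_with_conf(conf_path))
```

The variable has to be set before the Python process running wpull starts, since OpenSSL loads its config during initialization; that is why the sketch launches wpull as a subprocess rather than changing `os.environ` in place.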