ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
https://archivebox.io
MIT License
20.88k stars 1.11k forks

Bug: Cannot archive jpg at non-80 port 443 #1210

Closed mlsteele closed 8 months ago

mlsteele commented 1 year ago

Describe the bug

Archiving this URL fails repeatedly: https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg

Two weird things about this url:

  1. It is a jpg, not an html page. I don't think this is the issue, though wget_output_path does have html-specific elements.
  2. It is at port 443.

Steps to reproduce

  1. Run archivebox add https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
  2. Find the timestamp on the webui by searching for the url.
  3. Run archivebox update -t timestamp 1691792281.185177 (your timestamp will vary)
  4. See wget failure and web ui lacking the content

Screenshots or log output

[i] [2023-08-11 23:09:20] ArchiveBox v0.6.2: archivebox update -t timestamp 1691792281.185177
    > /Users/miles/archivebox

[▶] [2023-08-11 23:09:22] Starting archiving of 1 snapshots in index...

[√] [2023-08-11 23:09:22] "newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg"
    https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
    √ ./archive/1691792281.185177
      > wget
        Extractor failed:
             Wget failed or got an error from the server
            Got wget response code: 0.
            Total wall clock time: 0.3s
            Downloaded: 1 files, 149K in 0.05s (2.67 MB/s)
        Run to see full output:
            cd /Users/miles/archivebox/archive/1691792281.185177;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=/Users/miles/archivebox/archive/1691792281.185177/warc/1691795362 --page-requisites "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+https://github.com/ArchiveBox/ArchiveBox/) wget/GNU Wget 1.21.4" --compression=auto https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg

        10 files (1.1 MB) in 0:00:00s 

[√] [2023-08-11 23:09:23] Update of 1 pages complete (0.72 sec)
    - 0 links skipped
    - 1 links updated
    - 1 links had errors

    Hint: To manage your archive in a Web UI, run:
        archivebox server 0.0.0.0:8000

Possible fix

The issue is that in wget.py in wget_output_path this line:

    search_dir = Path(link.link_dir) / domain(link.url).replace(":", "+") / urldecode(full_path)

translates "newfs.s3.amazonaws.com:443" into "newfs.s3.amazonaws.com+443" and then goes on a wild goose chase looking for the file. The real wget did not do that translation; it dropped the port number when placing the jpg in the filesystem.

A localized solution is to use urlparse(link.url).hostname instead of domain(link.url).

A broader solution, which may do more good or break something far away, is to change in util.py:

- domain = lambda url: urlparse(url).netloc
+ domain = lambda url: urlparse(url).hostname
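
To make the difference concrete, here is a minimal sketch using only the standard library. It compares the directory name derived from netloc (as in the current line) with the one derived from hostname. The variable names netloc_dir and hostname_dir are illustrative and do not appear in ArchiveBox.

```python
from urllib.parse import urlparse

url = (
    "https://newfs.s3.amazonaws.com:443"
    "/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg"
)
parsed = urlparse(url)

# netloc keeps the explicit port, so the ":" -> "+" replacement produces
# a directory name that includes "+443".
netloc_dir = parsed.netloc.replace(":", "+")

# hostname drops the port (and any userinfo) and lowercases the host.
hostname_dir = parsed.hostname

print(netloc_dir)    # newfs.s3.amazonaws.com+443
print(hostname_dir)  # newfs.s3.amazonaws.com
```

Note that hostname also lowercases the host and returns None when the URL has no host, so swapping it in globally could change behavior for callers that rely on netloc.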

ArchiveBox version

ArchiveBox v0.6.2
Cpython Darwin macOS-13.5-x86_64-i386-64bit x86_64
IN_DOCKER=False DEBUG=False IS_TTY=False TZ=UTC SEARCH_BACKEND_ENGINE=ripgrep

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /Users/miles/.pyenv/versions/3.11.1/bin/archivebox                          
 √  PYTHON_BINARY         v3.11.1         valid     /Users/miles/.pyenv/versions/3.11.1/bin/python3.11                          
 √  DJANGO_BINARY         v3.1.14         valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/django/bin/django-admin.py
 √  CURL_BINARY           v8.1.2          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.4         valid     /usr/local/bin/wget                                                         
 √  NODE_BINARY           v18.16.0        valid     /opt/nodejs/bin/node                                                        
 √  SINGLEFILE_BINARY     v1.0.47         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.6          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.41.0         valid     /usr/local/bin/git                                                          
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /Users/miles/.pyenv/versions/3.11.1/bin/youtube-dl                          
 √  CHROME_BINARY         v115.0.5790.75  valid     /Users/miles/Library/Caches/ms-playwright/chromium-1071/chrome-mac/Chromium.app/Contents/MacOS/Chromium
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/local/bin/rg                                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/archivebox 
 √  TEMPLATES_DIR         3 files         valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /Users/miles/archivebox                                                     
 √  SOURCES_DIR           10 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           40 files        valid     ./archive                                                                   
 √  CONFIG_FILE           222.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             616.0 KB        valid     ./index.sqlite3                                                             

pirate commented 8 months ago

I suspect that wget stripping the port for the URL you provided is just a special case of it being both https and :443.
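
A rough sketch of that suspicion, assuming wget omits the port from the host directory only when it equals the scheme's default port, and otherwise appends it with a "+" separator under --restrict-file-names=windows. The helper guess_wget_host_dir and the DEFAULT_PORTS table are hypothetical and not part of ArchiveBox or wget.

```python
from urllib.parse import urlsplit

# Assumed default ports per scheme (hypothetical table for this sketch).
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def guess_wget_host_dir(url: str, port_sep: str = "+") -> str:
    """Guess the host directory name wget would create for a URL.

    Assumption: the port is dropped when it is absent or matches the
    scheme's default, and kept (joined with port_sep) otherwise.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(parts.scheme):
        return host
    return f"{host}{port_sep}{port}"


print(guess_wget_host_dir("https://example.com:443/a.jpg"))   # example.com
print(guess_wget_host_dir("https://example.com:8443/a.jpg"))  # example.com+8443
print(guess_wget_host_dir("http://example.com:443/a.jpg"))    # example.com+443
```

Under this assumption, always using hostname would be wrong for URLs on a non-default port, which is one reason to keep the existing netloc-based lookup as the first pass.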

I agree that your approach using urlparse is better, but the existing behavior in wget_output_path is written in blood haha. There are so many subtle edge cases to how they rewrite URLs to be suitable for the filesystem, I really just wish they printed the final path to stdout! I'm going to err on the safe side and add it as a fallback instead of modifying the first-pass behavior.
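
For illustration only, a fallback of this kind might look like the sketch below. It is not the code from the commit; the function name find_host_dir and its signature are hypothetical. It tries the existing netloc-based directory name first and only falls back to the hostname-based name if the first does not exist.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import urlparse


def find_host_dir(link_dir: Path, url: str) -> Optional[Path]:
    """Return the first existing host directory under link_dir, or None.

    Hypothetical helper: the first candidate mirrors the existing
    first-pass behavior, the second is the hostname-only fallback.
    """
    parsed = urlparse(url)
    candidates = [parsed.netloc.replace(":", "+")]
    if parsed.hostname and parsed.hostname not in candidates:
        candidates.append(parsed.hostname)

    for name in candidates:
        # Skip empty names so we never match link_dir itself.
        if name and (link_dir / name).is_dir():
            return link_dir / name
    return None
```

Keeping the original candidate first means snapshots that already resolve correctly are unaffected; the hostname candidate is only consulted when the first lookup misses.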

Fixed in 99bb02cd6c58c991d5298cd8df95888f09ef1bdf, will be released in v0.7.3.