ArchiveBox / ArchiveBox

🗃 Open source self-hosted web archiving. Takes URLs/browser history/bookmarks/Pocket/Pinboard/etc., saves HTML, JS, PDFs, media, and more...
MIT License

Bug: Cannot archive jpg at non-80 port 443 #1210

Closed mlsteele closed 8 months ago

mlsteele commented 1 year ago

Describe the bug

When trying to archive this URL, it repeatedly fails.

Two weird things about this url:

  1. It is a jpg, not an HTML page. I don't think this is the issue, though wget_output_path does have HTML-specific elements.
  2. It is at port 443.

Steps to reproduce

  1. Run archivebox add
  2. Find the timestamp on the webui by searching for the url.
  3. Run archivebox update -t timestamp 1691792281.185177 (your timestamp will vary)
  4. See wget failure and web ui lacking the content

Screenshots or log output

[i] [2023-08-11 23:09:20] ArchiveBox v0.6.2: archivebox update -t timestamp 1691792281.185177
    > /Users/miles/archivebox

[▶] [2023-08-11 23:09:22] Starting archiving of 1 snapshots in index...

[√] [2023-08-11 23:09:22] ""
    √ ./archive/1691792281.185177
      > wget
        Extractor failed:
             Wget failed or got an error from the server
            Got wget response code: 0.
            Total wall clock time: 0.3s
            Downloaded: 1 files, 149K in 0.05s (2.67 MB/s)
        Run to see full output:
            cd /Users/miles/archivebox/archive/1691792281.185177;
            wget --no-verbose --adjust-extension --convert-links --force-directories --backup-converted --span-hosts --no-parent -e robots=off --timeout=60 --restrict-file-names=windows --warc-file=/Users/miles/archivebox/archive/1691792281.185177/warc/1691795362 --page-requisites "--user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.61 Safari/537.36 ArchiveBox/0.6.2 (+ wget/GNU Wget 1.21.4" --compression=auto

        10 files (1.1 MB) in 0:00:00s 

[√] [2023-08-11 23:09:23] Update of 1 pages complete (0.72 sec)
    - 0 links skipped
    - 1 links updated
    - 1 links had errors

    Hint: To manage your archive in a Web UI, run:
        archivebox server

Possible fix

The issue is that in wget_output_path this line:

    search_dir = Path(link.link_dir) / domain(link.url).replace(":", "+") / urldecode(full_path)

translates "" into "" and then goes on a wild goose chase looking for the file. The real wget did not do that translation; instead it threw out the port number when placing the jpg in the filesystem.

A localized solution is to use urlparse(link.url).hostname instead of domain(link.url).

A broader solution, which may do more good or break something far away, is to change in

- domain = lambda url: urlparse(url).netloc
+ domain = lambda url: urlparse(url).hostname
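To make the difference concrete, here is a minimal sketch of what the two attributes return for a URL with an explicit port. The URL is a hypothetical stand-in, not the one from this report, and the variable names are only for illustration.

```python
from urllib.parse import urlparse

# Hypothetical URL with an explicit port (not the URL from this report).
url = "https://example.com:443/images/photo.jpg"
parsed = urlparse(url)

# netloc keeps the port, so the ":" -> "+" replacement yields a "+443" suffix.
netloc_dir = parsed.netloc.replace(":", "+")
# hostname drops the port entirely.
hostname_dir = parsed.hostname

print(netloc_dir)    # example.com+443
print(hostname_dir)  # example.com
```

Note that `.hostname` also lowercases the host and drops any `user:password@` prefix that `.netloc` would keep, which is one way the broader change could affect other callers.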

ArchiveBox version

ArchiveBox v0.6.2
Cpython Darwin macOS-13.5-x86_64-i386-64bit x86_64

[i] Dependency versions:
 √  ARCHIVEBOX_BINARY     v0.6.2          valid     /Users/miles/.pyenv/versions/3.11.1/bin/archivebox                          
 √  PYTHON_BINARY         v3.11.1         valid     /Users/miles/.pyenv/versions/3.11.1/bin/python3.11                          
 √  DJANGO_BINARY         v3.1.14         valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/django/bin/
 √  CURL_BINARY           v8.1.2          valid     /usr/bin/curl                                                               
 √  WGET_BINARY           v1.21.4         valid     /usr/local/bin/wget                                                         
 √  NODE_BINARY           v18.16.0        valid     /opt/nodejs/bin/node                                                        
 √  SINGLEFILE_BINARY     v1.0.47         valid     ./node_modules/single-file/cli/single-file                                  
 √  READABILITY_BINARY    v0.0.6          valid     ./node_modules/readability-extractor/readability-extractor                  
 √  MERCURY_BINARY        v1.0.0          valid     ./node_modules/@postlight/mercury-parser/cli.js                             
 √  GIT_BINARY            v2.41.0         valid     /usr/local/bin/git                                                          
 √  YOUTUBEDL_BINARY      v2021.12.17     valid     /Users/miles/.pyenv/versions/3.11.1/bin/youtube-dl                          
 √  CHROME_BINARY         v115.0.5790.75  valid     /Users/miles/Library/Caches/ms-playwright/chromium-1071/chrome-mac/
 √  RIPGREP_BINARY        v13.0.0         valid     /usr/local/bin/rg                                                           

[i] Source-code locations:
 √  PACKAGE_DIR           23 files        valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/archivebox 
 √  TEMPLATES_DIR         3 files         valid     /Users/miles/.pyenv/versions/3.11.1/lib/python3.11/site-packages/archivebox/templates
 -  CUSTOM_TEMPLATES_DIR  -               disabled                                                                              

[i] Secrets locations:
 -  CHROME_USER_DATA_DIR  -               disabled                                                                              
 -  COOKIES_FILE          -               disabled                                                                              

[i] Data locations:
 √  OUTPUT_DIR            8 files         valid     /Users/miles/archivebox                                                     
 √  SOURCES_DIR           10 files        valid     ./sources                                                                   
 √  LOGS_DIR              1 files         valid     ./logs                                                                      
 √  ARCHIVE_DIR           40 files        valid     ./archive                                                                   
 √  CONFIG_FILE           222.0 Bytes     valid     ./ArchiveBox.conf                                                           
 √  SQL_INDEX             616.0 KB        valid     ./index.sqlite3                                                             

pirate commented 8 months ago

I suspect that wget stripping the port for the URL you provided is just a special case of it being both https and :443.
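If that suspicion holds, the host directory name could be guessed roughly as below. This is only a sketch of the hypothesis, not verified wget behavior; `DEFAULT_PORTS` and `host_dir_guess` are hypothetical names, and the `+` separator is taken from the `replace(":", "+")` in the line quoted in the report.

```python
from urllib.parse import urlparse

# Well-known default ports per scheme; an explicit port equal to the
# default carries no extra information.
DEFAULT_PORTS = {"http": 80, "https": 443}


def host_dir_guess(url: str) -> str:
    """Guess the host directory name, ASSUMING default ports are dropped."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # No port, or a port matching the scheme default: host only.
    if parsed.port is None or parsed.port == DEFAULT_PORTS.get(parsed.scheme):
        return host
    # Any other explicit port: keep it, with ":" replaced by "+".
    return f"{host}+{parsed.port}"


print(host_dir_guess("https://example.com:443/a.jpg"))   # example.com
print(host_dir_guess("https://example.com:8443/a.jpg"))  # example.com+8443
```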

I agree that your approach using urlparse is better, but the existing behavior in wget_output_path is written in blood haha. There are so many subtle edge cases to how they rewrite URLs to be suitable for the filesystem, I really just wish they printed the final path in stdout! I'm going to err on the safe side and add it as a fallback instead of modifying the first-pass behavior.
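A fallback along those lines might look like the sketch below. It is not the code from the commit; `find_wget_dir` is a hypothetical helper that tries the existing netloc-based guess first and only then the hostname-based one.

```python
from pathlib import Path
from typing import Optional
from urllib.parse import unquote, urlparse


def find_wget_dir(link_dir: str, url: str) -> Optional[Path]:
    """Return the first candidate directory that exists on disk, else None."""
    parsed = urlparse(url)
    # Directory part of the URL path, e.g. "images" for "/images/photo.jpg".
    rel_dir = unquote(parsed.path).lstrip("/").rpartition("/")[0]
    candidates = [
        # First pass: existing behavior, port kept with ":" replaced by "+".
        Path(link_dir) / parsed.netloc.replace(":", "+") / rel_dir,
        # Fallback: port dropped, as proposed in the report.
        Path(link_dir) / (parsed.hostname or "") / rel_dir,
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

Keeping the original guess first means snapshots that already resolve correctly are unaffected; the hostname-based path is consulted only when the first lookup finds nothing.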

Fixed in 99bb02cd6c58c991d5298cd8df95888f09ef1bdf, will be released in v0.7.3.