Closed mlsteele closed 8 months ago
I suspect that wget stripping the port for the URL you provided is just a special case of it being both https and :443.
I agree that your approach using urlparse is better, but the existing behavior in wget_output_path is written in blood haha.
There are so many subtle edge cases to how they rewrite URLs to be suitable for the filesystem, I really just wish they printed the final path in stdout!
I'm going to err on the safe side and add it as a fallback instead of modifying the first-pass behavior.
Fixed in 99bb02cd6c58c991d5298cd8df95888f09ef1bdf, will be released in v0.7.3.
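The fallback idea can be sketched roughly as follows. This is only an illustration, not ArchiveBox's actual code: candidate_dirs is a hypothetical helper, and the ":" to "+" rewrite simply mimics the first-pass behavior reported in this issue. It tries the host-with-port name first and only then the bare hostname from urlparse.

```python
from typing import List
from urllib.parse import urlparse


def candidate_dirs(url: str) -> List[str]:
    """Hypothetical helper: directory names to try when locating wget's output."""
    parsed = urlparse(url)
    # First pass: host plus port, with ':' rewritten to '+'
    # (assumed here to mirror the existing behavior described in the issue).
    with_port = parsed.netloc.replace(":", "+")
    # Fallback: hostname only, with the port dropped.
    without_port = parsed.hostname or ""
    candidates = [with_port]
    if without_port and without_port != with_port:
        candidates.append(without_port)
    return candidates


url = "https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg"
print(candidate_dirs(url))
# ['newfs.s3.amazonaws.com+443', 'newfs.s3.amazonaws.com']
```

Keeping the original name first means URLs that already resolve correctly are unaffected; the bare hostname is only consulted when the first candidate does not exist on disk.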
Describe the bug
Archiving this URL repeatedly fails: https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
Two weird things about this url:
Steps to reproduce
archivebox add https://newfs.s3.amazonaws.com:443/taxon-images-1000s1000/Fabaceae/gleditsia-triacanthos-ba-dstiles.jpg
archivebox update -t timestamp 1691792281.185177
(your timestamp will vary)
Screenshots or log output
Possible fix
The issue is that in wget.py, in wget_output_path, a line translates "newfs.s3.amazonaws.com:443" into "newfs.s3.amazonaws.com+443" and then goes on a wild goose chase looking for the file, whereas the real wget did not do that translation and instead threw out the port number when placing the jpg in the filesystem.
A localized solution is to use urlparse(link.url).hostname instead of domain(link.url).
A broader solution, which may do more good or break something far away, is to change in util.py:
ArchiveBox version