Closed: jpco closed this issue 2 weeks ago
Hi!
Thanks for the detailed report.
You’re right, the problem also happens with Python’s urlopen:

```python
from urllib.request import urlopen

urlopen('https://i0.wp.com/www.hcn.org/wp-content/uploads/2023/04/HCN_Logo-Horizontal_White-1.png')
```
This code fails with the same 404 error. Strange, but definitely not a bug in WeasyPrint’s code.
I’ve tried hard to find an equivalent problem elsewhere in Python, but the only one I’ve found is Patrick-Hogan/wandering_inn#20, which happens with the exact same i0.wp.com server. I suspect that there’s a very specific configuration on that server that makes urllib fail for some reason, or that there’s an actual bug in Python. This works:
```python
import ssl
from urllib.request import urlopen

urlopen('https://i0.wp.com/www.hcn.org/wp-content/uploads/2023/04/HCN_Logo-Horizontal_White-1.png', context=ssl.create_default_context())
```
We won’t include a workaround in WeasyPrint for that, as the problem appears with pure urllib code too, and only with this specific server (as far as I can tell). But if you want to report the problem on CPython’s bug tracker, don’t hesitate to leave a link here; I’m curious about the reason why this 404 appears!
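If it helps with such a report, the difference can be demonstrated with a small helper. This is just a sketch: `status_for` is a hypothetical name, and the expected status codes in the comment are taken from the behaviour described above, not verified here.

```python
import ssl
from urllib.error import HTTPError
from urllib.request import urlopen

def status_for(url, context=None):
    """Return the HTTP status code, whether or not urlopen raises."""
    try:
        with urlopen(url, context=context) as response:
            return response.status
    except HTTPError as error:
        return error.code

# On the i0.wp.com URL above, the thread reports:
#   status_for(url)                               -> 404
#   status_for(url, ssl.create_default_context()) -> 200
```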
Full reproduction script, including workaround:
When running this script, I get
and these images are (naturally) not included in the output PDF.
However, when setting `url_fetcher=fallback_fetcher`, these images load perfectly well via the `requests.get()` call and are included in the PDF.

I wasn't able to determine what about the `urllib` library was causing the issues with fetching these images, but when testing I did determine that it wasn't the query string (the 404s occur with the same links with no query strings) and it wasn't the `User-Agent` header (the 404s occur when setting a different user agent, including the exact one that `curl` -- which works -- uses). It also doesn't seem to be anything that the `default_url_fetcher` is doing wrong, exactly -- using `urllib` in the most straightforward way also gets these 404 errors.

I have a hacky workaround included here, but this seems like something that shouldn't need a custom URL fetcher to work, so I'm filing it as an issue.