Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License

default_url_fetcher can't fetch certain resources #2261

Closed jpco closed 2 weeks ago

jpco commented 2 weeks ago

Full reproduction script, including workaround:

import io
import sys

import requests
import logging
from weasyprint import default_url_fetcher, HTML
from weasyprint.text.fonts import FontConfiguration

url = 'https://www.hcn.org/issues/56-10/how-do-you-describe-a-sacred-site-without-describing-it/'

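def fallback_fetcher(url):
    # Try WeasyPrint's urllib-based fetcher first; fall back to requests on failure.
    try:
        return default_url_fetcher(url)
    except Exception:
        r = requests.get(url, timeout=10)
        # Don't hand an error page to WeasyPrint as if it were the resource.
        r.raise_for_status()
        return {'string': r.content}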

logger = logging.getLogger('weasyprint')
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler(sys.stderr))

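# As written this reproduces the 404s; use HTML(url, url_fetcher=fallback_fetcher) for the workaround.
h = HTML(url)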
fc = FontConfiguration()
h.write_pdf(
    "demo.pdf",
    font_config=fc,
)

When running this script, I get

$ python demo.py
Failed to load image at 'https://i0.wp.com/www.hcn.org/wp-content/uploads/2023/04/HCN_Logo-Horizontal_White-1.png?fit=1763%2C253&ssl=1': HTTPError: HTTP Error 404: Not Found
Failed to load image at 'https://i0.wp.com/www.hcn.org/wp-content/uploads/2024/09/sacred-site-56-10_1-scaled.jpg?resize=780%2C1089&ssl=1': HTTPError: HTTP Error 404: Not Found

and these images are (naturally) not included in the output PDF.

However, when setting url_fetcher=fallback_fetcher, these images load perfectly well via the requests.get() call and are included in the PDF.
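The fallback idea can also be written as a small generic wrapper. This is only a sketch: `make_fallback_fetcher` is a hypothetical helper name, not part of WeasyPrint, and it assumes both fetchers take a URL and return the dict WeasyPrint expects from a url_fetcher.

```python
import logging

logger = logging.getLogger(__name__)


def make_fallback_fetcher(primary, secondary):
    """Return a fetcher that tries `primary` and calls `secondary` if it raises."""
    def fetcher(url):
        try:
            return primary(url)
        except Exception as error:
            # Log the failure so the fallback does not silently hide real problems.
            logger.warning('primary fetcher failed for %s: %s', url, error)
            return secondary(url)
    return fetcher


# Hypothetical wiring (requires weasyprint; `requests_fetcher` stands for the
# requests-based branch of the script above):
#   from weasyprint import HTML, default_url_fetcher
#   fetcher = make_fallback_fetcher(default_url_fetcher, requests_fetcher)
#   HTML(url, url_fetcher=fetcher).write_pdf('demo.pdf')
```

Keeping the two fetchers separate makes it possible to swap the order or test each one alone.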

I wasn't able to determine what about the urllib library was causing the issues with fetching these images, but when testing I did determine that it wasn't the query string (the 404s occur with the same links with no query strings) and it wasn't the User-Agent header (the 404s occur when setting a different user agent, including the exact one that curl -- which works -- uses). It also doesn't seem to be anything that the default_url_fetcher is doing wrong, exactly -- using urllib in the most straightforward way also gets these 404 errors.
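The checks described above can be reproduced with plain urllib, roughly like this. It is a sketch, not the exact code I ran: `strip_query` and `probe` are hypothetical helper names, and the User-Agent string in the comment is a placeholder rather than the exact value curl sends.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def strip_query(url):
    """Return `url` without its query string and fragment."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, '', ''))


def probe(url, user_agent=None, timeout=10):
    """Return the HTTP status urllib gets for `url`, optionally with a custom User-Agent."""
    headers = {'User-Agent': user_agent} if user_agent else {}
    try:
        with urlopen(Request(url, headers=headers), timeout=timeout) as response:
            return response.status
    except HTTPError as error:
        # urllib raises on 4xx/5xx; report the code instead of propagating.
        return error.code


# Example comparison (performs network requests when uncommented):
#   image = 'https://i0.wp.com/www.hcn.org/wp-content/uploads/2023/04/HCN_Logo-Horizontal_White-1.png?fit=1763%2C253&ssl=1'
#   for candidate in (image, strip_query(image)):
#       for agent in (None, 'curl/8.5.0'):  # placeholder User-Agent value
#           print(probe(candidate, agent), agent, candidate)
```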

I have a hacky workaround included here, but this seems like something that shouldn't need a custom URL fetcher to work, so I'm filing it as an issue.

liZe commented 2 weeks ago

Hi!

Thanks for the detailed report.

You’re right, the problem also happens with Python’s urlopen:

from urllib.request import urlopen
urlopen('https://i0.wp.com/www.hcn.org/wp-content/uploads/2023/04/HCN_Logo-Horizontal_White-1.png')

This code fails with the same 404 error. Strange, but definitely not a bug in WeasyPrint’s code.

I’ve tried hard to find another equivalent problem in Python, but the only one I’ve found is Patrick-Hogan/wandering_inn#20, happening with the exact same i0.wp.com server. I suspect that there’s a very specific configuration on that server that makes urllib fail for some reason, or that there’s an actual bug in Python. Passing an explicit SSL context to urlopen works:

import ssl
from urllib.request import urlopen
urlopen('https://i0.wp.com/www.hcn.org/wp-content/uploads/2023/04/HCN_Logo-Horizontal_White-1.png', context=ssl.create_default_context())

We won’t include a workaround in WeasyPrint for that, as the problem appears with pure urllib code too, and only with this specific server (as far as I can tell). But if you want to report the problem on CPython’s bug tracker, please leave a link here. I’m curious why this 404 appears!
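If you need this on your side in the meantime, the urlopen call with an explicit context could be wrapped in a custom fetcher. This is a minimal sketch and not WeasyPrint code: `ssl_context_fetcher` is a hypothetical name, it skips things the default fetcher may do (custom headers, compressed responses), and I have only seen the explicit context help with this one server.

```python
import ssl
from urllib.request import urlopen


def ssl_context_fetcher(url, timeout=10):
    """Fetch `url` with urlopen and an explicit default SSL context."""
    context = ssl.create_default_context()
    with urlopen(url, timeout=timeout, context=context) as response:
        # Return the dict shape WeasyPrint expects from a url_fetcher.
        return {
            'string': response.read(),
            'mime_type': response.headers.get_content_type(),
            'redirected_url': response.url,
        }


# Hypothetical usage (requires weasyprint):
#   from weasyprint import HTML
#   HTML(url, url_fetcher=ssl_context_fetcher).write_pdf('demo.pdf')
```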