UnicodeEncodeError: surrogates not allowed with BeautifulSoup

romainhk commented 3 years ago

I get a weird crash when I try to convert to PDF some HTML text that contains a smiley as an HTML entity and was previously modified by BeautifulSoup (what a strange use case, you might say :)

Traceback (most recent call last):
  File "weasyprint_blushed.py", line 15, in <module>
    section.write_pdf('output.pdf')
  File "...weasyprint/__init__.py", line 180, in write_pdf
    self.render(
  File "...weasyprint/__init__.py", line 134, in render
    return Document._render(
  File "...weasyprint/document.py", line 887, in _render
    [Page(page_box) for page_box in page_boxes],
  File "...weasyprint/document.py", line 887, in <listcomp>
    [Page(page_box) for page_box in page_boxes],
  File "...weasyprint/layout/__init__.py", line 124, in layout_document
    pages = list(make_all_pages(context, root_box, html, pages))
  File "...weasyprint/layout/pages.py", line 802, in make_all_pages
    page, resume_at = remake_page(i, context, root_box, html)
  File "...weasyprint/layout/pages.py", line 739, in remake_page
    page, resume_at, next_page = make_page(
  File "...weasyprint/layout/pages.py", line 549, in make_page
    root_box, resume_at, next_page, _, _ = block_level_layout(
  File "...weasyprint/layout/blocks.py", line 58, in block_level_layout
    return block_level_layout_switch(
  File "...weasyprint/layout/blocks.py", line 72, in block_level_layout_switch
    return block_box_layout(
  File "...weasyprint/layout/blocks.py", line 126, in block_box_layout
    block_container_layout(
  File "...weasyprint/layout/blocks.py", line 517, in block_container_layout
    collapsing_through) = block_level_layout(
  File "...weasyprint/layout/blocks.py", line 58, in block_level_layout
    return block_level_layout_switch(
  File "...weasyprint/layout/blocks.py", line 72, in block_level_layout_switch
    return block_box_layout(
  File "...weasyprint/layout/blocks.py", line 126, in block_box_layout
    block_container_layout(
  File "...weasyprint/layout/blocks.py", line 517, in block_container_layout
    collapsing_through) = block_level_layout(
  File "...weasyprint/layout/blocks.py", line 58, in block_level_layout
    return block_level_layout_switch(
  File "...weasyprint/layout/blocks.py", line 72, in block_level_layout_switch
    return block_box_layout(
  File "...weasyprint/layout/blocks.py", line 126, in block_box_layout
    block_container_layout(
  File "...weasyprint/layout/blocks.py", line 379, in block_container_layout
    for i, (line, resume_at) in enumerate(lines_iterator):
  File "...weasyprint/layout/inlines.py", line 48, in iter_line_boxes
    line, resume_at = get_next_linebox(
  File "...weasyprint/layout/inlines.py", line 176, in get_next_linebox
    last_letter, float_width) = split_inline_box(
  File "...weasyprint/layout/inlines.py", line 842, in split_inline_box
    split_inline_level(
  File "...weasyprint/layout/inlines.py", line 700, in split_inline_level
    last_letter, float_widths) = split_inline_box(
  File "...weasyprint/layout/inlines.py", line 842, in split_inline_box
    split_inline_level(
  File "...weasyprint/layout/inlines.py", line 679, in split_inline_level
    new_box, skip, preserved_line_break = split_text_box(
  File "...weasyprint/layout/inlines.py", line 1100, in split_text_box
    layout, length, resume_at, width, height, baseline = split_first_line(
  File "...weasyprint/text/line_break.py", line 341, in split_first_line
    layout = create_layout(
  File "...weasyprint/text/line_break.py", line 287, in create_layout
    layout.set_text(text)
  File "...weasyprint/text/line_break.py", line 200, in set_text
    text, bytestring = unicode_to_char_p(text)
  File "...weasyprint/text/ffi.py", line 408, in unicode_to_char_p
    bytestring = string.encode('utf-8').replace(b'\x00', b'')
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed

Here is a script to reproduce it:

from weasyprint import HTML
from bs4 import BeautifulSoup

html = '''
<body>
<p class="normal">I'm blushed
<span>&#55357;&#56842;</span>
</p>
</body>
'''
soup = BeautifulSoup(html, 'html.parser')
section_ = soup.body.decode_contents(formatter="html")

# Easy workaround
# section_ = section_.replace('\ud83d\ude0a', ':)')

section = HTML(string=section_)  # but it works if you pass the html variable instead
section.write_pdf('output.pdf')
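
The commented-out workaround above only handles this one smiley. A more general variant can be sketched as below, using only the standard library. `repair_surrogates` is a hypothetical helper name, not part of WeasyPrint or Beautiful Soup. It assumes that valid surrogate pairs should be merged into the character they stand for and that any lone surrogate may be replaced.

```python
def repair_surrogates(text: str) -> str:
    """Hypothetical helper: make a str safe to encode as UTF-8.

    'surrogatepass' lets the UTF-16 encoder emit surrogate code units as-is.
    Decoding then merges each valid high/low pair into one code point, and
    'replace' substitutes a replacement character for anything left over.
    """
    raw = text.encode('utf-16-le', errors='surrogatepass')
    return raw.decode('utf-16-le', errors='replace')


broken = '\ud83d\ude0a'          # the two surrogates from the report
fixed = repair_surrogates(broken)
fixed.encode('utf-8')            # no longer raises UnicodeEncodeError
```

In the script above, this would be applied as `section_ = repair_surrogates(section_)` before calling `HTML(string=section_)`.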

Versions:

python3: 3.9.6
beautifulsoup4: 4.9.3
weasyprint: 53.2

I don't know if the issue belongs here or in bs4, but at least nobody seems to have run into it before :) Maybe a try/except would be useful on that line.

liZe commented 3 years ago

Hello, and thanks for the report!

This bug is not related to WeasyPrint. Calling print(soup) after soup = … crashes too.

The two character numbers you use for your entity (55357 and 56842, i.e. U+D83D and U+DE0A) are surrogates, and can’t be used in valid UTF-8 strings. Browsers (and WeasyPrint when you use the html variable directly) can find that the UTF-8 string is invalid, and nicely replace the characters with a Replacement Character (�). BeautifulSoup doesn’t do that.
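
To make the explanation above concrete, here is a small standard-library sketch (no WeasyPrint or BeautifulSoup involved). It checks that both numbers fall in the surrogate ranges, shows the same encode error as in the traceback, and uses `html.unescape` as one example of a decoder that applies the replacement-character rule. The variable names are illustrative only, and `html.unescape` is not claimed to be what WeasyPrint or browsers use internally.

```python
import html

high, low = 55357, 56842  # the two numbers from the entity in the report

# Both values fall in the UTF-16 surrogate ranges.
is_high = 0xD800 <= high <= 0xDBFF
is_low = 0xDC00 <= low <= 0xDFFF

# As a pair they stand for a single code point.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
print(hex(code_point))  # 0x1f60a

# Turning each number into its own character yields a str that
# cannot be encoded as UTF-8.
naive = chr(high) + chr(low)
try:
    naive.encode('utf-8')
    reason = None
except UnicodeEncodeError as exc:
    reason = exc.reason
print(reason)  # surrogates not allowed

# html.unescape maps surrogate character references to U+FFFD instead.
lenient = html.unescape('&#55357;&#56842;')
print(lenient == '\ufffd\ufffd')  # True
```

Writing the entity as a single reference to that code point, `&#128522;` (or `&#x1F60A;`), avoids the surrogates altogether.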

romainhk commented 3 years ago

Oh, I forgot to answer :) You are totally right, it's not related to WeasyPrint after all. I'll look into reporting this to Beautiful Soup directly. I've handled a lot of encoding problems in the past few years, but it's the first time I've seen surrogate ones. We never stop learning :)

ddelange commented 1 year ago

Hi @romainhk 👋 did you end up reporting this upstream?

Kozea / WeasyPrint

UnicodeEncodeError: surrogates not allowed with BeautifulSoup #1438