Closed romainhk closed 3 years ago
Hello, and thanks for the report!
This bug is not related to WeasyPrint. Calling print(soup) after soup = … crashes too.
The two character numbers you use for your entity are surrogates, and can’t be used in valid UTF-8 strings. Browsers (and WeasyPrint when you use the html variable directly) can detect that the UTF-8 string is invalid, and nicely replace the characters with a Replacement Character (�). BeautifulSoup doesn’t do that.
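To illustrate the point about surrogates, here is a minimal sketch using only the Python standard library. The entity is an assumption (the surrogate pair for U+1F600 written as two decimal character references), since the original snippet is not shown in this thread. It shows that strict UTF-8 encoding rejects surrogates, while html.unescape applies the browser-style substitution.

```python
import html

# Assumed example entity: the UTF-16 surrogate pair for U+1F600, written as
# two decimal character references. The entity from the report is not shown.
entity = "&#55357;&#56832;"

# A Python str can hold surrogate code points, but strict UTF-8 cannot encode them.
surrogates = "".join(chr(int(n)) for n in ("55357", "56832"))
try:
    surrogates.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# html.unescape follows the HTML5 rule: each surrogate reference becomes U+FFFD.
browser_like = html.unescape(entity)

print("strict UTF-8 encoding succeeded:", strict_ok)
print("replaced with U+FFFD:", browser_like == "\ufffd\ufffd")
```

A single reference such as &#128512; avoids the problem entirely, because it names the code point directly instead of its UTF-16 halves.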
Oh, I forgot to answer :) You are totally right, it's not related to WeasyPrint after all. I'll look into escalating this to Beautiful Soup directly. I've handled a lot of encoding problems in the past few years, but it's the first time I've seen surrogate ones. We never stop learning :)
I get a weird crash when I try to convert to PDF an HTML text that contains a smiley as an HTML entity and was previously modified by BeautifulSoup (what a strange use case, you might say :)
Here is a script to reproduce it:
Versions:
I don't know if the issue belongs here or in bs4, but apparently nobody has run into it before :) Maybe a try/except would be useful on that line.
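For what it's worth, the try/except idea could look roughly like the sketch below. The helper name make_encodable is hypothetical and is not part of WeasyPrint or Beautiful Soup. It assumes the text is an ordinary Python str and that joining valid surrogate pairs back into one character is acceptable. Any leftover lone surrogate becomes U+FFFD.

```python
import re

# High surrogate followed by a low surrogate: a UTF-16 pair spelled as two code points.
_PAIR = re.compile("([\ud800-\udbff])([\udc00-\udfff])")
# Any surrogate that remains once the pairs have been joined.
_LONE = re.compile("[\ud800-\udfff]")


def _join_pair(match):
    # Standard UTF-16 arithmetic to rebuild the code point from its two halves.
    high, low = ord(match.group(1)), ord(match.group(2))
    return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))


def make_encodable(text):
    """Hypothetical helper: return a version of text that encodes as UTF-8."""
    try:
        text.encode("utf-8")
        return text  # already fine, nothing to do
    except UnicodeEncodeError:
        text = _PAIR.sub(_join_pair, text)
        return _LONE.sub("\ufffd", text)


# Assumed input: the surrogate pair for U+1F600, then one lone surrogate.
cleaned = make_encodable("ok \ud83d\ude00 \ud83d")
print(cleaned.encode("utf-8"))
```

Joining the pairs is a design choice that keeps the intended smiley. Replacing every surrogate with U+FFFD, as browsers do, would be the stricter alternative.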