dentarg / pynik

:tiger: Internet Relay Chat bot

åäö lost in title #12

Closed dentarg closed 8 years ago

dentarg commented 8 years ago
21:18:50 <@dentarg> http://www.dn.se/nyheter/varlden/pensionarer-planerade-stora-juvelstoten-pa-puben/  
21:18:50 < rufwebot> Pension?rer planerade stora juvelst?ten p? puben - DN.SE
dentarg commented 8 years ago

From view-source:http://www.dn.se/nyheter/varlden/pensionarer-planerade-stora-juvelstoten-pa-puben/

<title>

        Pension&#228;rer planerade stora juvelst&#246;ten p&#229; puben - DN.SE
</title>
dentarg commented 8 years ago

what happens in the current solution:

>>> print unichr(228).encode('ascii', 'replace')
?

what we want to happen (UTF-8 instead of ASCII is needed, because code point 228 is outside the ASCII range):

>>> print unichr(228).encode('utf-8', 'replace')
ä
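
For anyone reproducing this on Python 3, here is a minimal sketch of the same comparison. It assumes Python 3, where `chr` replaces `unichr` and `encode` returns `bytes`; the variable names are only for illustration.

```python
# Python 3 sketch of the comparison above (assumption: the bot is ported to Python 3).
ch = chr(228)  # U+00E4, the character "ä"; chr() is the Python 3 spelling of unichr()

# ASCII cannot represent code point 228, so errors="replace" substitutes "?".
ascii_bytes = ch.encode("ascii", "replace")

# UTF-8 represents it as a two-byte sequence.
utf8_bytes = ch.encode("utf-8")

print(ascii_bytes)  # b'?'
print(utf8_bytes)   # b'\xc3\xa4'
```

The `?` in the bot's output is the replacement character produced by the ASCII encoder, so the information is lost at encode time and cannot be recovered afterwards.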

http://stackoverflow.com/questions/2087370/decode-html-entities-in-python-string

>>> import HTMLParser
>>> h = HTMLParser.HTMLParser()
>>> print h.unescape('Pension&#228;rer planerade stora juvelst&#246;ten p&#229; puben - DN.SE')
Pensionärer planerade stora juvelstöten på puben - DN.SE
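
Putting the two pieces together, a possible shape for the fix is sketched below. This is not the bot's actual code: `clean_title` is a hypothetical helper, and the sketch assumes Python 3, where `html.unescape` takes the place of `HTMLParser.HTMLParser().unescape`. It also collapses the whitespace around the title, since the `<title>` element in the page source spans several lines.

```python
import html


def clean_title(raw_title):
    """Hypothetical helper: decode HTML entities and collapse whitespace in a scraped title."""
    # html.unescape converts numeric references such as &#228; into the real characters.
    text = html.unescape(raw_title)
    # The scraped title contains newlines and indentation; join the words with single spaces.
    return " ".join(text.split())


raw = "\n\n        Pension&#228;rer planerade stora juvelst&#246;ten p&#229; puben - DN.SE\n"
title = clean_title(raw)

# Encode as UTF-8 (not ASCII) before sending, so that å, ä and ö survive.
payload = title.encode("utf-8")
print(payload)
```

Whether UTF-8 is the right wire encoding depends on what the IRC channel's clients expect; that is an assumption here, based on the UTF-8 example earlier in this issue.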