truncated utf-8 byte sequences force an encoding override

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?

1. # telnet www.arnaudmontebourg.fr 80
Trying 213.186.33.16...
Connected to www.arnaudmontebourg.fr.
Escape character is '^]'.
GET /?feed=rss2 HTTP/1.1
Host: www.arnaudmontebourg.fr 

Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
...
<?xml version="1.0" encoding="UTF-8"?>
...
<title>Retour d’Algérie : sortir des querelles du passé</title>
...

2.
Python 2.6.6 (r266:84292, Dec 26 2010, 22:31:48) 
[GCC 4.4.5] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import feedparser
>>> d = feedparser.parse("http://www.arnaudmontebourg.fr/?feed=rss2")
>>> print d
...
'title': u'Retour d\xe2\x80\x99Alg\u0102\u0160rie : sortir des querelles du pass\u0102\u0160'

3.
print u'Retour d\xe2\x80\x99Alg\u0102\u0160rie : sortir des querelles du pass\u0102\u0160'

What is the expected output? What do you see instead?
Expected output: Retour d’Algérie : sortir des querelles du passé
Instead: Retour dâAlgĂŠrie : sortir des querelles du passĂŠ
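
For anyone trying to reproduce the garbling without fetching the feed, here is 
a minimal sketch. It is written for Python 3 (the session above is Python 2.6), 
and the variable names are made up for illustration. It only shows that the 
escapes in the parsed title are consistent with correct UTF-8 bytes being 
decoded with a single-byte codec such as ISO-8859-2. It does not establish 
which code path feedparser actually took.

```python
# Python 3 sketch; names are illustrative, not part of feedparser.
expected = "Retour d\u2019Algérie : sortir des querelles du passé"

# The bytes a server would send for this title when encoding it as UTF-8.
raw = expected.encode("utf-8")

# Decoding those UTF-8 bytes with a single-byte Central European codec
# yields the same escapes seen in the parsed title above.
garbled = raw.decode("iso-8859-2")
print(ascii(garbled))

# For this one string no bytes were lost, so the damage can be reversed.
repaired = garbled.encode("iso-8859-2").decode("utf-8")
print(repaired == expected)
```

The round trip only works for strings whose bytes arrived intact. It cannot 
recover characters that were truncated before they reached the parser.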

What version of the product are you using? On what operating system?
Debian GNU/Linux 6.0
Linux 2.6.32-5-amd64 #1 SMP Mon Oct 3 03:59:20 UTC 2011 x86_64 GNU/Linux
python-feedparser - Universal Feed Parser for Python 

Python 2.6.6

Original issue reported on code.google.com by pelat.la...@gmail.com on 26 Nov 2011 at 3:01

GoogleCodeExporter commented 9 years ago

The feed in question (a copy of which is attached for longevity) is hilariously 
malformed. The problem stems from the site encoding Unicode characters using 
UTF-8...and then truncating them in the middle of a multi-byte sequence. 
Consequently, only the leading byte or bytes of a multi-byte character are 
included and the rest are missing, resulting in invalid UTF-8 that causes 
feedparser to decode the entire feed as windows-1252.

Here's an example of the problem. In one of the original entries, the author 
uses the following character, which is encoded to a three byte sequence in 
UTF-8:

>>> u'’'.encode('utf-8')
'\xe2\x80\x99'

Doing a hexdump of the feed reveals that the character was truncated to the 
single byte '\xe2', which is not valid UTF-8 on its own.
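
A minimal sketch of why the lone byte is a problem, written for Python 3 rather 
than the Python 2 prompt used above. The helper name `is_valid_utf8` is 
hypothetical and exists only for this illustration. A strict UTF-8 decoder 
rejects any document containing the truncated sequence, even when every other 
byte is fine.

```python
# Python 3 sketch; is_valid_utf8 is a hypothetical helper, not a feedparser API.
full = "\u2019".encode("utf-8")  # the complete three-byte sequence
truncated = full[:1]             # only the lead byte survives the truncation


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly under strict UTF-8 rules."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(full))                   # complete character
print(is_valid_utf8(truncated))              # lead byte with no continuation
print(is_valid_utf8(b"<title>d" + truncated + b"Alg</title>"))
```

One bad byte anywhere in the feed is enough to make a strict decode of the 
whole document fail.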

I don't believe that this is a problem that feedparser *has* to overcome, 
though it would be nice if it did. A better solution would be to report this as 
a bug in WordPress 3.2.1 (which is the software powering the site). They're 
gearing up for their 3.3 release, so this would be an ideal opportunity to 
report the bug so it gets fixed!

Original comment by kurtmckee on 26 Nov 2011 at 5:40

Changed title: truncated utf-8 byte sequences force an encoding override

Attachments:

truncated-utf8.xml

GoogleCodeExporter commented 9 years ago

Thanks for your answer kurtmckee, I'm going to report it.

Original comment by pelat.la...@gmail.com on 26 Nov 2011 at 5:49

GoogleCodeExporter commented 9 years ago

I'm actually trying to do so myself, but I discovered they did a site-wide 
password reset earlier this year so I'm currently trying to get access again. 
It looks like this is a known problem that was fixed four years ago:

https://core.trac.wordpress.org/ticket/6077

I don't know if it's appropriate to reopen that ticket or open a new one.

Original comment by kurtmckee on 26 Nov 2011 at 5:52

GoogleCodeExporter commented 9 years ago

After better researching the issue I've created a new ticket:

https://core.trac.wordpress.org/ticket/19368

If this is a problem in WordPress, hopefully it'll be resolved quickly to 
everyone's benefit! Meanwhile, I'm going to leave this report open until I have 
an opportunity to figure out if there's a good way to fix this in feedparser 
without b0rking existing support. My big concern is that trusting the declared 
encoding using code like

    '\xe2'.decode('utf-8', 'replace')

will fix this specific case but will break the case where the encoding is 
completely misdeclared.
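
To make the trade-off concrete, here is a rough sketch in Python 3 (not the 
Python 2 syntax used above). The function `decode_with_replace` and both sample 
byte strings are invented for illustration. This is not how feedparser 
behaves, only a demonstration of what the 'replace' error handler does in 
each situation.

```python
# Python 3 sketch; decode_with_replace and the sample data are hypothetical.
REPLACEMENT = "\ufffd"  # U+FFFD, substituted for each undecodable byte


def decode_with_replace(data: bytes, declared: str = "utf-8"):
    """Trust the declared encoding and count how many bytes were replaced."""
    text = data.decode(declared, "replace")
    return text, text.count(REPLACEMENT)


# Case 1: genuine UTF-8 with one character truncated to its lead byte.
truncated_feed = "Retour d".encode("utf-8") + b"\xe2" + "Algérie".encode("utf-8")

# Case 2: windows-1252 bytes wrongly declared as UTF-8.
misdeclared_feed = "Algérie, passé".encode("windows-1252")

print(decode_with_replace(truncated_feed))
print(decode_with_replace(misdeclared_feed))
```

In the first case only the damaged character is lost and the rest of the text 
survives. In the second case every accented character is silently replaced, 
even though decoding with the real encoding would have recovered all of them.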

Original comment by kurtmckee on 27 Nov 2011 at 6:05

GoogleCodeExporter commented 9 years ago

Thanks for reporting it! I have checked the WordPress ticket. As I'm not the 
owner of the site, I can't help them identify which plugin is affected by 
this issue. 
I will remove the feed from my list! ;-)

Thanks for everything!

Original comment by pelat.la...@gmail.com on 27 Nov 2011 at 4:59

GoogleCodeExporter commented 9 years ago

> I will remove the feed from my list

Oh snap, then feedparser isn't doing a satisfactory job! Like I said, I'll take 
a look at this when I have an opportunity.

Original comment by kurtmckee on 27 Nov 2011 at 9:00

GoogleCodeExporter commented 9 years ago

Well, you cannot fix every bug that is not yours ;) Don't bother with it! :) 
Your report on WordPress was great! :)

Original comment by pelat.la...@gmail.com on 27 Nov 2011 at 9:02

bpinkert / feedparser

truncated utf-8 byte sequences force an encoding override #306