Open GoogleCodeExporter opened 9 years ago
The feed in question (a copy of which is attached for longevity) is hilariously
malformed. The problem stems from the site encoding Unicode characters using
UTF-8...and then truncating them mid-byte sequence. Consequently, only the
leading byte or bytes of a multi-byte character are included and the rest are
missing, resulting in binary garbage that causes feedparser to decode the
entire feed as windows-1252.
Here's an example of the problem. In one of the original entries, the author
uses the following character (U+2019, a right single quotation mark), which is
encoded to a three-byte sequence in UTF-8:
>>> u'’'.encode('utf-8')
'\xe2\x80\x99'
Doing a hexdump of the feed reveals that that character was truncated to the
single byte '\xe2', which is garbage.
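The truncation described above can be reproduced in a few lines. This is only a sketch, written for Python 3 (where byte strings are spelled b'...') rather than the Python 2 used elsewhere in this thread. The helper name is_valid_utf8 is made up for illustration and is not part of feedparser.

```python
# Illustrative sketch only; is_valid_utf8 is a hypothetical helper,
# not a feedparser function.

full = '\u2019'.encode('utf-8')   # the complete three-byte sequence
truncated = full[:1]              # only the lead byte survives, as in the hexdump


def is_valid_utf8(data):
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(full))        # the complete sequence is valid
print(is_valid_utf8(truncated))   # the lone lead byte is not

# The same lone byte is a perfectly legal windows-1252 character,
# which is why a fallback to that encoding "succeeds".
print(repr(truncated.decode('windows-1252')))
```

A strict UTF-8 decode rejects the lone lead byte, while windows-1252 accepts it as an ordinary accented letter.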
I don't believe that this is a problem that feedparser *has* to overcome,
though it would be nice if it did. A nicer solution would be to report this as
a bug in WordPress 3.2.1 (which is the software powering the site). They're
gearing up for their 3.3 release, so this would be an ideal opportunity to
report the bug so it gets fixed!
Original comment by kurtmckee
on 26 Nov 2011 at 5:40
Thanks for your answer kurtmckee, I'm going to report it.
Original comment by pelat.la...@gmail.com
on 26 Nov 2011 at 5:49
I'm actually trying to do so myself, but I discovered they did a site-wide
password reset earlier this year so I'm currently trying to get access again.
It looks like this is a known problem that was fixed four years ago:
https://core.trac.wordpress.org/ticket/6077
I don't know if it's appropriate to reopen that ticket or open a new one.
Original comment by kurtmckee
on 26 Nov 2011 at 5:52
After better researching the issue I've created a new ticket:
https://core.trac.wordpress.org/ticket/19368
If this is a problem in WordPress, hopefully it'll be resolved quickly to
everyone's benefit! Meanwhile, I'm going to leave this report open until I have
an opportunity to figure out if there's a good way to fix this in feedparser
without b0rking existing support. My big concern is that trusting the declared
encoding using code like
'\xe2'.decode('utf-8', 'replace')
will fix this specific case but will break the case that the encoding is
completely misdeclared.
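To make that trade-off concrete, here is a small sketch in Python 3 syntax. The sample byte strings are invented for illustration and are not taken from the feed. It shows the 'replace' error handler coping with one truncated character, and the same call damaging text whose declared encoding is simply wrong.

```python
# Illustrative sketch only; both byte strings are made-up samples.

# Case 1: the feed really is UTF-8, but one character was truncated
# to its lead byte. 'replace' swaps the bad byte for U+FFFD and the
# rest of the text survives.
truncated_utf8 = b'don\xe2t'
truncated_case = truncated_utf8.decode('utf-8', 'replace')
print(repr(truncated_case))

# Case 2: the feed declares UTF-8 but the bytes are actually
# windows-1252. Trusting the declaration turns every accented
# character into U+FFFD.
really_cp1252 = 'caf\u00e9 r\u00e9sum\u00e9'.encode('windows-1252')
misdeclared_case = really_cp1252.decode('utf-8', 'replace')
print(repr(misdeclared_case))

# Decoding with the true encoding recovers the text intact.
print(repr(really_cp1252.decode('windows-1252')))
```

In the first case one character is lost. In the second, every non-ASCII character is lost, even though a windows-1252 decode would have recovered all of them.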
Original comment by kurtmckee
on 27 Nov 2011 at 6:05
Thanks for reporting it! I have checked the WordPress ticket. As I'm not the
owner of the site, I can't help them identify which plugin is affected by
this issue.
I will remove the feed from my list! ;-)
Thanks for everything!
Original comment by pelat.la...@gmail.com
on 27 Nov 2011 at 4:59
> I will remove the feed from my list
Oh snap, then feedparser isn't doing a satisfactory job! Like I said, I'll take
a look at this when I have an opportunity.
Original comment by kurtmckee
on 27 Nov 2011 at 9:00
Well, you cannot correct every bug that is not yours ;) Don't bother with it! :)
Your report to WordPress was great! :)
Original comment by pelat.la...@gmail.com
on 27 Nov 2011 at 9:02
Original issue reported on code.google.com by
pelat.la...@gmail.com
on 26 Nov 2011 at 3:01