encoding gb2312 isn't always upgraded to gb18030

GoogleCodeExporter commented 9 years ago

For this feed:
http://hi.baidu.com/%CE%DE%BC%AB%C3%DE/rss

The encoding specified in the XML declaration is 'gb2312', but 
feedparser uses 'utf-8' to parse it.

Original issue reported on code.google.com by flytwoki...@gmail.com on 19 Apr 2012 at 3:32

GoogleCodeExporter commented 9 years ago

Well, shoot! feedparser's current behavior is actually correct, and is 
following RFC3023; Baidu's server is broken. It *kills* me that I'm going to 
have to break compatibility to fix this (and I might break other things 
unexpectedly!), but getting the encoding "right" (that is, by ignoring 
standards and by making guesses) is just too important.

The problem is that Baidu is incorrectly serving the feed with the MIME type 
'text/xml' (which is "supposed" to be for XML that could be viewed simply as a 
text file). According to the RFC, processors are explicitly forbidden from 
using the encoding in the XML declaration because the type is 'text/*'. This 
wouldn't normally be a problem, except that Baidu also fails to specify a 
character set in the HTTP headers. Consequently, the default character set 
'us-ascii' is assumed, which fails miserably, and so feedparser falls back on 
'utf-8'.
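
To make the rule described above concrete, here is a minimal sketch of how an RFC 3023-style processor could pick an encoding from the Content-Type header and the XML declaration. The function name `rfc3023_encoding` and its parameters are hypothetical and are not feedparser's actual code; the non-'text/' branch is simplified (a real processor would also look at byte order marks).

```python
from email.message import Message


def rfc3023_encoding(content_type_header, xml_declared_encoding):
    """Hypothetical helper: pick an encoding roughly as RFC 3023 describes."""
    # Let the standard library parse the media type and its parameters.
    msg = Message()
    msg["Content-Type"] = content_type_header
    mime_type = msg.get_content_type()      # e.g. "text/xml"
    http_charset = msg.get_param("charset")  # None when no charset is given

    if mime_type.startswith("text/"):
        # For text/* the XML declaration is ignored entirely;
        # a missing charset parameter means us-ascii.
        return http_charset or "us-ascii"

    # Simplified: for other XML types the HTTP charset wins, then the
    # XML declaration, then the XML default of utf-8.
    return http_charset or xml_declared_encoding or "utf-8"


# Baidu's case: 'text/xml' with no charset, declaration says gb2312.
baidu_result = rfc3023_encoding("text/xml", "gb2312")
```

Under these assumptions the declared 'gb2312' is never consulted for the feed in this report, which is why a strictly conforming parser ends up with 'us-ascii' and then has to fall back.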

I've got to modify existing unit tests when I fix this, which suggests that I 
might break existing functionality for certain feeds, so this probably won't 
end up in the next bugfix release.

*WHEW* So now I think I've documented the problem, in case people come back to 
this report wondering why I changed the code.

Original comment by kurtmckee on 25 Apr 2012 at 7:14

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

This issue was closed by revision r737.

Original comment by kurtmckee on 28 May 2012 at 6:39

Changed state: Fixed

GoogleCodeExporter commented 9 years ago

After spending a lot of time getting familiar with feedparser's encoding 
detection code (and revamping it quite a bit), I was able to fix this issue by 
simply guaranteeing that gb2312 is always upgraded to gb18030 (gb18030 is a 
superset of gb2312). Thank goodness it wasn't as difficult as I had first 
thought!
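
For readers wondering what "upgrading" looks like in practice, here is a minimal sketch using only Python's standard codecs. The `UPGRADES` table and the `upgrade_encoding` and `decode_feed` helpers are hypothetical names for illustration; the actual change in r737 is not reproduced in this thread.

```python
import codecs

# Hypothetical mapping from a declared encoding to a superset encoding.
UPGRADES = {"gb2312": "gb18030"}


def upgrade_encoding(name):
    """Normalize an encoding name and swap it for a superset if one is known."""
    canonical = codecs.lookup(name).name  # "GB2312" -> "gb2312"
    return UPGRADES.get(canonical, canonical)


def decode_feed(data, declared_encoding):
    """Decode bytes with the upgraded form of the declared encoding."""
    return data.decode(upgrade_encoding(declared_encoding))


# Text that really is GB2312 decodes the same way under gb18030.
plain = "中文".encode("gb2312")
assert decode_feed(plain, "GB2312") == "中文"

# "镕" is outside GB2312, so feeds labeled gb2312 that contain it
# only decode cleanly once the encoding is upgraded.
extended = "朱镕基".encode("gb18030")
assert decode_feed(extended, "gb2312") == "朱镕基"
```

Because gb18030 is byte-compatible with gb2312 for every character gb2312 defines, the upgrade should not change how correctly labeled feeds decode.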

Thanks for reporting this!

Original comment by kurtmckee on 28 May 2012 at 6:41

Changed title: encoding gb2312 isn't always upgraded to gb18030
