Closed GoogleCodeExporter closed 9 years ago
Well, shoot! feedparser's current behavior is actually correct, and is
following RFC 3023; Baidu's server is broken. It *kills* me that I'm going to
have to break compatibility to fix this (and I might break other things
unexpectedly!), but getting the encoding "right" (that is, by ignoring
standards and by making guesses) is just too important.
The problem is that Baidu is incorrectly serving the feed with the MIME type
'text/xml' (which is "supposed" to be for XML that could be viewed simply as a
text file). According to the RFC, processors are explicitly forbidden from
using the encoding in the XML declaration because the media type is 'text/*'.
This wouldn't normally be a problem, except that Baidu also fails to specify a
character set in the HTTP headers. Consequently, the default character set
'us-ascii' is assumed, which fails miserably on the Chinese text, and so
feedparser falls back on 'utf-8'.
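To make the rule concrete, here is a minimal sketch of the RFC 3023 decision described above. It is not feedparser's actual detection code: the function name `rfc3023_encoding` is hypothetical, and the sketch ignores byte order marks and the later fallback to 'utf-8' when decoding fails.

```python
import re

def rfc3023_encoding(content_type, body):
    """Choose an encoding roughly the way RFC 3023 prescribes (simplified).

    Hypothetical helper for illustration only; BOM sniffing is omitted.
    """
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # charset parameter from the HTTP Content-Type header, if any
    match = re.search(r'charset\s*=\s*"?([^";\s]+)', params, re.I)
    http_charset = match.group(1).lower() if match else None

    # encoding named in the XML declaration, if any
    decl = re.match(rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', body)
    xml_encoding = decl.group(1).decode("ascii").lower() if decl else None

    if media_type.startswith("text/"):
        # text/*: the XML declaration must be ignored; without an HTTP
        # charset the default is us-ascii.
        return http_charset or "us-ascii"
    # application/*: HTTP charset wins, then the XML declaration, then utf-8.
    return http_charset or xml_encoding or "utf-8"


feed = b'<?xml version="1.0" encoding="gb2312"?><rss/>'
print(rfc3023_encoding("text/xml", feed))
print(rfc3023_encoding("application/xml", feed))
```

With a 'text/xml' header and no charset, the declared gb2312 never gets a chance, which is the situation in this report.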
I've got to modify existing unit tests when I fix this, which suggests that I
might break existing functionality for certain feeds, so this probably won't
end up in the next bugfix release.
*WHEW* So now I think I've documented the problem, in case people come back to
this report wondering why I changed the code.
Original comment by kurtmckee
on 25 Apr 2012 at 7:14
This issue was closed by revision r737.
Original comment by kurtmckee
on 28 May 2012 at 6:39
After spending a lot of time getting familiar with feedparser's encoding
detection code (and revamping it quite a bit), I was able to fix this issue by
simply guaranteeing that gb2312 is always upgraded to gb18030 (gb18030 is a
superset of gb2312). Thank goodness it wasn't as difficult as I had first
thought!
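For anyone curious, the idea can be sketched in a few lines. This is only an illustration and not the code from r737; `upgrade_encoding` and `decode_feed` are hypothetical names.

```python
def upgrade_encoding(name):
    """Map gb2312 to its superset gb18030; leave other names untouched."""
    normalized = (name or "").strip().lower()
    if normalized == "gb2312":
        return "gb18030"
    return name


def decode_feed(raw, declared):
    """Decode raw feed bytes using the (possibly upgraded) declared encoding."""
    return raw.decode(upgrade_encoding(declared))


# Text that really is gb2312 still decodes, because gb18030 is a superset.
print(decode_feed("中文".encode("gb2312"), "gb2312"))

# Bytes mislabeled as gb2312 that use characters beyond it also decode.
print(decode_feed("朱镕基".encode("gb18030"), "GB2312"))
```

Since every valid gb2312 byte sequence means the same thing in gb18030, the upgrade should be safe for correctly labeled feeds and only helps mislabeled ones.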
Thanks for reporting this!
Original comment by kurtmckee
on 28 May 2012 at 6:41
Original issue reported on code.google.com by
flytwoki...@gmail.com
on 19 Apr 2012 at 3:32