concerto / concerto-simple-rss

Simple RSS Dynamic Content for Concerto 2

Properly handle feed encoding #41

Closed simplysoft closed 10 years ago

simplysoft commented 10 years ago

Concerto content generated from feeds containing special characters (such as umlauts or non-breaking spaces) contained � (http://www.fileformat.info/info/unicode/char/0fffd/index.htm) characters. This was caused by re-encoding UTF-8 content to UTF-8 while the Ruby string was marked as ASCII-8BIT encoded.
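The effect can be reproduced without any feed at all. The following is a minimal sketch, not the code from this plugin: it assumes the conversion to UTF-8 was done with replacement options enabled, and the sample text is made up for illustration.

```ruby
# Valid UTF-8 bytes for "Grüße", but tagged as binary (ASCII-8BIT),
# which is how the feed body arrived before this fix.
raw = "Gr\u00FC\u00DFe".b

# Transcoding from ASCII-8BIT to UTF-8 treats every non-ASCII byte as
# undefined, so each of those bytes is replaced with U+FFFD.
garbled = raw.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

# Relabeling the bytes (no conversion) keeps the text intact.
fixed = raw.dup.force_encoding(Encoding::UTF_8)

puts garbled
puts fixed
```

The difference is that `encode` converts bytes from the labeled encoding, while `force_encoding` only changes the label.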

1) net/http itself does not handle the content type encoding (quite a surprise), so before this fix it always returned an ASCII-8BIT string, even if the feed content was actually UTF-8. See https://bugs.ruby-lang.org/issues/2567 for details. Luckily open-uri, a wrapper around net/http, does handle the content type encoding correctly (https://github.com/ruby/ruby/blob/trunk/lib/open-uri.rb#L438-454).
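For anyone who has to stay on net/http, the same idea can be applied by hand: read the charset from the Content-Type header and relabel the body. This is only a rough sketch of that idea; `tag_body_encoding` is a hypothetical helper, not part of net/http, open-uri, or this plugin, and the UTF-8 default is an assumption.

```ruby
# Hypothetical helper: label a binary response body with the charset
# announced in its Content-Type header (falling back to a default).
def tag_body_encoding(body, content_type, default: Encoding::UTF_8)
  # Pull "utf-8" out of e.g. "application/rss+xml; charset=utf-8".
  charset = content_type.to_s[/charset\s*=\s*"?([^\s;"]+)/i, 1]
  encoding = begin
    charset ? Encoding.find(charset) : default
  rescue ArgumentError
    default # unknown charset name
  end
  # Relabel only; the bytes themselves are left untouched.
  body.dup.force_encoding(encoding)
end

body = "caf\u00E9".b # stand-in for a net/http response body
feed = tag_body_encoding(body, "application/rss+xml; charset=utf-8")
puts feed.encoding
```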

2) It turns out that xslt.serve() does not respect content encoding at all either: regardless of the input XML encoding or the XSLT stylesheet output encoding, the returned data is always ASCII-8BIT. As a workaround, this fix now forces the encoding to be the same as that of the incoming XML. Maybe ruby-xslt should be replaced by nokogiri.org, which looks like it should handle encoding correctly.
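The workaround amounts to something like the sketch below. `match_source_encoding` is a hypothetical name used for illustration, the binary string stands in for whatever xslt.serve() returns, and the validity check is an extra safeguard that may not be in the actual fix.

```ruby
# Hypothetical helper: relabel transformer output with the encoding of
# the XML that was fed into the transformation.
def match_source_encoding(output, source_xml)
  relabeled = output.dup.force_encoding(source_xml.encoding)
  # Keep the original string if its bytes are not valid in that encoding.
  relabeled.valid_encoding? ? relabeled : output
end

source_xml  = "<title>Gr\u00FC\u00DFe</title>" # UTF-8 input XML
xslt_output = "<h1>Gr\u00FC\u00DFe</h1>".b     # stand-in for the ASCII-8BIT result
html = match_source_encoding(xslt_output, source_xml)
puts html.encoding
```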