description tag should be in UTF-8 encoding but it is in ASCII-8BIT

cardmagic / simple-rss

A simple, flexible, extensible, and liberal RSS and Atom reader for Ruby. It is designed to be backwards compatible with the standard RSS parser, but will never do RSS generation.

https://github.com/cardmagic/simple-rss

description tag should be in UTF-8 encoding but it is in ASCII-8BIT #15

Open emaillenin opened 10 years ago

emaillenin commented 10 years ago

Tried this also:

l.description.force_encoding('UTF-8').encode!('UTF-8',:invalid => :replace,:replace => '')

But still ending up with: Uncaught exception: invalid byte sequence in UTF-8

emaillenin commented 10 years ago

Using

l.description.to_s.encode('UTF-8', {:invalid => :replace, :undef => :replace, :replace => ''})

solves the issue, but we lose the original Unicode character that was in the source.
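A minimal sketch of why the characters disappear and one way to keep them, assuming the description's bytes are actually valid UTF-8 that has merely been tagged ASCII-8BIT. Transcoding from ASCII-8BIT to UTF-8 treats every byte above 0x7F as undefined, so `:undef => :replace` drops them; re-tagging a copy keeps the bytes intact. The helper name `retag_as_utf8` and the sample string are hypothetical, not part of simple-rss.

```ruby
# Hypothetical sample: "It's" with a right single quotation mark (U+2019),
# whose UTF-8 bytes are \xE2\x80\x99, tagged as ASCII-8BIT via String#b.
raw = "It\xE2\x80\x99s".b

# Transcoding binary -> UTF-8 drops every non-ASCII byte (the data loss above).
lossy = raw.encode("UTF-8", invalid: :replace, undef: :replace, replace: "")

# Hypothetical helper: re-tag a copy as UTF-8 instead of transcoding it.
# Bytes are only removed (via String#scrub, Ruby 2.1+) if they are not valid UTF-8.
def retag_as_utf8(str)
  copy = str.dup.force_encoding(Encoding::UTF_8)
  copy.valid_encoding? ? copy : copy.scrub("")
end

kept = retag_as_utf8(raw)
```

Here `lossy` no longer contains the quotation mark, while `kept` does. Whether re-tagging is appropriate depends on the feed really being UTF-8; a feed in another encoding would need a real transcode from that encoding.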

eugene-nikolaev commented 10 years ago

Got the same issue.

eugene-nikolaev commented 10 years ago

There is content.force_encoding('binary') in the if condition:

  def unescape(content)
    if content.respond_to?(:force_encoding) && content.force_encoding("binary") =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
        CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
        content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

The force_encoding method changes the string's encoding in place, so every string returned by simple-rss ends up tagged as ASCII-8BIT...
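The in-place behaviour can be seen in isolation with a short sketch. The sample string is hypothetical; `String#b` (Ruby 2.0+) is shown as one non-mutating way to get a binary-tagged copy, not as what simple-rss does.

```ruby
# Hypothetical stand-in for a feed description, initially UTF-8.
description = "caf\u00E9".dup
before = description.encoding          # Encoding::UTF_8

# force_encoding re-tags the receiver itself, even when it is only
# called inside a condition, so the caller's string changes too.
description.force_encoding("binary")
after_mutation = description.encoding  # Encoding::ASCII_8BIT

# String#b returns a binary-tagged copy and leaves the original alone.
fresh = "caf\u00E9".dup
binary_copy = fresh.b
```

After this runs, `description` is ASCII-8BIT while `fresh` is still UTF-8, which is the difference between the mutating check and a copy-based one.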

I'd rewrite it the following way, but I'm not sure the 'if' condition itself is right, so I'm not making a pull request.

  def unescape(content)
    if content.respond_to?(:force_encoding) && encode_binary(content) =~ /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/n then
        CGI.unescape(content).gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    else
        content.gsub(/(<!\[CDATA\[|\]\]>)/,'').strip
    end
  end

  def encode_binary(content)
    content.encode('binary', {:invalid => :replace, :undef => :replace, :replace => ''})
  end

emaillenin commented 10 years ago

Hi @evgeniynickolaev, can you please test it with a feed that has non-Latin characters? Meanwhile I will try to post a sample where it failed for me.

eugene-nikolaev commented 10 years ago

Yes, I've tested it with a feed containing the following Unicode symbol (as UTF-8 bytes): \xE2\x80\x99. But I'm not sure it is 100% correct, as I don't fully understand the logic of this unescaping.

terotil commented 10 years ago

Just as @evgeniynickolaev pointed out, the immediate source of the problem is force_encoding("binary"), which (even though the name does not end in a bang) mutates the string object in place. However, apparently the reason for adding the force_encoding was the "n" flag on the regexp within the conditional, introduced in https://github.com/cardmagic/simple-rss/commit/ac95fb4cf69bbbe0d3a1a1c31f21bf2acbf25d1e. That flag says that the regex should be interpreted as binary (ASCII-8BIT) no matter what the source encoding is (see http://www.ruby-doc.org/core-2.1.3/Regexp.html#class-Regexp-label-Encoding).

I'll throw in a fix which simply removes all the fiddling with encodings. I can't figure out any reason why there would be any need for that.
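For illustration only, here is a rough sketch of what an unescape without any encoding changes could look like. This is an assumption about the shape of such a fix, not the actual commit; the method name `unescape_sketch` and the constants are hypothetical, and the regex is the one quoted above with the "n" flag dropped.

```ruby
require "cgi"

# Same character class as the quoted unescape, without the "n" flag
# (hypothetical constant names).
ESCAPED_HINT = /([^-_.!~*'()a-zA-Z\d;\/?:@&=+$,\[\]]%)/
CDATA_MARKERS = /(<!\[CDATA\[|\]\]>)/

# Hypothetical sketch: never calls force_encoding, so the caller's
# string keeps whatever encoding it arrived with.
def unescape_sketch(content)
  text = content =~ ESCAPED_HINT ? CGI.unescape(content) : content
  text.gsub(CDATA_MARKERS, "").strip
end

plain = unescape_sketch("<![CDATA[ It\u2019s ]]>")
decoded = unescape_sketch("x %41")
```

With valid UTF-8 input, `plain` keeps its non-ASCII character and its UTF-8 encoding. Input containing invalid byte sequences would still raise during matching, so this sketch does not address malformed feeds.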

chengguangnan commented 9 years ago

I ran into the same problem. This gem is not well maintained. I'm going with other gems.

jeremyhaile commented 9 years ago

@chengguangnan what other gem have you found that is well maintained?

chengguangnan commented 9 years ago

Hi @jeremyhaile, I switched to feedjira.