Fix invalid codepoints in doubly-encoded character references

norman commented 8 years ago

This PR is motivated by an observation by @hfaulds that documents with badly-encoded HTML character references such as &#56256;&#56324; will cause the library to raise the following error:

RangeError: invalid codepoint 0xDBC0 in UTF-8
        from /usr/lib/ruby/2.3.0/cgi/util.rb:79:in `chr'
        from /usr/lib/ruby/2.3.0/cgi/util.rb:79:in `block in unescapeHTML'
        from /usr/lib/ruby/2.3.0/cgi/util.rb:66:in `gsub'
        from /usr/lib/ruby/2.3.0/cgi/util.rb:66:in `unescapeHTML'

Character references are used to specify any UTF-8 character in HTML by codepoint.

If an invalid codepoint is given anywhere in a string, CGI.unescape_html will raise a RangeError.

The method added here will examine a string for character references, and when found, replace them with the decoded UTF-8 character. If the given codepoint is invalid, it will replace the character reference with the empty string. In Rets, this problem can arise when a parsed document contains a character reference to an invalid codepoint that is HTML-encoded two times. Nokogiri will decode it the first time, passing through the now singly-encoded invalid character reference to CGI.unescape_html.

phiggins commented 8 years ago

Did we ever come up with a test for this one?

I'd prefer to stick with the stdlib rather than introduce a dependency.

norman commented 8 years ago

Did we ever come up with a test for this one?

No, I've been working on other things. I will make time for this soon.

norman commented 8 years ago

@phiggins I worked on this some more and found the root of the problem. I've since force-pushed and rewrote the description. Can you give it another look over? Thanks!

phiggins commented 8 years ago

It looks fine codewise, but I couldn't verify if this actually fixes the issues we're having with CINCYMLS.

phiggins commented 8 years ago

This might be a total red herring, but I noticed that we mess with the encoding here: https://github.com/estately/rets/blob/master/lib/rets/client.rb#L337-340

Is it possible we're doing something wrong before we even get to the parser?

norman commented 8 years ago

I couldn't verify if this actually fixes the issues we're having with CINCYMLS.

I was able to verify it locally previously when I resolved the problem using Escape Utils. It's entirely possible that the MLS in question is no longer sending the record with the bad codepoint.

In order for this to be an issue, the document must not only contain an invalid UTF-8 codepoint encoded as a character reference, but have it HTML-encoded it twice: Once to make it past Nokogiri, which HTML-decodes the documents and discards invalid codepoints, and twice to reach CGI.html_decode and the other code this patch adds. I don't think this will happen very often, so I'm not at all surprised you can't see it locally any more.

norman commented 8 years ago

This might be a total red herring, but I noticed that we mess with the encoding here:
res.body.encode!("UTF-8", res.body.encoding, :invalid => :replace, :undef => :replace)
Is it possible we're doing something wrong before we even get to the parser?

No, I don't think this would cause problems downstream in this library because it will do one of two things: return a valid UTF-8 string, or raise an error. If it were mangling the string in such a way that it appeared to succeed but produced invalid UTF-8, that would be a truly spectacular bug in Ruby. That's not impossible, but very unlikely.

In any event, that would not be relevant to the bug at hand because the non-ASCII characters in the response body that caused the error were encoded as HTML character references, making the document pure ASCII. The call to encode! on that string would not touch the string's bytes because ASCII is already a proper subset of UTF-8.

hfaulds commented 8 years ago

Looks good to me :)

tdtran commented 8 years ago

Theoretically speaking the input text is indeed invalid and the code has every right to reject it. Practically speaking since it's a server error and the client has near zero chance to get it fixed the gem should at least offer an option to skip the error and continue. So I agree this is good to merge. But we should log the fact that the original input has been modified, perhaps by printing a message to stderr and quote the input fragment.

estately / rets

Fix invalid codepoints in doubly-encoded character references #186