Regular expression in strip_invalid_utf_8_chars is invalid Ruby code under Ruby 2.0

The regular expressions in the strip_invalid_utf_8_chars function (starting at line 326 of lib/oai/client.rb) cause Ruby 2.0.0 to produce an "invalid multibyte escape" error. The error occurs when Ruby parses the script, before it ever gets to the point of trying to run the function.

This is probably because Ruby 2.0 changed the default source encoding for Ruby scripts from US-ASCII to UTF-8. The regular expressions are written to match sequences of bytes which are invalid UTF-8, so the patterns themselves contain byte sequences which are invalid UTF-8, and a UTF-8 source file cannot legally contain them.
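The failure can be reproduced outside the library with a minimal sketch like the one below. The pattern is illustrative only, not the actual one from lib/oai/client.rb, and the helper name `regexp_compile_error` is hypothetical. It compiles the pattern at run time from a UTF-8 string, so the error can be rescued and inspected instead of aborting the parse.

```ruby
# Hypothetical helper: compile a pattern from a UTF-8 source string and
# return the error message, or nil if the pattern compiles cleanly.
def regexp_compile_error(source)
  # Tag the pattern text as UTF-8, mirroring a script whose source
  # encoding is UTF-8 (the Ruby 2.0 default).
  Regexp.new(source.dup.force_encoding(Encoding::UTF_8))
  nil
rescue RegexpError => e
  e.message
end

# Single quotes keep '\xC0' as the four characters \, x, C, 0, i.e. a
# regexp escape for the byte 0xC0, which never occurs in valid UTF-8.
# This should report an "invalid multibyte escape" error.
puts regexp_compile_error('\xC0').inspect

# An ASCII-only pattern is unaffected and yields nil.
puts regexp_compile_error('[a-z]').inspect
```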

A suggested fix is to simply delete the function entirely, since it is only used in the do_request method (in the same file) to attempt to clean up results returned from an SRU provider. As implemented, it will do the wrong thing with correct non-UTF-8 encodings (blindly converting bytes to question mark characters) and silently hide real errors in faulty UTF-8 encodings. Real errors should be exposed rather than hidden.
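As a rough sketch of what "exposing" the error could look like, the check below validates a response body instead of rewriting it. It assumes the body is expected to be UTF-8, which may not hold for every provider, and the method name `ensure_valid_utf8` is hypothetical and not part of ruby-oai. It relies only on `String#valid_encoding?`, which is available in both Ruby 1.9.3 and 2.0.

```ruby
# Hypothetical check: validate the bytes rather than repairing them.
# Assumes the caller expects UTF-8; other encodings would need to be
# handled before this point.
def ensure_valid_utf8(body)
  # Work on a copy so the caller's string keeps its original encoding tag.
  text = body.dup.force_encoding(Encoding::UTF_8)
  unless text.valid_encoding?
    raise ArgumentError, "response body is not valid UTF-8"
  end
  text
end

# Well-formed input is returned unchanged.
puts ensure_valid_utf8("caf\u00E9")

# Malformed input raises instead of being turned into question marks.
begin
  ensure_valid_utf8("abc\xC0")
rescue ArgumentError => e
  puts "rejected: #{e.message}"
end
```

Raising here is one possible design; it makes a faulty provider visible to the caller at the point of the request rather than producing silently altered text.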

The author must have had a use case for implementing this, so simply deleting the function might be a "backward incompatible change" that causes some programs to suddenly fail. But it could be argued that those programs should have been failing in the first place, and that their appearing to work was itself a bug.

If possible, it would also be good to ensure the same behaviour occurs (when encountering malformed UTF-8 responses) with both Ruby 2.0 and Ruby 1.9.3.
