bugsnag / bugsnag-ruby

BugSnag error monitoring & reporting software for Rails, Sinatra, Rack and Ruby
https://docs.bugsnag.com/platforms/ruby
MIT License

Fix Unicode encoding issues with detailed_message #817

Closed imjoehaines closed 4 months ago

imjoehaines commented 4 months ago

Goal

Ruby 3.2's Exception#detailed_message method returns a string whose bytes are valid UTF-8 but whose String#encoding is set to ASCII_8BIT. This causes issues when we later convert the string to UTF-8 (for sending as JSON), because transcoding from ASCII-8BIT treats every non-ASCII byte as undefined and replaces it:

irb(main):001> a = Exception.new("Обичам те\n大好き")
=> #<Exception:"Обичам те\n大好き">
irb(main):002> a.detailed_message
=> "\xD0\x9E\xD0\xB1\xD0\xB8\xD1\x87\xD0\xB0\xD0\xBC \xD1\x82\xD0\xB5 (Exception)\n\xE5\xA4\xA7\xE5\xA5\xBD\xE3\x81\x8D"
irb(main):003> a.detailed_message.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004> a.detailed_message.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)
=> "������������ ���� (Exception)\n���������"
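The replacement behaviour can be reproduced without Ruby 3.2 or an exception object. A minimal sketch, assuming a string tagged as binary with String#b is a fair stand-in for the detailed message (the variable names are illustrative only):

```ruby
# "大好き" is 9 bytes of UTF-8; String#b returns a copy tagged ASCII-8BIT.
binary = "大好き".b

# Transcoding (not relabelling): every byte >= 0x80 has no defined
# mapping from ASCII-8BIT to UTF-8, so each one is replaced with U+FFFD.
replaced = binary.encode(Encoding::UTF_8, invalid: :replace, undef: :replace)

puts binary.encoding   # the binary tag on the input
puts replaced.length   # one replacement character per original byte
```

The original text is lost at this point: the output contains one replacement character per input byte rather than the three characters that were encoded.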

If the detailed message is forced to UTF-8 then it works as expected:

irb(main):005> b = a.detailed_message.force_encoding(Encoding::UTF_8)
=> "Обичам те (Exception)\n大好き"

This can then be sent as JSON correctly.
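One way this relabelling could be applied defensively is sketched below. The helper name `relabel_as_utf8` is hypothetical and is not taken from bugsnag-ruby; it assumes that a binary-tagged message usually holds UTF-8 bytes and falls back to scrubbing when it does not:

```ruby
# Hypothetical helper for illustration; not part of bugsnag-ruby's API.
def relabel_as_utf8(string)
  # Strings with any other encoding are returned untouched.
  return string unless string.encoding == Encoding::ASCII_8BIT

  # dup so the caller's string keeps its tag; force_encoding changes
  # only the encoding label, never the underlying bytes.
  relabelled = string.dup.force_encoding(Encoding::UTF_8)

  # If the bytes are not valid UTF-8 after all, replace only the
  # invalid sequences instead of every non-ASCII byte.
  relabelled.valid_encoding? ? relabelled : relabelled.scrub
end

message = "Обичам те (Exception)\n大好き".b # stand-in for detailed_message
fixed = relabel_as_utf8(message)

puts fixed.encoding
puts fixed
```

Checking `valid_encoding?` before trusting the new label keeps genuinely binary messages from producing invalid UTF-8 further down the line.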

You can compare the bytes in this string with the "ASCII-8BIT" encoded string above and they match exactly[^1]:

irb(main):006> b.bytes.map { |byte| byte.to_s(16) }.map(&:upcase)
=> ["D0", "9E", "D0", "B1", "D0", "B8", "D1", "87", "D0", "B0", "D0", "BC", "20", "D1", "82", "D0", "B5", "20", "28", "45", "78", "63", "65", "70", "74", "69", "6F", "6E", "29", "A", "E5", "A4", "A7", "E5", "A5", "BD", "E3", "81", "8D"]

The bytes in the middle are " (Exception)\n", which is the only part displayed literally in the ASCII-8BIT output:

irb(main):017> ["20", "28", "45", "78", "63", "65", "70", "74", "69", "6F", "6E", "29", "A"].map { |x| x.to_i(16) }.pack("C*")
=> " (Exception)\n"
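The byte comparison above can also be done programmatically rather than by eye. A small sketch, again using String#b as an assumed stand-in for the binary-tagged detailed message:

```ruby
original   = "Обичам те (Exception)\n大好き"
binary     = original.b                                # tagged ASCII-8BIT
relabelled = binary.dup.force_encoding(Encoding::UTF_8)

# force_encoding leaves the bytes alone, so both strings hold the same data.
same_bytes = binary.bytes == relabelled.bytes

# Zero-padded hex, so the newline prints as "0A" rather than "A".
hex = relabelled.bytes.map { |byte| format("%02X", byte) }

puts same_bytes
puts hex.join(" ")
```

Unlike `to_s(16)`, `format("%02X", byte)` pads single-digit values, which makes the dump easier to line up against the escaped ASCII-8BIT output.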

Testing

[^1]: You would expect this, as force_encoding doesn't change the underlying bytes, but decoding them back into the original input proves that it really is a UTF-8 string.