SpringMT / zstd-ruby

Ruby binding for zstd(Zstandard - Fast real-time compression algorithm)
https://github.com/facebook/zstd
BSD 3-Clause "New" or "Revised" License

String#hash of strings generated by decompress_buffered is different from that of literal strings in Ruby 3.3.0 or later #89

Closed abicky closed 6 months ago

abicky commented 6 months ago

We encountered a strange problem: we could not look up a value in a Hash when the lookup key was a string returned by Zstd.decompress, even though the Hash key was a multibyte string literal with the same content. I found that the problem reproduces only if the compressed data doesn't carry the Frame_Content_Size information, that is, only when decompress_buffered is used.
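To check which decompression path a given payload takes, the frame header can be inspected directly. The following is a minimal sketch based on the Zstandard frame format (4-byte magic number, then the Frame_Header_Descriptor byte); it does not use zstd-ruby at all, and the names `ZSTD_MAGIC` and `frame_content_size_present?` are hypothetical helpers introduced only for illustration.

```ruby
# Pure-Ruby probe of a Zstandard frame header (no gem required).
# ZSTD_MAGIC and frame_content_size_present? are hypothetical names.
ZSTD_MAGIC = [0xFD2FB528].pack('V') # magic number, stored little-endian

# Returns true if the frame header carries a Frame_Content_Size field.
def frame_content_size_present?(frame)
  bytes = frame.b
  raise ArgumentError, 'not a Zstandard frame' unless bytes.byteslice(0, 4) == ZSTD_MAGIC

  descriptor = bytes.getbyte(4) # Frame_Header_Descriptor
  raise ArgumentError, 'truncated frame header' if descriptor.nil?

  fcs_flag = descriptor >> 6          # bits 7-6: Frame_Content_Size_flag
  single_segment = descriptor[5] == 1 # bit 5: Single_Segment_flag

  # With flag 0, the field exists only when Single_Segment_flag is set.
  fcs_flag != 0 || single_segment
end

# The payload from this issue: descriptor byte is 0x00, so no size field.
issue_frame = ['28B52FFD0058180000E38182010000'].pack('H*')
puts frame_content_size_present?(issue_frame)
```

For the payload in this issue the descriptor byte is 0x00, so the probe reports that no Frame_Content_Size is stored, which matches the condition described above.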

I'm not sure whether this is a bug in Ruby or in zstd-ruby.

Here is code that reproduces the problem:

require 'zstd-ruby'

# This constant was generated by the following Java code
# and the compressed data doesn't have Frame_Content_Size:
#
#   import com.github.luben.zstd.RecyclingBufferPool;
#   import com.github.luben.zstd.ZstdOutputStreamNoFinalizer;
#   import org.apache.kafka.common.utils.ByteBufferOutputStream;
#   import javax.xml.bind.DatatypeConverter;
#
#   import java.io.BufferedOutputStream;
#   import java.io.DataOutputStream;
#   import java.io.IOException;
#   import java.nio.charset.StandardCharsets;
#
#   class Main {
#       public static void main(String[] args) throws IOException {
#           ByteBufferOutputStream buffer = new ByteBufferOutputStream(10);
#           DataOutputStream stream = new DataOutputStream(new BufferedOutputStream(new ZstdOutputStreamNoFinalizer(buffer, RecyclingBufferPool.INSTANCE), 16 * 1024));
#           stream.write("あ".getBytes(StandardCharsets.UTF_8));
#           stream.close();
#
#           System.out.println(DatatypeConverter.printHexBinary(buffer.buffer().array()));
#       }
#   }
COMPRESSED_DATA_HEX = '28B52FFD0058180000E38182010000'

data = Zstd.decompress([COMPRESSED_DATA_HEX].pack('H*')).force_encoding('UTF-8')
expected_data = 'あ'
puts <<~MSG
  RUBY_VERSION: #{RUBY_VERSION}
  data: #{data}
  data == expected_data: #{data == expected_data}
  data.equal?(expected_data): #{data.equal?(expected_data)}
  data.hash: #{data.hash}
  expected_datadata.hash: #{expected_data.hash}
  { expected_data => 1 }.has_key?(data): #{{ expected_data => 1 }.has_key?(data)}
MSG

Here is the output:

RUBY_VERSION: 3.3.1
data: あ
data == expected_data: true
data.equal?(expected_data): false
data.hash: 3328309050837243483
expected_datadata.hash: 3486244608461787623
{ expected_data => 1 }.has_key?(data): false

As you can see, { expected_data => 1 }.has_key?(data) is false even though data == expected_data is true.

In Ruby 3.2.2, the result is as expected:

RUBY_VERSION: 3.2.2
data: あ
data == expected_data: true
data.equal?(expected_data): false
data.hash: 3278076437348888334
expected_datadata.hash: 3278076437348888334
{ expected_data => 1 }.has_key?(data): true

SpringMT commented 6 months ago

I checked a simple compress and decompress round trip.

require 'zstd-ruby'

expected_data = "あ"
data = Zstd.decompress(Zstd.compress(expected_data)).force_encoding('UTF-8')

puts <<~MSG
  RUBY_VERSION: #{RUBY_VERSION}
  data: #{data}
  data == expected_data: #{data == expected_data}
  data.equal?(expected_data): #{data.equal?(expected_data)}
  data.hash: #{data.hash}
  expected_datadata.hash: #{expected_data.hash}
  { expected_data => 1 }.has_key?(data): #{{ expected_data => 1 }.has_key?(data)}
MSG

RUBY_VERSION: 3.3.1
data: あ
data == expected_data: true
data.equal?(expected_data): false
data.hash: -3461927809926074668
expected_datadata.hash: -3461927809926074668
{ expected_data => 1 }.has_key?(data): true

I will dig deeper into it.

SpringMT commented 6 months ago

irb(main):025> Zstd.decompress([COMPRESSED_DATA_HEX].pack('H*')).force_encoding('UTF-8').codepoints
=> [227, 129, 130]
irb(main):026> "あ".codepoints
=> [12354]
irb(main):027> Zstd.decompress(Zstd.compress("あ")).force_encoding('UTF-8').codepoints
=> [12354]
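
The three values returned for the decompressed string are exactly the raw UTF-8 bytes of "あ", as if each byte were treated as its own character. The sketch below does not reproduce the bug; it only shows, in plain Ruby without zstd-ruby, how bytes and codepoints relate for a healthy string and for a binary copy of it.

```ruby
# Illustration only: bytes vs. codepoints for a healthy UTF-8 string.
literal = "\u3042" # the character "あ"

p literal.bytes        # raw UTF-8 bytes: [227, 129, 130]
p literal.codepoints   # one Unicode codepoint: [12354]

# A binary (ASCII-8BIT) copy has no multibyte characters,
# so every byte is reported as its own codepoint.
p literal.b.codepoints # [227, 129, 130]
```
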
abicky commented 6 months ago

@SpringMT Thank you for your quick response! My colleague asked about this on the ruby-jp Slack (cf. https://ruby-jp.slack.com/archives/CLWSHA76V/p1716424178458799), and mame san found that String#ascii_only? unexpectedly returns true for the decompressed string:

require 'zstd-ruby'

COMPRESSED_DATA_HEX = '28B52FFD0058180000E38182010000'

data = Zstd.decompress([COMPRESSED_DATA_HEX].pack('H*')).force_encoding('UTF-8')
expected_data = 'あ'
puts <<~MSG
  RUBY_VERSION: #{RUBY_VERSION}
  data: #{data}
  data == expected_data: #{data == expected_data}
  data.equal?(expected_data): #{data.equal?(expected_data)}
  data.hash: #{data.hash}
  expected_data.hash: #{expected_data.hash}
  data.ascii_only?: #{data.ascii_only?}
  expected_data.ascii_only?: #{expected_data.ascii_only?}
  { expected_data => 1 }.has_key?(data): #{{ expected_data => 1 }.has_key?(data)}
MSG

RUBY_VERSION: 3.3.1
data: あ
data == expected_data: true
data.equal?(expected_data): false
data.hash: 2035519590434718668
expected_data.hash: -25190453235001085
data.ascii_only?: true
expected_data.ascii_only?: false
{ expected_data => 1 }.has_key?(data): false
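
A string in this state can be spotted by comparing what String#ascii_only? reports with a fresh scan of the bytes. The helper below is a hypothetical diagnostic sketch (the name `ascii_only_consistent?` is not part of Ruby or zstd-ruby); for any healthy string it returns true, and for the decompressed string shown above, which contains bytes >= 0x80 yet reports ascii_only? as true, it would return false.

```ruby
# Hypothetical diagnostic: does String#ascii_only? agree with the actual bytes?
def ascii_only_consistent?(str)
  # ascii_only? is only meaningful for ASCII-compatible encodings.
  return true unless str.encoding.ascii_compatible?

  really_ascii = str.bytes.all? { |byte| byte < 0x80 } # fresh byte-level scan
  str.ascii_only? == really_ascii
end

puts ascii_only_consistent?("\u3042") # healthy multibyte string "あ"
puts ascii_only_consistent?('abc')    # healthy ASCII string
```
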
SpringMT commented 6 months ago

https://github.com/SpringMT/zstd-ruby/pull/90 should be able to fix it.

SpringMT commented 6 months ago

I released 1.5.6.6 (https://rubygems.org/gems/zstd-ruby/versions/1.5.6.6). Please test it 🙇

abicky commented 6 months ago

Awesome! I highly appreciate your support 🙇