alexandrnikitin / bloom-filter-scala

Bloom filter for Scala, the fastest for JVM
https://alexandrnikitin.github.io/blog/bloom-filter-for-scala/
MIT License

CanGenerateHashFromString is broken in JDK 9+ when a string contains non-Latin characters or the -XX:-CompactStrings JVM flag is used #53

Open seanrohead opened 3 years ago

seanrohead commented 3 years ago

CanGenerateHashFromStringByteArray, which is used on JDK 9+, assumes that the string is stored with one byte per character, so that the length of the underlying byte[] equals the length of the string. This assumption holds only if the string contains nothing but characters from the ISO-8859-1/Latin-1 character set. If the string contains any other character, it is stored in the underlying byte array as UTF-16 and the byte array is twice as long as the number of chars in the string. Additionally, this storage optimization can be disabled with the -XX:-CompactStrings JVM flag, in which case all strings are stored as UTF-16. See here and here for more information.
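To make the length mismatch concrete, here is a minimal sketch that uses only public String APIs. It does not read the real internal array (that would need reflection or Unsafe); it only mirrors the storage rule described above. The names CompactStringsSketch, isLatin1 and internalByteLength are hypothetical and are not part of bloom-filter-scala.

```scala
object CompactStringsSketch {
  // True when every char fits in one byte (0..255), i.e. the string is
  // eligible for the compact Latin-1 representation on JDK 9+.
  def isLatin1(s: String): Boolean = s.forall(_ <= 0xFF)

  // Expected length of the backing byte[] on JDK 9+ under the rule above:
  // 1 byte per char for compact Latin-1 strings, 2 bytes per char otherwise.
  // compactStrings = false models running with -XX:-CompactStrings.
  def internalByteLength(s: String, compactStrings: Boolean = true): Int =
    if (compactStrings && isLatin1(s)) s.length else s.length * 2
}
```

A hash function that reads s.length bytes from the backing array covers the whole string only in the first case. In the second case it would hash just the first half of the data.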

seanrohead commented 3 years ago

I opened a pull request for this: https://github.com/alexandrnikitin/bloom-filter-scala/pull/54/files

yarosman commented 3 years ago

I have a similar error, but with CanGenerateHashFromString:

Caused by: java.lang.ClassCastException: class [B cannot be cast to class [C ([B and [C are in module java.base of loader 'bootstrap')
    at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:27)
    at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:23)

seanrohead commented 3 years ago

@yarosman Are you using the latest version of the library? That issue was fixed in 0.13.0.

yarosman commented 3 years ago

> @yarosman Are you using the latest version of the library? That issue was fixed in 0.13.0.

@seanrohead We use 0.13.1

seanrohead commented 3 years ago

@yarosman Are you loading the bloom filter using serialization by any chance?

yarosman commented 3 years ago

@seanrohead Yes, we do. I found that we don't use the predefined writeTo/readTo methods, so we serialize the filter together with its CanGenerateHashFrom instance, which depends on the Java version. Or do you have another explanation or idea?
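If that hypothesis is right, the mechanism would look roughly like the sketch below. Everything in it is a hypothetical stand-in (Hasher, CharArrayHasher, ByteArrayHasher, Holder, SerializationSketch); none of it is the library's code. It only shows that plain Java serialization restores whichever implementation was captured when the object was written, not the one the reading JVM would have picked.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Hypothetical stand-ins for two hashing strategies, one per String layout.
sealed trait Hasher extends java.io.Serializable { def name: String }
case object CharArrayHasher extends Hasher { val name = "char[]-based" }
case object ByteArrayHasher extends Hasher { val name = "byte[]-based" }

// A filter-like holder that keeps the hasher chosen at construction time.
final case class Holder(hasher: Hasher)

object SerializationSketch {
  // Serializes a value to an in-memory buffer and reads it back.
  def roundTrip[A <: AnyRef](value: A): A = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(value) finally out.close()
    val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
    try in.readObject().asInstanceOf[A] finally in.close()
  }
}
```

A Holder written with CharArrayHasher still carries CharArrayHasher after the round trip. If the reading side runs on a JDK where strings are backed by byte[], casting that array to char[] would fail with the kind of ClassCastException shown in the trace above.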

yufan022 commented 2 years ago

> Have similar error but with CanGenerateHashFromString
>
> Caused by: java.lang.ClassCastException: class [B cannot be cast to class [C ([B and [C are in module java.base of loader 'bootstrap')
>     at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:27)
>     at bloomfilter.CanGenerateHashFrom$CanGenerateHashFromString$.generateHash(CanGenerateHashFrom.scala:23)

Did you try using CanGenerateHashFromStringByteArray?