FastFilter / fastfilter_java

Fast Approximate Membership Filters (Java)
Apache License 2.0
238 stars 27 forks source link

PasswordLookup.java uses wrong encoding to generate hash #4

Closed thoweber closed 5 years ago

thoweber commented 5 years ago

Nice work, however you are using ASCII encoding to generate the hash for the lookup. Comments state, that you did not know the original encoding.

The encoding used for generating the hashes is UTF-8 as state by Troy Hunt, see: https://www.troyhunt.com/introducing-306-million-freely-downloadable-pwned-passwords/#comment-3641591030

thomasmueller commented 5 years ago

I must be blind but the link you gave only shows the question, and not the answer.

Also, I have asked the same question again a month ago, see https://www.troyhunt.com/ive-just-launched-pwned-passwords-version-2/#comment-4388083390 (so far no answer).

I'm not convinced it's UTF-8. I tested with German (my native language), and found that very common words with Umlauts (e.g. Müller) are not found when using UTF-8. They are found when using ASCII. That's why I used ASCII. Maybe you can give some example of a common word that is found with UTF-8?

lemire commented 5 years ago

So I tested this out using the C++ version...

First, I assume that whenever Java does not recognize a character, it maps it to '?'. So I assume that 'Müller' becomes 'M?ller' in ASCII.

Anyhow, I find that both 'Müller' (7 bytes) and 'M?ller' (6 bytes) are in the set.

$ ./query_filter -s filter.bin Müller
using database: filter.bin
We are going to hash your input.
hashing this word: Müller (length in bytes = 7)
SHA1 hash cd2562f4 b5ecc3e0 2e669737 fde72b4c a6f2039f
hexval = 0xcd2562f4b5ecc3e0
Xor filter detected.
I expect the file to span 678357069 bytes.
memory mapping is a success.
Probably in the set.
Processing time 75.000 microseconds.
Expected number of queries per second: 13333.333

$ ./query_filter -s filter.bin M?ller
using database: filter.bin
We are going to hash your input.
hashing this word: M?ller (length in bytes = 6)
SHA1 hash e23d4bba 624a971e e4497d1a 89dc6d70 a5d383ed
hexval = 0xe23d4bba624a971e
Xor filter detected.
I expect the file to span 678357069 bytes.
memory mapping is a success.
Probably in the set.
Processing time 68.000 microseconds.
Expected number of queries per second: 14705.883
lemire commented 5 years ago

(I should add that my shell was configured with UTF-8.)

thoweber commented 5 years ago

@thomasmueller German is my native language as well ;-)

I tested with "Müller" UTF-8 --> works "München" UTF-8 --> works "Nürnberg" UTF-8 --> doesn't work, but works in ASCII (I assume the password has been mangled by the hacked site)

Regarding the link: These comments are hard to link to, because they are not all loaded initially. I had to click the "load more comments" button to get to the answer.

lemire commented 5 years ago

I confirm...

$ ./query_filter -s filter.bin Nürnberg
using database: filter.bin
We are going to hash your input.
hashing this word: Nürnberg (length in bytes = 9)
SHA1 hash aae81fd2 ff75d864 ea9c55b1 a3e2246d da35c6e9
hexval = 0xaae81fd2ff75d864
Xor filter detected.
I expect the file to span 678357069 bytes.
memory mapping is a success.
Surely not in the set.
Processing time 73.000 microseconds.
Expected number of queries per second: 13698.630

$ ./query_filter -s filter.bin N?rnberg
using database: filter.bin
We are going to hash your input.
hashing this word: N?rnberg (length in bytes = 8)
SHA1 hash e736a6b7 cad2b466 8a0ba06f e4820072 4a08b5db
hexval = 0xe736a6b7cad2b466
Xor filter detected.
I expect the file to span 678357069 bytes.
memory mapping is a success.
Probably in the set.
Processing time 87.000 microseconds.
Expected number of queries per second: 11494.253
lemire commented 5 years ago

I wonder what happens with Chinese or some other language where ASCII is a no-go?

thoweber commented 5 years ago

screenshot

Attached is a screenshot of Troy's answer

thomasmueller commented 5 years ago

screenshot

Attached is a screenshot of Troy's answer

Ah! I don't see his answer until I click on "show more comments"... "show more replies". Then I also see it.

Given that some passwords are mangled, I guess it makes sense to test with both UTF-8 and ASCII. That's more lookups, but not really a problem.

lemire commented 5 years ago

Note that their online tool does not do multiple lookups.

thomasmueller commented 5 years ago

Yes, the online tool doesn't... It doesn't find "jörg", "Jörg", "jürg", "Jürg", which I would expect to be there as "Jörg" and "Jürg" are very common first names in Germany and Switzerland.

So I changed the code to use both ASCII and UTF-8 encoding: https://github.com/FastFilter/fastfilter_java/commit/0a33f04ced0fa6382bda51d3dc6d093efb9df5cb

thomasmueller commented 5 years ago

I consider this issue closed for now.