crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.45k stars 1.62k forks source link

Hashes mishandle NaN keys #3923

Open ezrast opened 7 years ago

ezrast commented 7 years ago

Hashes (and Sets) fail to find NaN when performing lookup.

h = {Float64::NAN => "value"}  # {nan.0 => "value"}
# h[Float64::NAN]              # Missing hash key: nan.0
h.delete(Float64::NAN)         # nil
h                              # {nan.0 => "value"}

Presumably, the lookup code is confused by NaN == NaN being false.

zatherz commented 7 years ago

This by itself is not really a bug (Float64::NAN != Float64::NAN, http://stackoverflow.com/questions/1565164/what-is-the-rationale-for-all-comparisons-returning-false-for-ieee754-nan-values).

The real bug is that h.delete_if &.nan? will not function properly, because delete_if uses delete with the key, which obviously doesn't work. h.select &.nan? works properly.

Basically, either delete_if on Hash should be modified to not use delete, or it should have a special NaN case, or Crystal should not follow the IEEE 754 standard, which is the least ideal option.

ezrast commented 7 years ago

You don't need to break the standard, you just need some means of comparing Set/Hash keys for identity that is distinct from the standards-conformant equality operator. I think it's unreasonable not to consider Set{Float64::NAN}.includes?(Float64::NAN) == false as a bug.

Ruby Hashes, for example, determine identity through the #eql? method (see Hash Keys and Object#eql?). That's clearly not the whole story since Float::NAN.eql? Float::NAN is false over there (I think there's another identity check that happens at the C level before falling back to #eql?) but it shows you what I mean.

zatherz commented 7 years ago

== is implemented as an overloadable operator in both Ruby and Crystal.

class Test
  def ==(other : Test)
    true
  end
en

Test.new == Test.new # => true
Test.new != Test.new # => false

class Test2
end

Test2.new == Test2.new # => false
Test2.new != Test2.new # => true

The standard requires NaN to not equal itself.

This is the implementation of Ruby's Hash#delete:

VALUE
rb_hash_delete_entry(VALUE hash, VALUE key)
{
    st_data_t ktmp = (st_data_t)key, val;

    if (!RHASH(hash)->ntbl) {
  return Qundef;
    }
    else if (RHASH_ITER_LEV(hash) > 0 &&
       (st_delete_safe(RHASH(hash)->ntbl, &ktmp, &val, (st_data_t)Qundef))) {
  FL_SET(hash, HASH_DELETED);
  return (VALUE)val;
    }
    else if (st_delete(RHASH(hash)->ntbl, &ktmp, &val)) {
  return (VALUE)val;
    }
    else {
  return Qundef;
    }
}
// some comments omitted here
VALUE
rb_hash_delete(VALUE hash, VALUE key)
{
    VALUE deleted_value = rb_hash_delete_entry(hash, key);

    if (deleted_value != Qundef) { /* likely pass */
  return deleted_value;
    }
    else {
  return Qnil;
    }
}

st_delete_safe calls st_general_delete which calls find_entry which uses PTR_EQUAL which uses EQUAL which looks like this:

#define EQUAL(tab,x,y) ((x) == (y) || (*(tab)->type->compare)((x),(y)) == 0)

This uses some sort of special function called compare which I assume is what causes NaN to equal NaN but only in that specific situation.

There is also a deeper problem: the bucket_index method of Hash which is used by delete overflows with the hash of Float64::NAN (9221120237041090560). It works if you convert the to_u32 and to_i calls to to_u64 and to_i64, for which I'm gonna submit a pull request tomorrow.

From that point it'll take just adding a special case for NaN or using some special method with that special case.

ezrast commented 7 years ago

The x and y passed into that EQUAL macro are object ID's, not the actual contents of any variable. Since individual numeric values always have the same object ID throughout the lifetime of a Ruby program, the (x) == (y) part will always be true when x and y refer to the same object or the same number, even if that "number" is NAN. Any equality or hashing methods defined in Ruby space (which eventually get called by that compare function, I believe) are irrelevant at that point. If NAN were a special case you couldn't get away with this: [edit: this is misleading; see MaxLap's comment below]

class Evil
  def ==(other); false; end
  def eql?(other); false; end
  def equal?(other); false; end
  def hash; rand; end
end

e = Evil.new
h = {}
h[e] = "one"
h[e] = "two"  # this looks like it should create a second entry in the hash table
# but ruby handles it properly
puts h  # prints {#<Evil:0x00561482427f38>=>"two"}
h.delete e
puts h  # prints {}

I have no idea if any of this is relevant to how things could be implemented in Crystal, but there it is.

Sorry if my issue was unclear initially. I'm aware of IEEE 754 and didn't mean to suggest that something was wrong with your float implementation. I just wanted to point out some surprising behavior I noticed while teaching myself Crystal.

zatherz commented 7 years ago

You're right. That makes a lot of sense.

class Evil
  def ==(other); false; end
  def eql?(other); false; end
  def equal?(other); false; end
  def hash; rand; end
end

e = Evil.new
h = {} of Evil => String
h[e] = "one"
h[e] = "two"
puts h
h.delete e
puts h

Output:

{#<Evil:0xac8fc0> => "one", #<Evil:0xac8fc0> => "two"}
{#<Evil:0xac8fc0> => "one", #<Evil:0xac8fc0> => "two"}
MaxLap commented 7 years ago

Ruby's handling of your Evil is because of 2 things:

I'll just add that because 0 and 0.0 have a hash collision, there is the following problem:

puts({ 0 => 1, 0.0 => 1000, 1.0 => 10, 1 => 100 }) #=> {0 => 1000, 1.0 => 10, 1 => 100}

Notice that the 0.0 key is missing, and it's value was assigned to the 0 key. This is the same problem as the NAN not matching, caused by using the == operator. In this case, it's because 0 == 0.0, which is not something that you want in a hash key context!

There needs to be a special method dedicated to hash equality. In ruby, this is the terribly named #eql? method. Please use a more explicit name like hash_same_key? or hash_equal_key?

drosehn commented 7 years ago

I'm not quite sure I understand the issue here, but my initial reaction is that crystal is handing NaN the way it should be handled, even when it used as a hash keys. The language has to treat every instance of NaN as if it is a different value than all other NaNs, unless it were to keep track of how each NaN was generated. (and that could get pretty painful...)

ezrast commented 7 years ago

My original issue is that it is possible to add a key to a hash table that cannot be retrieved through normal means. That seems antithetical to the purposes of Hashes and Sets. That Set{x}.includes?(x) can ever be false strikes me as a least-surprise violation and I don't see the use case for that behavior.

I do see from the discussion here and in chat that I held some incorrect presumptions about how NaN behaves in general and in Ruby, so if this is deemed a non-issue I won't object further. zatherz's and MaxLap's criticisms of the current implementation are still valid, of course.

There's some interesting reading on the issue here, which does support the current paradigm (tl;dr Go buys fully into the practice of making NaN keys irretrievable and hashes them randomly).

RX14 commented 7 years ago

If we accept that NaN should never be equal to itself, then shouldn't the correct behaviour for NaN's hashcode be to raise? There's no way to create a valid hashcode if it is invalid to compare NaN values, so why not raise? There as already some precedent for something like this in that NaN and Infinity raise when trying to serialize to JSON.

drosehn commented 7 years ago

Btw, I meant to add that I agree with @zatherz on this:

The real bug is that h.delete_if &.nan? will not function properly, because delete_if uses delete with the key, which obviously doesn't work. h.select &.nan? works properly.

And that given these three options:

Basically, either (1) delete_if on Hash should be modified to not use delete, or (2) it should have a special NaN case, or (3) Crystal should not follow the IEEE 754 standard, which is the least ideal option.

I would not want to see #3. I was also going to suggest that it might be good to document this issue with NaN keys in Hash#delete_if. However, I do like the idea of raise-ing an error if NaN is used as a key in a Hash. That will make the issue explicitly clear to the programmer at compile time, and then the programmer can decide exactly how it should be handled in their specific situation.

spalladino commented 7 years ago

I like the option of raising on NaN hashcode, but isn't there an assumption that hash functions should not raise? I think I'd prefer to go go's way (no pun intended) and have a random hash for NaN, and "fix" delete_if.

akzhan commented 7 years ago

Looks like we need to replace == comparisons in insert/find etc. with another one (that's == for most cases except NaN and INF).

RX14 commented 7 years ago

To make delete_if work properly, there needs to be a way to delete an entry given just the entry. The only sane way to do that would be to add a previous reference in Entry to the last entry in the bucket's linked list. This seems to me like a sign we should implement this along with open addressing which would make deletion much easier (only have to manage the insertion-order linked list).

I still maintain that NaN#hashcode should raise. I don't think there's an expectation that anything doesn't raise.

akzhan commented 7 years ago

NAN and INF are good candidates for keys sometimes. So I prefer to NAN.hash not raised.

drosehn commented 7 years ago

It seems to me that it would be very rare that NaN would be useful as the value for a key. Let's say a program is processing data where multiple key-values are NaN. Each one of those will go into a hash as a separate key, since NaN != NaN. And when you try to get the value of somehash[ NaN ], that will never succeed, no matter how many entries there are in the hash where the key is NaN.

It's one thing if you know a key might have the value of NaN, but I'm also thinking about the case where the value of a key is computed, and due to mistakes or oversights the program is creating NaN keys when the programmer is completely unaware that that is possible. And thus the program will run into odd behaviors due to the way that NaN-valued keys should be handled.

I guess I'm thinking about this from my own history. While I can imagine there might be some situation where it is useful to have NaN as a key that behaves as a NaN-value should behave, I have never wanted that. And if I wrote a program here NaN-values were being generated for keys to a hash, then I'd want the language to force me to make it explicit that NaN values were expected.

drosehn commented 7 years ago

I'll add that it's easy for me to think of programs where I'd compute a key for a hash, and that key would turn out to be NaN for valid reasons (which is to say "not due to bugs").

But if that's the case, I'd almost certainly want to handle those NaNs via some special processing. Maybe I'd map every one of those NaN-keys to a single constant-value key, so all of those cases would map to a single key-value pair. Or maybe I'd do something to store why the key ended up as NaN. But I wouldn't want to store each key-value pair as a separate entry in the hash, where all those entries would be unreachable via standard means.

jplatte commented 7 years ago

FWIW, Rust solves this by having constraints on the key type that float types don't fulfill. There, HashMap has a K: Hash + Ord constraint (where K is the key type), and f32 / f64 implement neither. You can however still compare floats, because there is the PartialOrd trait, which is implemented for f32 and f64 and which comparison operators are based on.

straight-shoota commented 3 years ago

NaN != NaN is a fundamental principle (or maybe more a convention, but point is it's not going to change). So there's no other way than to accept that. Inserting a key NaN into a hash table will produce multiple entries. That's just how NaN works. If you don't want that, you need to replace NaN with something else instead of using it as a hash key.

The main bug is h.reject! &.nan? doesn't work (#10511 fixes that).

Apart from that, there's is a performance issue in the hasher, because it uses the same hash value for any NaN. But since NaN != NaN we should try to make NaN.hash != NaN.hash, otherwise the hash table may suffer from an overcrowded bucket.

We should go the same way as Go and https://research.swtch.com/randhash in using a random hash value for NaN.

yxhuvud commented 3 years ago

you need to replace NaN with something else instead of using it as a hash key.

You will need to do that anyhow, as NaN entries wouldn't be fetchable or deleteable. As such they are useless in a hash table. Being able to insert an arbitrary amount of NaNs just open the hash table up to potential DoS attacks that spend all available memory in a long running table and strange potentially hard to find bugs where entries that were inserted can't be deleted without resorting to iterating through the hash. Eww.

asterite commented 3 years ago

What if we disallow NaN keys? Is there any use case for that? That is, we raise on that case.

straight-shoota commented 3 years ago

NaN entries wouldn't be fetchable or deleteable. As such they are useless in a hash table.

@yxhuvud Deleting NaN entries is possible and it works correctly in all cases with #10513 merged. Fetching and deleting aren't the only operations on a hash table and they even work for any other float value. Even if there are NaNs in the table. I don't buy that they're useless.

Being able to insert an arbitrary amount of NaNs just open the hash table up to potential DoS attacks

Basically the same applies to inserting arbitrary finite float values (or arbitrary values of any type). The only difference is that even though the range of (random) hash values is finite, NaN's inequality effectively allows an infinite number of keys. But in reality you'll run into practical issues much sooner than this distiction becomes relevant.

@asterite That's not practical. NaN hashes are not just an issue when the key type is a float. Compound types can be used as keys and a NaN value may be sitting anywhere in (nested) fields of type float.