JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.59k stars 5.48k forks source link

should `in` use `isequal` to test for containment? #9381

Closed StefanKarpinski closed 6 years ago

StefanKarpinski commented 9 years ago

This is how Dicts work, so there's a case to me made. Came up in this discussion:

https://groups.google.com/forum/#!topic/julia-users/XZDm57yHc5M

Motivating examples:

julia> NaN in [1,1.5,2,NaN]
false

julia> NaN in Set([1,1.5,2,NaN])
true

julia> NaN in keys(Dict(1=>"one", 1.5=>"one and a half", 2=>"two", NaN=>"not a number"))
true

It seems like in for arrays is the one that's out of line here.

jakebolewski commented 9 years ago

Wouldn't this necessitate two versions of Dicts / Sets, one using isequal and the other using egal for containment? I would find this behavior to be very confusing. For example is Set([1,1.0]) a single element set?

Edit. I see we already have this behavior.

JeffBezanson commented 9 years ago

This is another case that can be justified on the basis that the IEEE standard specifies the behavior of ==, but not of in. In other words, the standard doesn't say an in function must use IEEE == to compare elements. Arguably the IEEE NaN behavior should be used as rarely as possible. The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.

StefanKarpinski commented 9 years ago

@jakebolewski, Set([1,1.0]) is already a one-element set:

julia> Set([1,1.0])
Set([1.0])

This is a pretty straightforward consequence of using Dict as the foundation for Set. If we wanted the egal behavior, we would need an ObjectIdSet type to correspond to ObjectIdDict.

StefanKarpinski commented 9 years ago

The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.

This kind of sucks, but I think we have to live with it – it's either that or introduce yet another kind of equality, which we can do over my cold, dead body.

JeffBezanson commented 9 years ago

We definitely don't need another kind of equality. I think we will opt to keep you alive instead.

StefanKarpinski commented 9 years ago

I'm relieved.

StefanKarpinski commented 9 years ago

The other option would be to make -0.0 and 0.0 the same for isequal and hashing, but that seems bad too.

StefanKarpinski commented 9 years ago

Oh, we can't make isequal(-0.0, 0.0) since that would mean that -0.0 wouldn't sort before 0.0.

nalimilan commented 9 years ago

Is it really required that isless be consistent with isequal? Anyway with several concepts of equality, you already need to look at the docs to check what it's behavior is.

As a data point, MATLAB's isequal and ismember consider -0.0 and 0.0 as equal. R doesn't have isequal, but its identical function allows for either behavior via an argument, and its %in% function considers them as equal. Python's in does the same. So it doesn't look like distinguishing negative and positive zero is a compelling use-case: it's fine if it's only supported in special dict/set types. On the contrary, distinguishing them by default has the potential of introducing very subtle bugs for the 99% of users (source: classified survey) who didn't even think zero is not always zero.

StefanKarpinski commented 9 years ago

Is it really required that isless be consistent with isequal?

It's not strictly necessary, but I think anything else would be kind of a disaster, imo.

JeffBezanson commented 9 years ago

Interesting data point:

In [19]: np.nan == np.nan
Out[19]: False

In [20]: np.nan in [np.nan]
Out[20]: True

In [21]: -0.0
Out[21]: -0.0

In [22]: 0.0 in [-0.0]
Out[22]: True

I'm not sure how python's in is handling equality here.

nalimilan commented 9 years ago

It's not strictly necessary, but I think anything else would be kind of a disaster, imo.

I'm not so sure. -0.0 and 0.0 are sorted next to one another, so considering them as equal does not go against the ordering, it just loses a small bit of information. (I guess there is theory to express that more generally.) In what cases would that be an issue?

simonbyrne commented 9 years ago

One thing worth considering is making isless match the IEEE totalOrder predicate: the main difference is that NaNs with different bit-patterns are distinct, and negatively-signed NaNs precede -Inf: in a certain sense, this is the ordering corresponding to ===.

This could be done via

function totalorder(x::Float64,y::Float64)
    xi = reinterpret(Int64,x)
    yi = reinterpret(Int64,y)
    (xi < yi) $ ((xi < 0) & (yi < 0))
end

(which also seems to be faster than our current isless).

StefanKarpinski commented 9 years ago

That sort order has always struck me as weird and useless – why would I want some NaNs to come at the beginning and some at the end? If we started doing this, I feel like we'd almost need to start printing NaNs with a sign, which would also be strange. The only advantage I can see to the IEEE ordering is that you can implement it efficiently with integer comparisons.

JeffBezanson commented 9 years ago

you can implement it efficiently with integer comparisons.

Yep, that's how to think like a 1980s computer scientist :)

StefanKarpinski commented 9 years ago

Alright, so I would be ok with making isless a finer grained ordering than isequal, namely:

  1. isequal(-0.0, 0.0) but isless(-0.0, 0.0).
  2. isequal(nan1, nan2) for any two NaN values, but have isless order NaNs by bit pattern.
  3. all other equivalence classes are the same for isequal and isless.

I'm not inclined to have some NaNs negative while others are positive. That's just silly and useless.

JeffBezanson commented 9 years ago

Yes, that might be OK. Have to think about it.

By the way, if there are any python experts around I really would like to know what is happening in the example I posted above. It seems in is not calling __eq__ for np.nan. However it doesn't use is either, as this example shows:

In [1]: 1 is 1.0
Out[1]: False

In [2]: 1 == 1.0
Out[2]: True

In [3]: 1 in [1.0]
Out[3]: True

EDIT: Ok, my guess is that in uses is first, and returns true if it does, before calling __eq__.

StefanKarpinski commented 9 years ago

Oh, Python, you so crazy.

stevengj commented 9 years ago

Python's in is implemented by PySequence_Contains, which defaults to using PyObject_RichCompareBool(obj, item, Py_EQ), which calls the tp_richcompare slot of the type object, which for floating-point types turns into float_richcompare (whose comments begin Comparison is pretty much a nightmare) which uses the C == for two floating-point variables. However, PyObject_RichCompareBool first checks for object identity in which case it just skips the comparison. (See also the documentation for PyObject_RichCompareBool, which actually mentions this.)

So, Jeff's guess is correct.

(Yes, I've spent far too much time looking at the CPython codebase for someone who never programs in Python.)

StefanKarpinski commented 9 years ago

Clearly, I used up my "oh, Python, you so crazy" way too early in this thread.

Ismael-VC commented 9 years ago

This is an abstract from Python's documentation for comparison operators:

>>> help('in')

...

Comparison of objects of the same type depends on the type:

...

* The values "float('NaN')" and "Decimal('NaN')" are special. The
are identical to themselves, "x is x" but are not equal to
themselves, "x != x".  Additionally, comparing any value to a
not-a-number value will return "False".  For example, both "3 <
float('NaN')" and "float('NaN') < 3" will return "False".

...

The operators "in" and "not in" test for membership.  "x in s"
evaluates to true if *x* is a member of *s*, and false otherwise.  "x
not in s" returns the negation of "x in s".  All built-in sequences
and set types support this as well as dictionary, for which "in" tests
whether the dictionary has a given key. For container types such as
list, tuple, set, frozenset, dict, or collections.deque, the
expression "x in y" is equivalent to "any(x is e or x == e for e in
y)".

For the string and bytes types, "x in y" is true if and only if *x* is
a substring of *y*.  An equivalent test is "y.find(x) != -1".  Empty
strings are always considered to be a substring of any other string,
so """ in "abc"" will return "True".

For user-defined classes which define the "__contains__()" method, "x
in y" is true if and only if "y.__contains__(x)" is true.

For user-defined classes which do not define "__contains__()" but do
define "__iter__()", "x in y" is true if some value "z" with "x == z"
is produced while iterating over "y".  If an exception is raised
during the iteration, it is as if "in" raised that exception.

Lastly, the old-style iteration protocol is tried: if a class defines
"__getitem__()", "x in y" is true if and only if there is a non-
negative integer index *i* such that "x == y[i]", and all lower
integer indices do not raise "IndexError" exception.  (If any other
exception is raised, it is as if "in" raised that exception).

The operator "not in" is defined to have the inverse true value of
"in".

The operators "is" and "is not" test for object identity: "x is y" is
true if and only if *x* and *y* are the same object.  "x is not y"
yields the inverse truth value.

So even while 1 and 1.0 are equal, they are not the same object in memory:

In [1]: 1 is 1.0
Out[1]: False

In [2]: id(1) == id(1.0)
Out[2]: False
stevengj commented 9 years ago

@Ismael-VC, so their documentation is wrong, because it effectively uses x is z or x == z, not x == z.

kmsquire commented 9 years ago

Actually, the radix sort function in SortingAlgorithms.jl does use integer comparisons (and some shenanigans) for sorting floating point numbers.

On Wednesday, December 17, 2014, Jeff Bezanson notifications@github.com wrote:

you can implement it efficiently with integer comparisons.

Yep, that's how to think like a 1980s computer scientist :)

— Reply to this email directly or view it on GitHub https://github.com/JuliaLang/julia/issues/9381#issuecomment-67360582.

JeffBezanson commented 9 years ago

I think -0.0 is more important than NaN: code that tries to compare with NaN is always going to have problems. Finding 0 in an array containing -0.0 would be what most people want. So I think we have to use either "== or isequal" or just ==. Using a specific function is definitely simpler, so I say we keep the current behavior.

StefanKarpinski commented 9 years ago

How would you implement that efficiently for Sets? I feel that as containers arrays should behave like ordered multisets, which to me implies using isequal.

On Jun 22, 2015, at 6:27 PM, Jeff Bezanson notifications@github.com wrote:

I think -0.0 is more important than NaN: code that tries to compare with NaN is always going to have problems. Finding 0 in an array containing -0.0 would be what most people want. So I think we have to use either "== or isequal" or just ==. Using a specific function is definitely simpler, so I say we keep the current behavior.

— Reply to this email directly or view it on GitHub.

JeffBezanson commented 9 years ago

I think it's mostly ok for different containers to use different tests. Not finding 0 in an array with -0.0 could cause serious problems.

StefanKarpinski commented 9 years ago

We could also make isequal(-0.0, 0.0).

nalimilan commented 9 years ago

Yes, @StefanKarpinski's plan at https://github.com/JuliaLang/julia/issues/9381#issuecomment-67361516 sounds the best solution.

StefanKarpinski commented 9 years ago

Thanks for finding that comment, @nalimilan. To quote it and add to it:

  1. isequal(-0.0, 0.0) but isless(-0.0, 0.0).
  2. isequal(nan1, nan2) for any two NaN values, but have isless order NaNs by bit pattern.
  3. all other equivalence classes are the same for isequal and isless.
  4. use isequal for collection containment checking (arrays and sets).
JeffBezanson commented 9 years ago

What I dislike about that is that isequal and isless are supposed to be the sane comparisons. At this point I have half a mind to make NaN==NaN and -0.0==0.0 and remove isequal. Another option is to disallow NaN as a hash key.

StefanKarpinski commented 9 years ago

Does the proposed change make them insane comparisons?

nalimilan commented 9 years ago

@StefanKarpinski Pun intended? I think he meant "the same comparisons".

I'm not sure that's really a requirement, though. And wouldn't NaN == NAN go against common expectations?

carnaval commented 9 years ago

wouldn't NaN == NaN go against common expectations?

It's a funny way of putting it, I see what you mean but still, "common expectations" might include reflexivity of equality :-). I don't do fp math so I don't know how much people are attached to this part of the IEEE semantics.

JeffBezanson commented 9 years ago

Does the proposed change make them insane comparisons?

I would say so, yes. I definitely meant "sane" and not "same". The idea of isequal and isless was to have well-behaved equal and less predicates, obeying as many of the usual mathematical properties as possible.

JeffBezanson commented 9 years ago

I think I'd be ok with isequal(-0.0,0.0) and !isless(-0.0,0.0) though. Then isequal only differs for NaN.

simonbyrne commented 9 years ago

In #8463, there was also some discussion of making hash(0.0) == hash(-0.0): this would also make the hash function slightly simpler.

kmsquire commented 9 years ago

I think I'd be ok with isequal(-0.0,0.0) and !isless(-0.0,0.0) though.

I must be missing something, but wouldn't this cause problems with sorting?

JeffBezanson commented 9 years ago

wouldn't this cause problems with sorting?

I suppose you'd get zeros and negative zeros in an undefined order. We could provide an IEEE totalOrder predicate as well. But look, IEEE says -0.0 and 0.0 are equal, so what do they expect?

Looking back at the original motivation of the issue, I think one has to distinguish the case of deliberately searching for NaN using NaN in Y, vs. writing x in y generically, and considering the behavior when x happens to be a NaN. In the first case, you really need to use isnan. In the second case, it doesn't matter what happens --- any program that ignores NaNs is going to get weird answers if NaNs arise somewhere.

It would help to disallow NaNs as dict keys. It would make sense to have a test like

if !(key == key)
    error("key cannot be reliably identified and so cannot be inserted")
end

Then we could really start considering removing isequal.

ScottPJones commented 9 years ago

Whatever IEEE says about -0.0 and 0.0 being ==, having you get a many-to-one relationship when inserting something into a set or dict would not be a good thing, IMO. That's losing information. NaN seems kind of like null, do you allow a null key in a set or dict? Also, all NaN's are not alike. I do like the above test to give an error - I think it is consistent with the way null should be handled, i.e. a null value is not == to anything else, even another null, so the above test would handle that consistently.

JeffBezanson commented 9 years ago

Of all the possible issues here, I find it especially easy to imagine problems caused by distinguishing -0.0 and 0.0. Say an array has a -0.0, then somebody needs to ask "are there any zeros here?" It's definitely surprising if 0 in a doesn't find the zero.

With -0.0, you have to be really careful to propagate signs (for example this makes the code in complex.jl much more...complex). If it didn't exist, you'd have to be really careful computing 1/x where x might be -Inf. To me this is a wash.

ScottPJones commented 9 years ago

OK, then why not just say that -0.0 is converted to 0.0 when it is used as a key? Instead of trying to do something tricky when looking for it (making hashes equal even though the bit patterns are different, etc.)

JeffBezanson commented 9 years ago

According to https://github.com/JuliaLang/julia/issues/9381#issuecomment-114568028 it's not any trouble to make -0.0 and 0.0 hash equally, given what our hash function needs to do anyway.

JeffBezanson commented 9 years ago

Normalizing -0.0 to 0.0 on key insertion might be a good idea though.

ScottPJones commented 9 years ago

I think that's easier to explain to people (it's the approach we took for -0.0, and didn't have any complaints).

StefanKarpinski commented 9 years ago

@JeffBezanson, if you're going to call something crazy, it's helpful if you explain why you think it's crazy. Is it the ordering of NaNs that you object to? I have to assume so since you've said that making isequal(-0.0, 0.0) seems acceptable to you. The NaNs part is not that necessary, but it would make sorting floats much simpler since you wouldn't need to use a stable partition algorithm when moving the NaNs to the end.

Having isequal(-0.0, 0.0) and isless(-0.0, 0.0) shouldn't be a problem for sorting since we never use isequal when sorting, we only use isless. In general, it's fine for isequal to be more fine-grained than isequal, but you want isequal and hash to agree in the sense that isequal(x,y) implies hash(x) == hash(y) and the inverse implication holds except for collisions, which should be hard to construct.

I really dislike the idea of converting -0.0 to 0.0 upon using it as a key value. It would then be the only value we convert to a different value of the same type when using it as a key. I dislike the idea of disallowing NaN as a key value even more. That smacks of Matlab-style pointless exceptionalism.

JeffBezanson commented 9 years ago

I did say why:

The idea of isequal and isless was to have well-behaved equal and less predicates, obeying as many of the usual mathematical properties as possible.

I think it's fair to say that ideally, we would only have == and <. It's frustrating to add an extra set of comparison functions, and still not have them obey expected properties.

The idea is not to disallow NaN specifically, but to sanity check that x == x holds. Of course NaN is the only value we know of that violates that, but it seems more justifiable than checking isnan.

ScottPJones commented 9 years ago

If you have a SQL style null value, it needs the same treatment, as x != x.

StefanKarpinski commented 9 years ago

I disagree that the point of isequal and isless is to "obey as many of the usual mathematical properties as possible". The point of isequal is to be used for hashing while the point of isless is to be used for ordering things – if they fail at this then you'd need yet another pair of relations for those purposes. The whole notion of "obey the expected properties" is ill defined – different people expect different things. I think that we should make sure that isequal and isless are good for hashing and ordering and live with == and < as the day-to-day mathematical operators, despite their IEEE oddities.

malmaud commented 9 years ago

In this democracy that is Julia, I'll vote against disallowing NaN as a key. I'm worried that it will lead to all kinds of special-casing in user code; eg, I do a group-by on a dataframe column, some of whose values are NaN. I'm expecting a dictionary back whose keys are the unique value in that column, but that would no longer be possible.

StefanKarpinski commented 9 years ago

I agree – disallowing NaN as a hash key would be a disaster.