Closed StefanKarpinski closed 6 years ago
Wouldn't this necessitate two versions of Dicts / Sets, one using isequal and the other using egal for containment? I would find this behavior to be very confusing. For example, is Set([1,1.0]) a single-element set?
Edit. I see we already have this behavior.
This is another case that can be justified on the basis that the IEEE standard specifies the behavior of ==, but not of in. In other words, the standard doesn't say an in function must use IEEE == to compare elements. Arguably the IEEE NaN behavior should be used as rarely as possible. The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.
@jakebolewski, Set([1,1.0]) is already a one-element set:
julia> Set([1,1.0])
Set([1.0])
This is a pretty straightforward consequence of using Dict as the foundation for Set. If we wanted the egal behavior, we would need an ObjectIdSet type to correspond to ObjectIdDict.
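For comparison, Python's built-in set collapses ==-equal elements the same way (a quick Python sketch, not Julia):

```python
s = {1, 1.0}

# 1 == 1.0 and hash(1) == hash(1.0), so the set keeps only one element
assert len(s) == 1

# the first-inserted element wins: the survivor is the int 1
assert type(next(iter(s))) is int
```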
The case of -0.0 is trickier; 0 in A not finding a -0.0 could be a problem.
This kind of sucks, but I think we have to live with it – it's either that or introduce yet another kind of equality, which we can do over my cold, dead body.
We definitely don't need another kind of equality. I think we will opt to keep you alive instead.
I'm relieved.
The other option would be to make -0.0 and 0.0 the same for isequal and hashing, but that seems bad too.
Oh, we can't make isequal(-0.0, 0.0) since that would mean that -0.0 wouldn't sort before 0.0.
Is it really required that isless be consistent with isequal? Anyway, with several concepts of equality, you already need to look at the docs to check what its behavior is.
As a data point, MATLAB's isequal and ismember consider -0.0 and 0.0 as equal. R doesn't have isequal, but its identical function allows for either behavior via an argument, and its %in% function considers them as equal. Python's in does the same. So it doesn't look like distinguishing negative and positive zero is a compelling use-case: it's fine if it's only supported in special dict/set types. On the contrary, distinguishing them by default has the potential of introducing very subtle bugs for the 99% of users (source: classified survey) who didn't even think zero is not always zero.
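To make the Python data point concrete, this is how -0.0 behaves there (Python, for illustration):

```python
import math

# IEEE ==: the two zeros compare equal, and equal values must hash alike
assert 0.0 == -0.0
assert hash(0.0) == hash(-0.0)

# membership uses ==, so 0.0 is found in a list holding only -0.0
assert 0.0 in [-0.0]

# the sign bit is still observable when you ask for it explicitly
assert math.copysign(1.0, -0.0) == -1.0
assert str(-0.0) == "-0.0"
```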
Is it really required that isless be consistent with isequal?
It's not strictly necessary, but I think anything else would be kind of a disaster, imo.
Interesting data point:
In [19]: np.nan == np.nan
Out[19]: False
In [20]: np.nan in [np.nan]
Out[20]: True
In [21]: -0.0
Out[21]: -0.0
In [22]: 0.0 in [-0.0]
Out[22]: True
I'm not sure how Python's in is handling equality here.
It's not strictly necessary, but I think anything else would be kind of a disaster, imo.
I'm not so sure. -0.0 and 0.0 are sorted next to one another, so considering them as equal does not go against the ordering; it just loses a small bit of information. (I guess there is theory to express that more generally.) In what cases would that be an issue?
One thing worth considering is making isless match the IEEE totalOrder predicate: the main differences are that NaNs with different bit patterns are distinct, and negatively-signed NaNs precede -Inf. In a certain sense, this is the ordering corresponding to ===.
This could be done via

function totalorder(x::Float64, y::Float64)
    xi = reinterpret(Int64, x)
    yi = reinterpret(Int64, y)
    (xi < yi) $ ((xi < 0) & (yi < 0))
end
(which also seems to be faster than our current isless).
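The same bit trick can be sketched in Python, using struct to reinterpret the IEEE-754 bits (totalorder here just mirrors the snippet above; it is not a library function):

```python
import struct

def totalorder(x: float, y: float) -> bool:
    """IEEE-style total-order "less than" via integer comparison of the bits."""
    # reinterpret each float's bits as a signed 64-bit integer
    xi, = struct.unpack("<q", struct.pack("<d", x))
    yi, = struct.unpack("<q", struct.pack("<d", y))
    # two's-complement ordering of negative floats runs backwards,
    # so flip the comparison when both operands have the sign bit set
    return (xi < yi) != ((xi < 0) and (yi < 0))

# -0.0 precedes 0.0, and a positively-signed NaN comes after +Inf
assert totalorder(-0.0, 0.0)
assert totalorder(float("inf"), float("nan"))
assert totalorder(-float("inf"), -1.0)
```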
That sort order has always struck me as weird and useless – why would I want some NaNs to come at the beginning and some at the end? If we started doing this, I feel like we'd almost need to start printing NaNs with a sign, which would also be strange. The only advantage I can see to the IEEE ordering is that you can implement it efficiently with integer comparisons.
you can implement it efficiently with integer comparisons.
Yep, that's how to think like a 1980s computer scientist :)
Alright, so I would be ok with making isless a finer-grained ordering than isequal, namely:

- isequal(-0.0, 0.0) but isless(-0.0, 0.0).
- isequal(nan1, nan2) for any two NaN values, but have isless order NaNs by bit pattern.
- isequal and isless.

I'm not inclined to have some NaNs negative while others are positive. That's just silly and useless.
Yes, that might be OK. Have to think about it.
By the way, if there are any Python experts around, I really would like to know what is happening in the example I posted above. It seems in is not calling __eq__ for np.nan. However, it doesn't use is either, as this example shows:
In [1]: 1 is 1.0
Out[1]: False
In [2]: 1 == 1.0
Out[2]: True
In [3]: 1 in [1.0]
Out[3]: True
EDIT: Ok, my guess is that in tries is first, and returns true on an identity match, before calling __eq__.
Oh, Python, you so crazy.
Python's in is implemented by PySequence_Contains, which defaults to using PyObject_RichCompareBool(obj, item, Py_EQ), which calls the tp_richcompare slot of the type object, which for floating-point types turns into float_richcompare (whose comments begin "Comparison is pretty much a nightmare") and which uses the C == for two floating-point variables. However, PyObject_RichCompareBool first checks for object identity, in which case it just skips the comparison. (See also the documentation for PyObject_RichCompareBool, which actually mentions this.)
So, Jeff's guess is correct.
(Yes, I've spent far too much time looking at the CPython codebase for someone who never programs in Python.)
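That identity shortcut is easy to observe directly (a Python sketch; LoudEq is a made-up class for illustration):

```python
nan = float("nan")

# IEEE ==: NaN is not equal to itself...
assert nan != nan

# ...but `in` checks `is` before `==`, so the same NaN object is found
assert nan in [nan]

class LoudEq:
    """Raises if __eq__ is ever consulted."""
    def __eq__(self, other):
        raise RuntimeError("__eq__ called")

x = LoudEq()
# identity matches first, so __eq__ is never called
assert x in [x]
```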
Clearly, I used up my "oh, Python, you so crazy" way too early in this thread.
This is an excerpt from Python's documentation for comparison and membership operators:
>>> help('in')
...
Comparison of objects of the same type depends on the type:
...
* The values "float('NaN')" and "Decimal('NaN')" are special. They
  are identical to themselves, "x is x", but are not equal to
  themselves, "x != x". Additionally, comparing any value to a
not-a-number value will return "False". For example, both "3 <
float('NaN')" and "float('NaN') < 3" will return "False".
...
The operators "in" and "not in" test for membership. "x in s"
evaluates to true if *x* is a member of *s*, and false otherwise. "x
not in s" returns the negation of "x in s". All built-in sequences
and set types support this as well as dictionary, for which "in" tests
whether the dictionary has a given key. For container types such as
list, tuple, set, frozenset, dict, or collections.deque, the
expression "x in y" is equivalent to "any(x is e or x == e for e in
y)".
For the string and bytes types, "x in y" is true if and only if *x* is
a substring of *y*. An equivalent test is "y.find(x) != -1". Empty
strings are always considered to be a substring of any other string,
so "" in "abc" will return "True".
For user-defined classes which define the "__contains__()" method, "x
in y" is true if and only if "y.__contains__(x)" is true.
For user-defined classes which do not define "__contains__()" but do
define "__iter__()", "x in y" is true if some value "z" with "x == z"
is produced while iterating over "y". If an exception is raised
during the iteration, it is as if "in" raised that exception.
Lastly, the old-style iteration protocol is tried: if a class defines
"__getitem__()", "x in y" is true if and only if there is a non-
negative integer index *i* such that "x == y[i]", and all lower
integer indices do not raise "IndexError" exception. (If any other
exception is raised, it is as if "in" raised that exception).
The operator "not in" is defined to have the inverse true value of
"in".
The operators "is" and "is not" test for object identity: "x is y" is
true if and only if *x* and *y* are the same object. "x is not y"
yields the inverse truth value.
So even while 1 and 1.0 are equal, they are not the same object in memory:
In [1]: 1 is 1.0
Out[1]: False
In [2]: id(1) == id(1.0)
Out[2]: False
@Ismael-VC, so their documentation is wrong, because it effectively uses x is z or x == z, not x == z.
Actually, the radix sort function in SortingAlgorithms.jl does use integer comparisons (and some shenanigans) for sorting floating point numbers.
I think -0.0 is more important than NaN: code that tries to compare with NaN is always going to have problems. Finding 0 in an array containing -0.0 would be what most people want. So I think we have to use either "== or isequal" or just ==. Using a specific function is definitely simpler, so I say we keep the current behavior.
How would you implement that efficiently for Sets? I feel that as containers arrays should behave like ordered multisets, which to me implies using isequal.
I think it's mostly ok for different containers to use different tests. Not finding 0 in an array with -0.0 could cause serious problems.
We could also make isequal(-0.0, 0.0) true.
Yes, @StefanKarpinski's plan at https://github.com/JuliaLang/julia/issues/9381#issuecomment-67361516 sounds the best solution.
Thanks for finding that comment, @nalimilan. To quote it and add to it:

- isequal(-0.0, 0.0) but isless(-0.0, 0.0).
- isequal(nan1, nan2) for any two NaN values, but have isless order NaNs by bit pattern.
- isequal and isless.
- isequal for collection containment checking (arrays and sets).

What I dislike about that is that isequal and isless are supposed to be the sane comparisons. At this point I have half a mind to make NaN==NaN and -0.0==0.0 and remove isequal. Another option is to disallow NaN as a hash key.
Does the proposed change make them insane comparisons?
@StefanKarpinski Pun intended? I think he meant "the same comparisons". I'm not sure that's really a requirement, though. And wouldn't NaN == NaN go against common expectations?
wouldn't NaN == NaN go against common expectations?
It's a funny way of putting it, I see what you mean but still, "common expectations" might include reflexivity of equality :-). I don't do fp math so I don't know how much people are attached to this part of the IEEE semantics.
Does the proposed change make them insane comparisons?
I would say so, yes. I definitely meant "sane" and not "same". The idea of isequal and isless was to have well-behaved equal and less predicates, obeying as many of the usual mathematical properties as possible.
I think I'd be ok with isequal(-0.0,0.0) and !isless(-0.0,0.0) though. Then isequal only differs for NaN.
In #8463, there was also some discussion of making hash(0.0) == hash(-0.0): this would also make the hash function slightly simpler.
I think I'd be ok with isequal(-0.0,0.0) and !isless(-0.0,0.0) though.
I must be missing something, but wouldn't this cause problems with sorting?
wouldn't this cause problems with sorting?
I suppose you'd get zeros and negative zeros in an undefined order. We could provide an IEEE totalOrder predicate as well. But look, IEEE says -0.0 and 0.0 are equal, so what do they expect?
Looking back at the original motivation of the issue, I think one has to distinguish the case of deliberately searching for NaN using NaN in Y, vs. writing x in y generically and considering the behavior when x happens to be a NaN. In the first case, you really need to use isnan. In the second case, it doesn't matter what happens --- any program that ignores NaNs is going to get weird answers if NaNs arise somewhere.
It would help to disallow NaNs as dict keys. It would make sense to have a test like

if !(key == key)
    error("key cannot be reliably identified and so cannot be inserted")
end

Then we could really start considering removing isequal.
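The same self-equality guard can be sketched in Python (checked_insert is a hypothetical helper, not an existing API):

```python
def checked_insert(d: dict, key, value):
    """Insert key => value, rejecting keys that are not equal to themselves."""
    # NaN (and SQL-style nulls) fail x == x, so such a key could never be
    # reliably looked up again by value
    if not key == key:
        raise ValueError("key cannot be reliably identified and so cannot be inserted")
    d[key] = value

d = {}
checked_insert(d, 1, "ok")
assert d[1] == "ok"

try:
    checked_insert(d, float("nan"), "bad")
    raised = False
except ValueError:
    raised = True
assert raised
```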
Whatever IEEE says about -0.0 and 0.0 being ==, having you get a many-to-one relationship when inserting something into a set or dict would not be a good thing, IMO. That's losing information. NaN seems kind of like null, do you allow a null key in a set or dict? Also, all NaN's are not alike. I do like the above test to give an error - I think it is consistent with the way null should be handled, i.e. a null value is not == to anything else, even another null, so the above test would handle that consistently.
Of all the possible issues here, I find it especially easy to imagine problems caused by distinguishing -0.0 and 0.0. Say an array has a -0.0, then somebody needs to ask "are there any zeros here?" It's definitely surprising if 0 in a doesn't find the zero.
With -0.0, you have to be really careful to propagate signs (for example this makes the code in complex.jl much more...complex). If it didn't exist, you'd have to be really careful computing 1/x where x might be -Inf. To me this is a wash.
OK, then why not just say that -0.0 is converted to 0.0 when it is used as a key? Instead of trying to do something tricky when looking for it (making hashes equal even though the bit patterns are different, etc.)
According to https://github.com/JuliaLang/julia/issues/9381#issuecomment-114568028 it's not any trouble to make -0.0 and 0.0 hash equally, given what our hash function needs to do anyway.
Normalizing -0.0 to 0.0 on key insertion might be a good idea though.
I think that's easier to explain to people (it's the approach we took for -0.0, and didn't have any complaints).
@JeffBezanson, if you're going to call something crazy, it's helpful if you explain why you think it's crazy. Is it the ordering of NaNs that you object to? I have to assume so, since you've said that making isequal(-0.0, 0.0) seems acceptable to you. The NaNs part is not that necessary, but it would make sorting floats much simpler since you wouldn't need to use a stable partition algorithm when moving the NaNs to the end.
Having isequal(-0.0, 0.0) and isless(-0.0, 0.0) shouldn't be a problem for sorting, since we never use isequal when sorting, we only use isless. In general, it's fine for isless to be more fine-grained than isequal, but you want isequal and hash to agree in the sense that isequal(x,y) implies hash(x) == hash(y), and the inverse implication holds except for collisions, which should be hard to construct.
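The isequal/hash agreement matters because hash-based containers look up by hash first. A Python sketch of what breaks when the contract is violated (BadPoint is a made-up class for illustration):

```python
class BadPoint:
    """Equal by value, but hashed by identity: violates the eq/hash contract."""
    def __init__(self, x):
        self.x = x
    def __eq__(self, other):
        return isinstance(other, BadPoint) and self.x == other.x
    def __hash__(self):
        return id(self)  # equal instances get different hashes

d = {BadPoint(1): "a"}

# the two keys are equal, yet the dict never finds the second one,
# because its different hash sends the lookup elsewhere
assert BadPoint(1) == BadPoint(1)
assert BadPoint(1) not in d
```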
I really dislike the idea of converting -0.0 to 0.0 upon using it as a key value. It would then be the only value we convert to a different value of the same type when using it as a key. I dislike the idea of disallowing NaN as a key value even more. That smacks of Matlab-style pointless exceptionalism.
I did say why:
The idea of isequal and isless was to have well-behaved equal and less predicates, obeying as many of the usual mathematical properties as possible.
I think it's fair to say that ideally, we would only have == and <. It's frustrating to add an extra set of comparison functions, and still not have them obey expected properties.
The idea is not to disallow NaN specifically, but to sanity-check that x == x holds. Of course NaN is the only value we know of that violates that, but it seems more justifiable than checking isnan.
If you have a SQL style null value, it needs the same treatment, as x != x.
I disagree that the point of isequal and isless is to "obey as many of the usual mathematical properties as possible". The point of isequal is to be used for hashing, while the point of isless is to be used for ordering things – if they fail at this then you'd need yet another pair of relations for those purposes. The whole notion of "obey the expected properties" is ill-defined – different people expect different things. I think that we should make sure that isequal and isless are good for hashing and ordering and live with == and < as the day-to-day mathematical operators, despite their IEEE oddities.
In this democracy that is Julia, I'll vote against disallowing NaN as a key. I'm worried that it will lead to all kinds of special-casing in user code; e.g., I do a group-by on a dataframe column, some of whose values are NaN. I'm expecting a dictionary back whose keys are the unique values in that column, but that would no longer be possible.
I agree – disallowing NaN as a hash key would be a disaster.
This is how Dicts work, so there's a case to be made. Came up in this discussion:
https://groups.google.com/forum/#!topic/julia-users/XZDm57yHc5M
Motivating examples:
It seems like in for arrays is the one that's out of line here.