JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.45k stars 5.46k forks source link

make numbers non-iterable? #7903

Closed StefanKarpinski closed 7 years ago

StefanKarpinski commented 10 years ago

@StephenVavasis has pointed out some rather confusing behavior of the in operator, including:

julia> VERSION
v"0.3.0-rc2+12"

julia> x = IntSet([3,5])
IntSet([3, 5])

julia> in(3,x)
true

julia> in(x,3)
false

julia> in("abc",19)
false

julia> in(19,"abc")
false

Worse still is this:

julia> 97 in "abc"
true

This issue is to discuss what, if anything, we can do to reduce some of this confusion.

JeffBezanson commented 10 years ago

x in y is pretty simple: does x == any element iterated by y.

The issue is not specific to in; 97 == 'a' is true.

I do think it would be good to make Char not an integer type. In theory having numbers not be iterable might be ok, but I think it will be very annoying in practice.

see also #5844

StephenVavasis commented 10 years ago

So I guess the problem is not specifically with `in' but with ==. I raised the issue that apparently meaningless uses of 'in' do not generate errors because I made silly blunder with 'in' in my project that did not generate an error message. For a similar reason (ease of debugging) the following uses of == should also generate errors, no?

julia> [3,4,5] == ["x","y"] false

julia> IntSet() == ["a"=>7] false

julia> 9 == Int64 false

julia>

JeffBezanson commented 10 years ago

Our == is total, defined on all pairs of values. I find this convenient, and I think it is quite difficult to decide exactly which cases to exclude. On Aug 7, 2014 8:15 PM, "StephenVavasis" notifications@github.com wrote:

So I guess the problem is not specifically with `in' but with ==. I raised the issue that apparently meaningless uses of 'in' do not generate errors because I made silly blunder with 'in' in my project that did not generate an error message. For a similar reason (ease of debugging) the following uses of == should also generate errors, no?

julia> [3,4,5] == ["x","y"] false

julia> IntSet() == ["a"=>7] false

julia> 9 == Int64 false

julia>

— Reply to this email directly or view it on GitHub https://github.com/JuliaLang/julia/issues/7903#issuecomment-51549500.

StephenVavasis commented 10 years ago

Jeff,

I guess it's OK to make == total, but there also ought to be a restricted version of == that requires the two operands to have the same type in order to catch programmer blunders. I would guess that it is rare among scientific codes to have a need for == with unequal types. Is there such a restricted version available in Julia? (And is there a similarly restricted version of 'in'?)

-- Steve

On Thu, 7 Aug 2014, Jeff Bezanson wrote:

Our == is total, defined on all pairs of values. I find this convenient, and I think it is quite difficult to decide exactly which cases to exclude. On Aug 7, 2014 8:15 PM, "StephenVavasis" notifications@github.com wrote:

So I guess the problem is not specifically with `in' but with ==. I raised the issue that apparently meaningless uses of 'in' do not generate errors because I made silly blunder with 'in' in my project that did not generate an error message. For a similar reason (ease of debugging) the following uses of == should also generate errors, no?

julia> [3,4,5] == ["x","y"] false

julia> IntSet() == ["a"=>7] false

julia> 9 == Int64 false

julia>

— Reply to this email directly or view it on GitHub https://github.com/JuliaLang/julia/issues/7903#issuecomment-51549500.

— Reply to this email directly or view it onGitHub.[7881863__eyJzY29wZSI6Ik5ld3NpZXM6QmVhY29uIiwiZXhwaXJlcyI6MTcyMzA3NzY0NywiZGF0YSI6eyJ pZCI6MzkxNDA5MjB9fQ==--26abb0bc2f6cb83bf6826a0450d494ff66b93a18.gif]

timholy commented 10 years ago

Three =:

julia> 5 == 5.0
true

julia> 5 === 5.0
false
jey commented 10 years ago

However that is for object identity, not equality. I.e. (zeros(3) === zeros(3)) is false

On Thursday, August 7, 2014, Tim Holy notifications@github.com wrote:

Three =:

julia> 5 == 5.0true julia> 5 === 5.0false

— Reply to this email directly or view it on GitHub https://github.com/JuliaLang/julia/issues/7903#issuecomment-51553058.

tkelman commented 10 years ago

One option is you can pick your favorite unicode equality-resembling symbol from this list https://github.com/JuliaLang/julia/blob/31eb8d432bc69f617091fbc0406d3aab442360e9/src/julia-parser.scm#L13 that is parsed with the same precedence as == but open for user definitions, and make that operator error on inputs of different types.

StephenVavasis commented 10 years ago

Earlier I wrote that it seems OK for == to work for all possible operands, but now I changed my mind. The (small) increase in expressive power does not offset the potential for enabling programmer blunders. One important mission of a programming language is to help prevent the programmer from shooting himself/herself in the foot, and in this case Julia needlessly fails to close a loophole.

About 25 years ago when I was a CS prof at Cornell, the issue came up (again) whether to switch our introductory programming course to C, and our faculty unanimously rejected the idea (again) for many reasons. One of the reasons was that we did not savor the possibility of undergrads lined up outside the TA office for help with their assignments because they wrote "if (x=0)" in their homework (instead of "if (x==0)") . The same rationale still applies to Julia 25 years later; it should not be possible to compile a program with in(x,3) where x is an IntSet, nor a program with x==t, where x is an Int and t is a Dict.

nalimilan commented 10 years ago

It seems it would be better to add restrictions to in than to ==. For example, in could require that the second argument is a collection rather than a scalar.

I also agree that Char shouldn't be considered as an integer type. There's Uint8 for this use case.

JeffBezanson commented 10 years ago

I find the case for restricting in much more convincing than for ==.

With ==, people will write code to look for a certain value, like x == 0. In a very generic context, you might not know what x could be. This applies even more for sentinel values, e.g. x == :none. It would be too much of a gotcha to require also checking that the comparison will be valid, e.g. if isa(x,Symbol) && x == :none. Also imagine a Dict{Any,Any}, where arbitrary keys need to be compared.

In other words, it would be too difficult to get back the current behavior. You would need something like if comparable(x,y) && x==y, but it's not clear how to write comparable. In contrast, it is currently fairly easy to add extra restrictions if you want.

ckhroulev commented 10 years ago

Another (unrelated) issue in() seems to have is that in(x, s::Set) calls haskey to check equality (see set.jl:16), ignoring user-defined == and isequal.

Please let me know what you think about this, and feel free to move this elsewhere if appropriate.

Here's an example:

julia> VERSION
v"0.2.1"

julia> immutable Edge # edges of an undirected graph
       a :: Integer
       b :: Integer
       end

julia> ==(a::Edge, b::Edge) = (a.a == b.a && a.b == b.b) || (a.a == b.b && a.b == b.a)
== (generic function with 47 methods)

julia> import Base.isequal

julia> isequal(a::Edge, b::Edge) = a == b
isequal (generic function with 34 methods)

julia> edges = [Edge(1,2), Edge(2,3), Edge(3,1), Edge(1,3)]
4-element Array{Edge,1}:
 Edge(1,2)
 Edge(2,3)
 Edge(3,1)
 Edge(1,3)

Now, compare

julia> in(Edge(2,1), edges)
true

and

julia> in(Edge(2,1), Set(edges))
false

I did not expect this.

This also breaks unique (try unique(edges)), which uses in(x, s::Set) internally.

In this particular case a simple workaround is to define the (inner) constructor

Edge(a::Integer, b::Integer) = a < b ? new(a,b) : new(b,a)

and use default == and isequal, but a solution like this one may not be possible in for other user-defined types. (This reminds me of a discussion of the relationship between "a == b" and "hash(a) == hash(b)".)

PS: As far as I can tell recent code in the master branch should show the same behavior.

JeffBezanson commented 10 years ago

You also need to implement hash for sets and dicts to work properly with user-defined ==.

jey commented 10 years ago

Hm, perhaps Set should be called HashSet to preserve the mathematical meaning of "Set"? Overall, the mathy types aren't fully internally consistent, IMO.

JeffBezanson commented 10 years ago

Could you elaborate on what is a "mathy" type, and what the problems are?

StephenVavasis commented 10 years ago

Let me make the following proposal: instead of redefining ==, how about if you redefine isequal(.,.) so that it is valid only when its operands are the same type (of course extensible by the programmer if necessary). Then you redefine 'in' to apply isequal instead of ==. Finally, you make it clear in the documentation that isequal(.,.) is specifically intended for use with containers (the documentation already says that it is useful for sorting). Furthermore, containers that use isequal might impose additional restrictions on keys. For example, Dict() and Set() impose the restriction that isequal(a,b) implies isequal(hash(a),hash(b)). For the containers I am developing in my current project (OrderedDict, MultiMap and OrderedSet), the restriction is that isequal(x,y) is true if and only if isless(x,y) and isless(y,x) are both false (i.e., isless defines a total order on the keys).

elextr commented 10 years ago

Isn't == defined in terms of isequal()? Forcing same type means you lose the ability to compare numbers of differing types but the same numeric values as equal.

StefanKarpinski commented 10 years ago

isequal is specifically for hashing, which has to support cross-type comparisons, so that's a nonstarter.

JeffBezanson commented 10 years ago

We have actually put a lot of time and thought into equality, and we've tried a couple variants. Having too many different kinds of equality (common lisp has at least 5) gets difficult to manage. Stefan developed a clever hash function that's able to efficiently hash equal numbers equally, even if they're of different types. Since then we've enjoyed the luxury of == and isequal being identical except for NaN and -0.0. isequal used to have various different behaviors, but at this point it's a bit difficult to remember what they were, since life is so much simpler now.

quinnj commented 10 years ago

Steven,

It would actually be pretty trivial to implement what you're after if you'd like it for you own code at least. I admit that though at first realizing different types could be equal was a little jarring, it's made a difference surprisingly little in actual code I've written. My two cents.

typesequal{T}(x::T, y::T) = x == y typesequal(x, y) = error("Types of $x and $y don't match")

-Jacob

On Sat, Aug 9, 2014 at 10:00 PM, Jeff Bezanson notifications@github.com wrote:

We have actually put a lot of time and thought into equality, and we've tried a couple variants. Having too many different kinds of equality (common lisp has at least 5) gets difficult to manage. Stefan developed a clever hash function that's able to efficiently hash equal numbers equally, even if they're of different types. Since then we've enjoyed the luxury of == and isequal being identical except for NaN and -0.0. isequal used to have various different behaviors, but at this point it's a bit difficult to remember what they were, since life is so much simpler now.

— Reply to this email directly or view it on GitHub https://github.com/JuliaLang/julia/issues/7903#issuecomment-51703834.

StephenVavasis commented 10 years ago

Jacob,

I understand that I could implement this myself, but that is not the point I'm trying to make. I'm trying to say that Julia is flawed if the statement in(x,3) does not cause an error message, where x is an IntSet, because it is failing its mission to catch programmer blunders. (See my earlier posting on this thread.)

I am working on a Julia project and would like to 'buy in' to Julia, but obviously I am concerned that it might not catch on, especially at the level of university instruction, if it has flaws like this.

Furthermore, I think that this conversation reveals a problem with Set and in(): they are trying to solve too many problems at once. The application of Set mentioned by Constantine in an earlier message in this thread is more likely to appear in a scientific code than an application in which Set holds objects of varying types. Therefore, either Set should be redesigned to be more appropriate for Constantine's example, or else it should be split into two separate container types, say ArbitrarySet and TypedSet.

StefanKarpinski commented 10 years ago

Blowing this one relatively minor issue up to the success or failure of the entire project is a bit hyperbolic. But yes, it would be nice to catch common usage mistakes better here – that's why I opened the issue to discuss it. Making isequal error on different types is not good, but I think that throwing an error for x in C unless isa(x, eltype(C)) might be acceptable. Unfortunately, we can't rely on all possible iterable things having an eltype method, so you can't just do that. You can do something like what collect does where you check applicability of eltype first and only do the test if it's applicable. I know Jeff hates that kind of thing though.

JeffBezanson commented 10 years ago

Checking isa(x, eltype(C)) is a decent idea. A while ago we added a fallback for eltype so it is always at least Any. It's difficult to express just how in should be restricted, if at all, so this suggestion has merit.

However, in my view it's always at least valid to ask whether an item is in some collection. It shouldn't be ok to ask 1 in {1.0,2.0}, but not 1 in Float64[1.0,2.0].

Python doesn't even have typed containers (except in numpy), and it has certainly caught on for university teaching. It deals with x in 3 by making integers not iterable, which is a reasonable choice.

What happened in julia is that integers can be used as indexes, of course, and we describe indexing as iterating over all of the indexes. Therefore integers became iterable. The rest is a natural consequence. If I had a value n that I knew was being used as an index, and I wanted to know whether it would touch position i, I could ask i in n. In that case i in 3 makes sense. So these things are not necessarily random mistakes, but tend to be consequences of certain very general ideas.

Therefore any realistic "fix" for these things needs to go back and revise the underlying ideas. For example, we could entertain a proposal where numbers are not iterable, and indexing works on some other principle. Maybe scalars should be promoted to 0-d arrays before being used as indexes, or maybe we iterate over not an index itself, but over an iterator returned by toindex(x), or something of that ilk. These approaches have so far struck me as more complex, and possibly less efficient, than just making numbers iterable. But I'm open to suggestions.

StephenVavasis commented 10 years ago

Stefan,

Yes, I agree, I exaggerated the significance of one issue. But it is not an exaggeration to say that preventing programmer blunders is a more important goal for a programming language than maximizing generality.

Second, I would like to point out that the following test appears to indicate that Julia is following the isequal rule rather than the == rule for set membership:

julia> s = Set([5.0, -0.0])
Set{Float64}({-0.0,5.0})
julia> in(5.0,s)
true
julia> in(0.0,s)
false
julia> 0.0 == -0.0
true
julia> isequal(-0.0,0.0)
false

Finally, let me put forward yet another proposal for fixing this problem: make an additional, more complicated constructor for Set that takes a function, like this:

Set(my_isequal, [ <initial entries> ])

or perhaps even two functions:

Set(my_isequal, my_hash, [<initial entries>])

C++ lets you do something analogous with containers. Then IntSet can behave like a Set where my_isequal is specified and would accept only Int arguments.

The in function can use the my_isequal associated with the container that is its second argument.

[edit formatting: @StefanKarpinski]

JeffBezanson commented 10 years ago

Sets and Dicts with custom comparison and hash functions would be a fine feature to have.

StefanKarpinski commented 10 years ago

Sets and Dicts with custom comparison and hash functions would be a fine feature to have.

I disagree. Getting the comparison and hash functions right is incredibly hard. If someone wants to do some different kind of comparison, the way to do it is to transform the keys explicitly before hashing or storing them and then use the normal comparison and hashing. This is simpler, more explicit, and doesn't fail in subtle and confusing ways, which is what would happen with custom comparison and hashing functions.

JeffBezanson commented 10 years ago

I agree that transforming keys is better than using a custom comparison function and I would tend to steer people towards that, but I'm not totally against the custom comparison approach. Maybe it could go in a package, along with things like OrderedDict.

StefanKarpinski commented 10 years ago

A subtler check for the element type is similar to what we do to check if a key is value for a typed Dict – check that isequal(x, convert(eltype(C), x)). This catches both the IntSet([3,5]) in 3 and the "abc" in 19 cases. It doesn't help with characters being considered integers, but that's a different issue. We'd want to catch no method errors for the conversion and render them as type mismatches.

JeffBezanson commented 10 years ago

Using the same check that Dict uses is a good thought. However Dicts only use this on assignment, not on lookup.

Maybe a more surgical approach is needed: only allow the second argument to in to be a Number if the first argument is too. We might also want to allow only in(::Char,::String) for strings.

Taken a bit further, this could argue for removing the fully-generic in, or adding a Collection type and having the generic in require it for the second argument.

StefanKarpinski commented 10 years ago

Dicts could also throw an error if you use a nonsensical key – i.e. do that same check on lookup.

JeffBezanson commented 10 years ago

I think it still might be too strict. It's also too container-type-dependent: a Dict with only Int keys should be representable as a Dict{Int,_} without causing problems, at least in the absence of mutation.

StephenVavasis commented 10 years ago

In an earlier message in this thread, Stefan wrote that it is 'incredibly hard' to write comparison and hash functions. At least for the common cases, it seems relatively straightforward to me. In the case of IntSet, the default comparison and hash work fine except with their arguments limited to Int. In the case of Edge in Constantine's earlier message, it also seems straightforward; Constantine already showed us the isequal method in his posting, and the hash method would either hash (a,b) or (b,a), depending on whether a<=b, using the built-in hash function.

StefanKarpinski commented 10 years ago

It wouldn't be incredible if it was easy to believe how hard it is ;-)

It's fairly straight forward if you're comparing values within a given type – any deterministic function can serve as a hash; as long as your equality relation is compatible with that hash, things are fine. It's when you introduce cross-type comparisons that life gets complicated. Consider writing a cross-type == for Ints and Float64s that's actually an equivalence relation. Hint: converting to Float64 and then checking equality is not transitive. Given a cross-type equivalence relation, how do you compute hash values that are consistent with that relation? This requires a type-agnostic (i.e. strictly value-based) hashing algorithm – but it also needs to be reasonably fast. Maybe these concerns are too esoteric for user-defined hashing and comparisons, but if they aren't handled, the functions will be subtly broken.

StephenVavasis commented 10 years ago

Stefan,

Indeed, it is impressive that you figured out how to implement a set that can contain elements of every possible Julia object and still have fast look-up with hashing. So I guess I retreat to my earlier proposal that there should be two set types, one concrete type like the current Set that can contain anything, and another type of set (an abstract type) whose contents are restricted according to type. This latter type of set is important for catching errors with 'in' and solving Constantine's problem. IntSet should be a concrete implementation of the latter type.

With regard to your earlier suggestion about checking for errors via the 'eltype' function, would this check take place at compile time, or at the time that 'in' is executed? It seems like the former is preferable for performance reasons.

StefanKarpinski commented 10 years ago

Thanks, @StephenVavasis. The eltype check approach would semantically occur at runtime, but the check would be eliminated a lot of the time by type inference. For example, if x::eltype(C) and type inference can prove it, then the convert(eltype(C), x) is a no-op. It's harder to eliminate the isequal check, but I suspect that when the conversion is a no-op, the whole check will disappear. In these simple cases, the check is completely eliminated:

julia> ck(x,C) = isequal(x, convert(eltype(C), x))
ck (generic function with 1 method)

julia> @code_llvm ck(1, [1,2,3])

define i1 @"julia_ck;22392"(i64, %jl_value_t*) {
top:
  ret i1 true, !dbg !8590
}

julia> @code_llvm ck(1.5, [2.5, 3.5])

define i1 @"julia_ck;22405"(double, %jl_value_t*) {
top:
  ret i1 true, !dbg !8629
}

julia> @code_llvm ck('x', "abc")

define i1 @"julia_ck;22399"(i32, %jl_value_t*) {
top:
  ret i1 true, !dbg !8611
}
tkelman commented 8 years ago

I guess to evaluate this, we'd need to try removing start, next, and done for numbers and see what breaks.

StefanKarpinski commented 8 years ago

@mbauman, have you merged the changes you had that make array indexing not rely on this?

mbauman commented 8 years ago

Yes, array indexing no longer requires iterating over numbers — but it does index into them.

I think there's another issue somewhere where I started looking into deprecating this… but quickly hit lots of comprehensions and loops like [x[i,j] for i in 1, j in 1:size(x, 2)]. I think it's fairly easy to do now, it'd just take some janitorial work.

Edit: https://github.com/JuliaLang/julia/pull/10331#issuecomment-77630144

StefanKarpinski commented 8 years ago

I only found a handful of examples of that:

base/abstractarray.jl
611:hcat{T}(X::T...)         = T[ X[j] for i=1, j=1:length(X) ]
612:hcat{T<:Number}(X::T...) = T[ X[j] for i=1, j=1:length(X) ]

base/arraymath.jl
380:transpose(x::AbstractVector) = [ transpose(v) for i=1, v in x ]
381:ctranspose{T}(x::AbstractVector{T}) = T[ ctranspose(v) for i=1, v in x ] #Fixme comprehension

base/range.jl
372:ctranspose(r::Range) = [x for _=1, x=r]
StefanKarpinski commented 8 years ago

After some discussion on a triage call, it's been decided that we should take a crack at this, but everyone has too many things on their plate to do it right now. So if someone wants to do this, it's fairly straightforward: delete all the iterable interface methods for numbers and see what breaks; try fixing stuff that breaks and see how bad it is. Make a WIP PR to report progress and get feedback.

stevengj commented 8 years ago

As I noted in #15032, I think broadcast still needs to work for scalar arguments (since it has to work for mixed scalar/array arguments anyway), but that can be preserved independent of whether scalars are iterable.

And size(x::Number) should still return ().

tkelman commented 8 years ago

Made a quick attempt to see what happened. Concatenation of some array literals has issues when you try to remove start(::Number), like here https://github.com/JuliaLang/julia/blob/e0e93fc85e899105351860172d37b18f51d29183/base/base64.jl#L35 or the following line from uv_constants.jl

const uv_handle_types = [:UV_ASYNC, :UV_CHECK, :UV_FS_EVENT, :UV_FS_POLL, :UV_HANDLE, :UV_IDLE, :UV_NAMED_PIPE, :UV_POLL, :UV_PREPARE, :UV_PROCESS, :UV_STREAM, :UV_TCP, :UV_TIMER, :UV_TTY, :UV_UDP, :UV_SIGNAL, :UV_FILE]
# ...
    handles = [:UV_UNKNOWN_HANDLE; uv_handle_types; :UV_HANDLE_TYPE_MAX; :UV_RAW_FD; :UV_RAW_HANDLE]

edit: whoops, just needed

--- a/base/abstractarray.jl
+++ b/base/abstractarray.jl
@@ -169,6 +169,11 @@ function copy!(dest::AbstractArray, src)
     return dest
 end

+function copy!(dest::AbstractArray, src::Number)
+    dest[1] = src
+    return dest
+end
+
 # if src is not an AbstractArray, moving to the offset might be O(n)
 function copy!(dest::AbstractArray, doffs::Integer, src)
     doffs < 1 && throw(BoundsError(dest, doffs))
JeffBezanson commented 8 years ago

Why would this affect arrays of symbols?

tkelman commented 8 years ago

Probably nothing specific to the symbol eltype, those are just needed during bootstrap. vcat needs length and copy! on its arguments https://github.com/JuliaLang/julia/blob/e0e93fc85e899105351860172d37b18f51d29183/base/abstractarray.jl#L163-L170, see my edit above for a new method that gets past that

JeffBezanson commented 8 years ago

I'm still confused; Symbols don't have length or iteration, so how could they have worked with vcat if that's the case?

tkelman commented 8 years ago

Found it. collect(1) here: https://github.com/JuliaLang/julia/blob/e0e93fc85e899105351860172d37b18f51d29183/base/abstractarray.jl#L706

Another issue is https://github.com/JuliaLang/julia/blob/e0e93fc85e899105351860172d37b18f51d29183/base/strings/util.jl#L132 calling first on an integer result from search, I guess we could leave first and last alone, I was trying to get greedy.

tkelman commented 8 years ago

Defining collect(x::Number) = [x] can also get through bootstrap. See https://gist.github.com/5cc35991da822df921d4743b9799eadf for a diff that bootstraps and a log of all the test failures. Turns out to be pretty deep in a lot of places.

JeffBezanson commented 8 years ago

Some of those are easy to fix. I spot:

vtjnash commented 8 years ago

Since it seems this is going to be non-trivial to remove, and it's not blocking anyone from using Julia, I'm going to reassign this for v0.6 so we can focus on real issues.

JeffBezanson commented 8 years ago

I'm fine with that.

mschauer commented 8 years ago

To give an opposite standpoint I think that if a number has size(), it should also be iterable and possible to collect a number but I guess that I am too late to the party.