JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.68k stars 5.48k forks source link

Broadcasting with Set returns an arbitrarily-ordered array #33282

Open tkf opened 5 years ago

tkf commented 5 years ago

Currently (Julia 1.4-DEV), broadcasting with Set produces a result that relies on iteration order of Set which is not meaningful:

julia> Set([1,2]) .+ Set([3,4])
2-element Array{Int64,1}:
 6
 4

This is due to the fallback to collect in broadcastable definition:

https://github.com/JuliaLang/julia/blob/9a8b2fd72b675bb8a5bf0322943ee9451787b86a/base/broadcast.jl#L664

A sensible definition may be to use set arithmetic (ref: https://github.com/JuliaMath/IntervalSets.jl/pull/55) such that, for example,

op.(set1, set2) == Set([op(x, y) for x in set1, y in set2])

holds. However, given that "join" on indices/keys is proposed as a generalized form of broadcasting #25904 (and special handling for singleton axes), it is not clear how taking "product" for broadcasting with Sets can fit in a bigger picture.

So, as a short-term solution, I propose to forbid broadcasting with Sets which can easily be done by defining broadcastable(::AbstractSet), as done with Dicts:

https://github.com/JuliaLang/julia/blob/9a8b2fd72b675bb8a5bf0322943ee9451787b86a/base/broadcast.jl#L665

This, of course, requires core devs to decide that this can be treated as a "minor change."

As a longer-term solution, does it make sense to use set arithmetic (which, I suppose, implementable using the style mechanism)? Is there a guiding principle in which it is natural to have "product" for Sets and "join" for associative data structures (including Arrays)?

dlfivefifty commented 5 years ago

I’m in favour of adding a SetStyle for broadcasting sets as set definition, for the reason it’s useful.

As a counterpoint, we sort of already have this with adjoints in the sense if x and y are vectors then

z in f.(x,y’)

if and only if z = f(x[k],y[j]).

tkf commented 5 years ago

As a counterpoint, we sort of already have this with adjoints

I would still treat this a join combined with a special treatment (= broadcasting) of a singleton axis. I added "(and special handling for singleton axes)" in the OP to clarify my point.

tkf commented 5 years ago

If we were to use set arithmetic for broadcasting with sets, I'd argue we should forbid mixing it with arrays. It may be tempting to "promote" arrays or any iterators as sets. However, it would produce different results depending on how you associate operators. For example, in

x :: Set
y :: Vector
z :: Vector

a = x .+ y .+ z  # equivalent to (x .+ y) .+ z
b = x .+ identity(y .+ z)

a and b would have different results. This is very odd given that + is associative.

dlfivefifty commented 5 years ago

I would propose the following:

  1. f.(x::Set) should return a Set, so the following is wrong:
    
    julia> x = Set([1,2,3])
    Set([2, 3, 1])

julia> exp.(x) # Should return Set([exp(2),exp(3),exp(1)]) 3-element Array{Float64,1}: 7.38905609893065 20.085536923187668 2.718281828459045

2. `f.(x::Set, 2)` should also return a `Set`
3. `f.(x::Set, y::Set)`, `f.(x::Set, y::Vector)` etc. should probably just error out 

So the rough implementation would be:
```julia
struct SetStyle <: BroadcastStyle end
BroadcastStyle(::Type{<:AbstractSet}) = SetStyle()
BroadcastStyle(::SetStyle, ::DefaultArrayStyle{0}) = SetStyle()
BroadcastStyle(::SetStyle, ::SetStyle) =error("Cannot broadcast sets together")
BroadcastStyle(::SetStyle, ::AbstractArrayStyle) = error("Cannot broadcast sets and arrays")

copy(b::Broadcast{SetStyle}) = Set(broadcast(b.f, collect.(b.args)))
dlfivefifty commented 5 years ago

If we really want to argue that f.(x::Set, y::Set) should be defined, we could say that a set behaves like a singleton/scalar and by the definition of broadcast

Singleton and missing dimensions are expanded to match the extents of the other arguments by virtually repeating the value

which would essentially be equivalent to set functions. But I think that may be pushing things....

nalimilan commented 5 years ago

See also previous attempt at making this an error at https://github.com/JuliaLang/julia/pull/28491. Not sure why it wasn't merged at the time, since map was deprecated (https://github.com/JuliaLang/julia/pull/26980).

Cc: @mbauman

tkf commented 5 years ago

@nalimilan Thank you. Reviving #28491 seems to be the easiest short-term option.

dlfivefifty commented 5 years ago

The decision to deprecate map seems odd to me, the discussion in https://github.com/JuliaLang/julia/issues/26359 didn't really clarify the problem.

Based on those issues and PRs however the current behaviour is desired. Unfortunately its not very compatible with uncountable sets a la IntervalSets.jl , and its not clear what the analogue of map or broadcast should be....

tkf commented 5 years ago

we could say that a set behaves like a singleton/scalar and by the definition of broadcast

@dlfivefifty I'm thinking that we don't have to follow the current definition strictly if there is a way to generalize the definition to have useful broadcasting with sets. Taking product is certainly useful and probably OK provided that mixing sets with arrays and dicts is an error but I'd be interested in what others think about this.

dlfivefifty commented 5 years ago

I suppose the issue is that there is a desire for broadcast(f, a, ...) to return the same thing as broadcast(f, collect(a), ...). With set products that wouldn't be true. But for other cases I would say it is true, for a certain point of view: e.g., x in broadcast(f, Set([1,2,3]), ...) if and only if x in broadcast(f, [1,2,3], ...) so the two "sets" are the same.

vtjnash commented 5 years ago

iteration order of Set which is not meaningful

It's not defined, but it is meaningful, and so it's not "broken". For example, here's a fairly useless one-liner for doing random things to the values in a dictionary by iterating 3 sets:

julia> D = Dict(1 => 2, 3 => 4, 5 => 6);
julia> Dict(keys(D) .=> tuple.(values(D), values(D) .^ 2))
Dict{Int64,Tuple{Int64,Int64}} with 3 entries:
  3 => (4, 16)
  5 => (6, 36)
  1 => (2, 4)

This also doesn't consider OrderedSet, where the order is defined.

not sure why it wasn't merged at the time, since map was deprecated

map was deprecated because it was doing a special case to make it return a Set which gave surprising results due to the failure to preserve order and count (#26359) and meant associativity was broken for Set. broadcast has currently generally fallen back to iteration (c.f. decision on Ref usage vs. scalarization. Emphasis on currently, since there are open issues proposing to use indexing more heavily.)

f.(x::Set) should return a Set

This is what map used to do, which is why that method was deprecated and not the other.

tkf commented 5 years ago

@vtjnash I think your observation is that current broadcasting definition is useful when the order has some meaning as in the case of KeySet? My main concern is when there is no meaning in the iteration order, as in Set.

dlfivefifty commented 5 years ago

I agree with @tkf. Broadcasting should only work between ordered collections (including KeySet), where a Set is unordered.

More broadly, it might be good to sort out what AbstractSet means. It seems that people are taking sets to be collections which store only one entry, but I would disagree and think the defining feature is implementing in. So I'd make the following definitions:

  1. A set is any type that implements in. This includes Set, Interval, and AbstractArray
  2. An enumerable set is any type with a countable number of entries which can be iterated over, but where the order may not be defined. This includes Count, Set, and AbstractArray (including infinite arrays) but not Interval.
  3. An ordered set is any enumerable set where the ordering is defined. This does not include Set but includes KeySet, AbstractArray, and Count.
  4. An indexible set is an ordered set which implements getindex. This includes AbstractArray, Broadcasted, but not KeySet.
  5. A shaped set is an indexible set with axes. This includes AbstractArray and probably Broadcasted.

So broadcasting is well-defined for shaped sets. If we assume unshaped ordered sets default to "vector-like", then we can extend the definition for any ordered sets. This leaves sets which are not ordered (Interval and Set). This should probably be an error.

Edit: forgot a key one:

\6. A uniquely enumerable set is an enumerable set where each element appears exactly once. This includes Set, KeySet, Count, and Range, but not Array.

tkf commented 5 years ago

I think definition 6 is closer to sets in math although they doesn't have to be enumerable. Definition 1 seems to be closer to what is called multiset or bag.

Definition 4 seems to be equivalent to what it's called associative data structure. Arraylike suggested in #32940 seems to cover definition 5.

But I guess you are trying to formalize the capabilities of containers? By maybe using traits? For example does adding Uncountable <: IteratorSize help formalizing definition 2? Adding traits for ordering and indexability seems useful too. The end result would be to error out in broadcastable if IteratorSize does not return HasLength or HasShape. (However, using infinite size iterator like Iterators.take((@lazy 2 .^ iter_all_primes() .- 1), 10) can be useful if we have non-materializing broadcast #19198)

This leaves sets which are not ordered (Interval and Set). This should probably be an error.

I think I agree. But as a brainstorming, I think one possibility is to use:

But this probably is "too dynamic" for people reading the code. I wish there was a generic customizable lifting syntax... (see also Status of question mark syntax for missing values - Development - JuliaLang)

dlfivefifty commented 5 years ago

I think definition 6 is closer to sets in math

This is definitely not true (see Wikipedia). Definition 1 is the closest, in that a set only has the operation in. All other set operations (union, intersection) are derived from the in operation. For example, we can easily derive a lazy union/intersection type that works for any types defined by Definition 1:

struct LazyUnion
  a
  b
end

in(x, d::LazyUnion) = x in d.a || x in d.b

Therefore LazyUnion satisfies Definition 1 provided a and b satisfy Definitino 1.

(The axiom of choice comes into play whether there is also a "choice" function but that's off topic.)

product-based broadcasting if all collections (except numbers) are not indexable ...But this probably is "too dynamic" for people reading the code

Agreed, though I would argue that its well-defined and consistent provided there is only one one set. Actually, I'd be happy if base just throws a "method not found" and then there could be another package SetBroadcasting.jl that implements a SetStyle for those who want set-like broadcasting.

tkf commented 5 years ago

Hmm... it still sounds problematic to me because, if an object has more properties than set, it can distinguish itself from the other by using such properties (ordering, multiplicity, etc.). For example, if you treat an Array as a set, using Base.== would be problematic because it does not follow the axiom of extensionality; i.e., arrays can use shape, ordering, and multiplicity to distinguish one from the other. I suppose you can say that the equality defined in terms of in should be used when treating arrays as sets. But this is equivalent to saying that arrays can be converted to sets which is not a very specific property and I think reduces the usefulness of the notion of set.

tkf commented 5 years ago

Actually, I'd be happy if base just throws a "method not found" and then there could be another package SetBroadcasting.jl that implements a SetStyle for those who want set-like broadcasting.

I agree that this is nice to have. But I prefer to do this without type-piracy, for example, using syntax like

setstyle.(set1 .+ set2)

where setstyle is a "magic" function that sets the broadcasting style.

Note that this requires us to use an approach different from #28491 which uses broadcastable(::AbstractSet) = throw(...). We need to add extra code to allow users to construct Broadcasted object and throw an error at (say) copyto! stage. This approach may be useful for #25904 as well (people can experiment with different implementation outside Base).

cossio commented 2 years ago

Bump. I would like to rekindle the discussion here. I recently had a bug related to this issue. I think we need a better solution here.

Moelf commented 2 years ago

So, as a short-term solution, I propose to forbid broadcasting with Sets which can easily be done by defining broadcastable(::AbstractSet), as done with Dicts:

is there objection to this? I don't think the current behavior is reliable so hopefully no code relies on the behavior (need PkgEval in any case)

cossio commented 2 years ago

I'd be fine with things like Set([1,2]) .+ Set([1,2]) and [1,2] .+ Set([1,2]) being errors.