JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.74k stars 5.48k forks source link

behavior of sum, prod, mapreduce... on empty arrays #8092

Closed my-little-repository closed 9 years ago

my-little-repository commented 10 years ago

I just upgraded to julia version 0.3.1-pre+4. Now sum([]), prod([]), reduce(+,[]), mapreduce(x->0,+,[]) all throw an error whereas a sound value was returned in julia v0.2 (namely 0,1,0 and 0).

As far as I understand, this comes from the fact that these functions are based on mapreduce and that mapreduce now throws an error on the empty array. Previously, it was returning op(), which was very nice since +() returned 0 and *() returned 1, which gave exactly what was needed/expected.

Now +() returns an error whereas *() returns "". I don't understand why strings were privilegied over numbers here. Julia is targeted at numerics after all. I really would like the previous behavior back. I must now check manually that the array argument is non-empty every time I use sum,prod, reduce... It is both tedious,ugly and error prone (everybody will forget to check their sum). And we can always return a meaningful error if op() is not defined.

timholy commented 10 years ago

What type of 0 would you like back? Int, Uint8, Float32, or something else?

timholy commented 10 years ago

There is no zero(Any), and Any is the element type of []. I should point out that

julia> sum(Float32[])
0.0f0

So all you have to do it make sure your arrays are typed.

johnmyleswhite commented 10 years ago

Nitpick: I think the element type of [] is actually None.

timholy commented 10 years ago

You are indeed correct!

JeffBezanson commented 10 years ago

Argh I dislike * on strings more all the time. *() should not be defined, or at the very least shouldn't return a string. There used to be a *() = 1, and removing it exposed this behavior.

+() = 0 was nice in some cases, but doesn't really work in general since it depends on what group you're working in.

Keno commented 10 years ago

So, here is a fun thought:

*() = true
+() = false
*(x::String,b::Bool) = b?x:error()
my-little-repository commented 10 years ago

@timholy Is it really a problem? 0 and 1 play well with all the number types. I think that Float64 was the default type for numerics on 64bit machines. A lot of stuff was silently promoted to Float64 in the previous version. I felt that by default, julia was targeted at numerics using the basic underlying type of the machine. Is it now taking a more symbolic approach?

Also, thanks for the typing advice, unfortunately it does not work for mapreduce.

  julia> mapreduce(x->whatever,+,Float64[])
 ERROR: Reducing over an empty array is not allowed.

If you want to return a number-type-agnostic result, maybe it is possible to define a mathconst 0 and a mathconst 1, like pi or e. 0 is certainly as constant, and as remarkable, as pi. Or create a new type for these remarkable numbers.

JeffBezanson commented 10 years ago

It would be nice for mapreduce to do whatever reduce is doing:

julia> sum(Int[])
0

julia> reduce(+,Int[])
0

julia> reduce(*,Int[])
1

julia> mapreduce(identity,+,Int[])
ERROR: Reducing over an empty array is not allowed.
 in _mapreduce at reduce.jl:154
 in mapreduce at reduce.jl:176

Can't be too sure offhand, but it seems like that should work.

timholy commented 10 years ago

Is it really a problem? 0 and 1 play well with all the number types.

Not really:

julia> using SIUnits

julia> x = 2Meter
2 m 

julia> x + 0
ERROR: Unit mismatch. Got (m ) + ()
 in error at error.jl:21
 in + at /home/tim/.julia/v0.3/SIUnits/src/SIUnits.jl:182
 in + at promotion.jl:158

julia> isa(x, Number)
true

Your MathConst idea is indeed clever and would be effective, but it has a major downside: it would make all of the reduction functions type-unstable.

Bummer about mapreduce though. One would want that to work. Of course, one subtlety is that if f(x::T1)::T2, but you don't have any elements to test out f on, then you have to rely on the results of type inference. Which I think we don't do right now.

timholy commented 10 years ago

And of course type-instability also applies if you return 0::Int or 0.0::Float64.

my-little-repository commented 10 years ago

@timholy But this is the desired behaviour with SIUnit, no? You want to return an error if meters are added to some number which has not been explicitely typed as meter. The goal here is to prevent conversion errors and that works.

When it comes to the mathconst idea, it may not be so clever, I just remembered that pi is not handled everywhere e.g. [https://github.com/JuliaLang/julia/issues/7994].

I guess what is needed here for +() is a 0 of type number. And then adding it to something would simply return that something (fast! Actually can't be faster and is type-safe). Note that zero and one are the only candidates to be instances of the abstract type Number since they are the only one having representatives in all the related concrete subtypes. But we can't instantiate an abstract type...

Or a zero with methods

 julia> methods(0)
  0::Int64  at mysterious.jl:2718
  0::Uint8  at mysterious.jl:314
  0::Float64 at mysterious.jl:1.0
  etc

I still think at the moment that the most practical approach is to return the underlying machine type.

timholy commented 10 years ago

There is a zero function, but it can't be defined for all types. For example, what is zero(Array{Int,2})----how big should the matrix be?

timholy commented 10 years ago

But I think the bottom line is that if mapreduce can be made to mimic the behavior of the others, you'd be happy, right? Care to look into how the others work and see if you can port the behavior over to mapreduce?

JeffBezanson commented 10 years ago

@timholy You're right about the reason for the mapreduce behavior.

toivoh commented 10 years ago

How about an optional argument of mapreduce that gives the fallback value for empty iterables / a starting value for the reduction? That should let the user handle most cases where this is an issue.

my-little-repository commented 10 years ago

Ah! Toivoh's suggestion provides an easy fix to the problem.

  julia> mapreduce(x->2x,+,[])
  ERROR: Reducing over an empty array is not allowed.
   in _mapreduce at reduce.jl:154
   in mapreduce at reduce.jl:176

  julia> mapreduce(x->2x,+,0,[])
  0

So I can specify the (neutral) value for the empty list just by providing a starting argument and recover the previous behavior. This is a minimal change. Thanks. Maybe this could be explained in the v0.3 News file?

@timholy zero(Array{Int,2}) returns an error and this is fine. The point is that 1 should be returned for the * operator. op() was a nice way to specify the neutral element for the operator. Anyway, I can live with the current situation. I still think that this is an unnecessary breakage in v0.3 that could have waited v0.4.

IainNZ commented 10 years ago

@my-little-repository 0.3 has multiple breaking changes, there was no reason to wait. I also notice you keep using [] - you should be using {} or Type[], [] never comes up in practice.

my-little-repository commented 10 years ago

@IainNZ nitpicking but you can grep your julia files and find many occurrences of [] (45 for me; you want it in your test files for example, but you can find it also in lapack.jl or array.jl).

Anyway I hope it did not obscur the fact that my complaint was about the new behavior of mapreduce breaking production code (that uses typed arrays) in the v0.3 stable release, not about [] specifically.

By the way, it seems that mapreduce (and probably reduce) has become slow on some benchmarks. I get

  julia> @time mapreduce(big,*,1:20000);
  elapsed time: 0.308556927 seconds (305406800 bytes allocated, 64.66% gc time)

Here is what I had a few months ago

  julia> @time mapreduce(BigInt, *, 1:20000);
  elapsed time: 0.040044266 seconds (3167912 bytes allocated)

Unfortunately I don't remember what version I was using and I uninstalled v0.2, so I can't be sure. I will try to reproduce and submit a report if it is confirmed.

@timholy It would indeed be nice if mapreduce was made to mimick the behavior of reduce when evaluated on an empty array.

kmsquire commented 10 years ago

@my-little-repository, are those uses of [] typed or untyped? Iain was just pointing out that untyped [] is equivalent to None[] in Julia, which is pretty useless (it can hold all of the None values you want, and nothing else).

If you see any uses without type annotations in front on base, can you open up an issue or PR? Thanks!

timholy commented 10 years ago

That's a pretty major performance regression. It's mostly here. Has something about BigInts changed? I know nothing about them personally.

my-little-repository commented 10 years ago

@kmsquire Untyped, I think. Here is the list I get from running M-x rgrep RET " \[\]" RET *.jl RET ~/path/to/source/directory RET in emacs (you can also exec the following form in the scratch buffer

  (let ((sources "~/path/to/julia/sources/"))
      (rgrep " \\[\\]" "*.jl" sources))

assuming you're using emacs and grep is installed. Actually this could be done with a simple julia programme but I don't see a facility in base to easily recurse through a directory.)

This is with an install of julia yesterday from source. Version 0.3.1-pre+4 (2014-08-22 11:13 UTC)

  ./test/perf/kernel/json.jl:164:                val = []
  ./test/linalg3.jl:211:@test diag(A,-4) == []
  ./test/linalg3.jl:218:@test diag(A, 3) == []
  ./test/linalg3.jl:221:@test diag(zeros(0,0)) == []
  ./test/linalg3.jl:225:@test diag(zeros(1,0)) == []
  ./test/linalg3.jl:226:@test diag(zeros(1,0),-1) == []
  ./test/linalg3.jl:230:@test diag(zeros(0,1)) == []
  ./test/linalg3.jl:231:@test diag(zeros(0,1),1) == []
  ./test/functional.jl:16:# map over [] should return []
  ./test/functional.jl:18:@test isequal(typeof(map(x -> x, [])), Array{None,1})
  ./test/functional.jl:32:@test isequal(filter(x->(x>10), [0 1 2 3 2 1 0]), [])
  ./test/random.jl:3:@test find(x .== rand()) == []
  ./test/core.jl:1010:    p = []
  ./test/arrayops.jl:665:    c1 = mapslices(x-> maximum(-x), a, [])
  ./test/arrayops.jl:876:@test []' == Array(None,1,0)
  ./test/ranges.jl:211:@test [0.0:-0.1:1.0]  == []
  ./test/ranges.jl:212:@test [0.0:0.1:-1.0]  == []
  ./test/ranges.jl:224:@test [0.0:-1.0:0.5]  == []
  ./test/collections.jl:392:                ("hello", 23, 2.7, (), [], (1,8)))
  ./test/statistics.jl:249:@test hist([])[2] == []
  ./test/strings.jl:509:@test isequal(split("", ',', false), [])
  ./test/strings.jl:510:@test isequal(split(",", ',', false), [])
  ./test/strings.jl:511:@test isequal(split(",,", ',', false), [])
  ./test/strings.jl:529:@test isequal(rsplit("", ',', false), [])
  ./test/strings.jl:530:@test isequal(rsplit(",", ',', false), [])
  ./test/strings.jl:531:@test isequal(rsplit(",,", ',', false), [])
  ./usr/share/julia/helpdb.jl:795:      julia> a = []
  ./base/pkg.jl:62:generate(pkg::String, license::String; force::Bool=false, authors::Union(String,Array) = [], config::Dict=Dict()) =
  ./base/show.jl:562:            show_block(io, "catch", is(args[2], false) ? [] : args[2], args[3], indent)
  ./base/show.jl:565:            show_block(io, "finally", [], args[4], indent)
  ./base/linalg/lapack.jl:3360:                    &jobvs, &'N', [], &n, 
  ./base/linalg/lapack.jl:3363:                        &lwork, [], info)
  ./base/linalg/lapack.jl:3404:                    &jobvsl, &jobvsr, &'N', [], 
  ./base/linalg/lapack.jl:3408:                    &ldvsr, work, &lwork, [], 
  ./base/linalg/lapack.jl:3450:                    &jobvs, &sort, [], &n, 
  ./base/linalg/lapack.jl:3453:                        rwork, [], info)
  ./base/linalg/lapack.jl:3495:                    &jobvsl, &jobvsr, &'N', [], 
  ./base/linalg/lapack.jl:3499:                    work, &lwork, rwork, [], 
  ./base/abstractarray.jl:794:hvcat(rows::(Int...)) = []
  ./base/dSFMT.jl:589:for (lhs, rhs) in (([], []), 
  ./base/LineEdit.jl:85:complete_line(c::EmptyCompletionProvider, s) = []
  ./base/LineEdit.jl:639:default_completion_cb(::IOBuffer) = []
  ./base/replutil.jl:65:            print(io, "Use square brackets [] for indexing.")
  ./base/array.jl:592:const _default_splice = []
  ./doc/helpdb.jl:795:      julia> a = []

This looks fine to me, I am not sure why [] is frowned upon. Maybe in array.jl the const _default_slice should be typed? hvcat returning [] may be suspicious also. But then what should it return? {}?

  julia> hvcat(1)
  0-element Array{None,1}
  julia> [[],[1,3,3]]'   # nice
  1x3 Array{Int64,2}:
   1  3  3
  julia> [{},[1,3,3]]'   # not so nice
  1x3 Array{Any,2}:
   1  3  3

Should I submit an issue?

StefanKarpinski commented 10 years ago

You can't add any elements to [] which means it's a bit limited and unintuitive. We are, however, probably going to make [] create an empty Vector{Any} and also make comprehensions (and map, etc.) produce arrays whose element type is the typejoin of their values. That will make [] growable and thus much more useful.

nalimilan commented 10 years ago

@my-little-repository You'll notice that the vast majority of the occurrences are for equality checks in tests, which is fine since all empty sets are equal, disregarding the type of their elements. There are some occurrences which would need further examination, though.

my-little-repository commented 10 years ago

@timholy Concerning mapreduce(big,*,1:20000), if it is using the sequential algorithm, then it is necessarily slower than what it used to be. In previous versions (at least until may: 0.3.0-prerelease+3303 (2014-05-28 19:23 UTC)) it was using the pairwise algorithm which performs far better on that test.

In the sequential algorithm ((((1*2)*3)*4)...), the switch to BigInt values occurs at the twentiest iteration. So most products are products with a BigInt (slow). In the pairwise algorithm ((1*2)*(3*4))*((5*6)*(7*8))) the large majority of products are products of Int64 (fast). The recursive pairwise algorithm beats loops as soon as the cost of op(a,b) increases during the computation.

my-little-repository commented 10 years ago

I will open separate issues for the mapreduce performance regression and the untyped [] check.

8100

8101