JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.77k stars 5.49k forks source link

`map` on sparse arrays does not work with non-numeric data #19561

Closed amitmurthy closed 7 years ago

amitmurthy commented 7 years ago
julia> map(x-> x != 0.0 ? "Hello" : x, sparse(eye(4)))
ERROR: MethodError: no method matching zero(::Type{Any})

Also a problem when mixing ints and floats.

julia> map(x-> x != 0.0 ? 1 : x, sparse(eye(4)))
ERROR: MethodError: no method matching zero(::Type{Any})

If sparse arrays are not meant to be used with non-numeric data, we should at least throw a better error message.

oscardssmith commented 7 years ago

What version of Julia are you using? on 0.5.0 this works

amitmurthy commented 7 years ago

master. map was recently implemented for sparse arrays.

Sacha0 commented 7 years ago

The new map[!]/broadcast[!] methods for SparseMatrixCSCs check whether the map/broadcast operation yields zero when all arguments are zero. Specifically, for e.g. map(f, A, B), the check is

fofzeros = f(zero(eltype(A)), zero(eltype(B)))
fpreszeros = fofzeros == zero(fofzeros)

This check requires that (1) the input array eltypes provide zero and (2) the return type of f (evaluated for arguments of the input array eltypes) provides zero.

The first requirement is fundamental: A sparse matrix C is only well defined if zero(eltype(C)) is well defined, as otherwise C's unstored entries are not well defined (in which case maping over C makes little sense).

The second requirement is relaxable: We need only the result of the hypothetical iszero(fofzeros). Not having iszero, we must use either fofzeros == zero(fofzeros) or fofzeros == 0 (alternatives?). The former handles units gracefully, but fails where zero(fofzeros) is not defined. The latter does not handle units gracefully, but works in some other cases where zero(fofzeros) is not defined.

Both examples above violate the second requirement and would work given iszero (without tradeoff) or with fofzeros == 0 in place of fofzeros == zero(fofzeros) (trading off handling units).

Thoughts? Best!

oscardssmith commented 7 years ago

alternatively wouldn't defining zero for type any work? I'm not sure what that would be though.

tkelman commented 7 years ago

If you only ever do operations on the stored entries (or their locations), then it doesn't always matter whether the unstored entries have a realizable value in the same element type as the stored entries. I've found sparse matrices of symbols or functions to be useful on occasion as a data structure, where it was useful to have them in CSC format, do concatenations or similar operations.

We should just add iszero.

Sacha0 commented 7 years ago

If you only ever do operations on the stored entries (or their locations), then it doesn't always matter whether the unstored entries have a realizable value in the same element type as the stored entries. I've found sparse matrices of symbols or functions to be useful on occasion as a data structure, where it was useful to have them in CSC format, do concatenations or similar operations.

Absolutely, SparseMatrixCSCs with only stored entries well defined have utility. Applying whole-array operations (e.g. map) to such objects makes little sense though; the hypothetical mapstored seems the appropriate operation in that case.

We should just add iszero.

+1 :). Best!

nalimilan commented 7 years ago

But how would iszero help here? We wouldn't define it on Symbol, String nor Function, right?

Sacha0 commented 7 years ago

But how would iszero help here? We wouldn't define it on Symbol, String nor Function, right?

In both cases above, iszero(fofzeros) could check fofzeros == 0 rather than fofzeros == zero(fofzeros), skirting lack of definition of zero for Any?

Sacha0 commented 7 years ago

Looked closer and I'm mistaken. The zero preservation check discussed above works as written in these cases. Rather, checking whether output entry Cx = f(xs...) is zero via Cx != zero(eltype(C)) causes the reported failures. Replacing that check with iszero(Cx), or shy of that Cx != zero(Cx), would fix the second reported failure (the output sparse matrix having mixed numeric type). But with a bit more thought, the present semantics of map[!]/broadcast[!] over sparse matrices (not storing zeros) preclude making the first reported failure work without the pun of defining zero for Strings (or at least comparison with 0). Best!

Sacha0 commented 7 years ago

But with a bit more thought, the present semantics of map[!]/broadcast[!] over sparse matrices (not storing zeros) preclude making the first reported failure work without the pun of defining zero for Strings (or at least comparison with 0).

That thought was silly. iszero would save us in that case as well, there being no problem with comparison against 0: Replacing Cx != zero(eltype(C)) with iszero(Cx) and having the fallback iszero(x) = x == 0 would fix the first reported failure as well. Thoughts? Best!

stevengj commented 7 years ago

Using sparse arrays/vectors for non-algebraic data (types that do not define zero) seems like a bad pun to me. These are linear-algebra objects.

stevengj commented 7 years ago

For example, what does getindex return for the non-stored entries if zero is not defined?

stevengj commented 7 years ago

If you just want a general "sparse" associative "array" from Int to a random type T, you should use Dict or similar.

stevengj commented 7 years ago

And if you want to map the nonzero elements of a sparse array to non-numeric values, you should use map(f, nonzeros(A)).

tkelman commented 7 years ago

Dict isn't CSC. It's useful on occasion to have a sparse array that has the same index iteration structure as a numeric SparseMatrixCSC, with effectively undefined non-stored entries. Linear algebra operations won't work on them, but many array manipulation operations will.

stevengj commented 7 years ago

Lots of things are occasionally useful but lead to poor library design. We normally try to avoid bad type puns in Base.

stevengj commented 7 years ago

Could you give a concrete example where nonzeros(A) would not suffice?

tkelman commented 7 years ago

We allow non numeric or heterogeneously typed dense arrays. Allowing for the same with sparse matrices is a feature that it would be a regression to lose. Not being able to do linear algebra on some array types doesn't always make it a pun or bad library design to support.

For this particular example of map I agree nonzeros is mostly good enough. Will have to check whether the assumption that zero(eltype(A)) exists impacts other use cases.

stevengj commented 7 years ago

The definition of sparsity here is "mostly zero", which is what makes it seem like a pun, and rather different from an Associative type (a partial function), which is what you seem to want to use it for.

stevengj commented 7 years ago

The generalization would make more sense if we allowed an arbitrary default value, but that could cause trouble with code expecting "sparse" to mean "default zero"

Maybe we should have a DefaultArray type with an arbitrary default value, and have SparseMatrixCSC be a subtype?

Sacha0 commented 7 years ago

Ref. https://github.com/JuliaComputing/IndexedTables.jl

oscardssmith commented 7 years ago

@stevenj this is a really good idea. It will also provide a path where we don't return dense arrays when we broadcast nonzero preserving functions, which could be a huge win for memory usage.

stevengj commented 7 years ago

If we want to use it for avoiding dense broadcast output, then the sparse case can't be a subtype (wouldn't be type stable). I dunno, maybe all sparse arrays should be DefaultArrays after all. @StefanKarpinski was arguing for this in another issue, and I was skeptical, but maybe it's worth the trouble.

oscardssmith commented 7 years ago

The really big advantage would be that all broadcasts would be efficient and produce the same type no matter what. Also, if I understood his proposal correctly, we could do even better, as his was a way to cheat a default from a sparse, but if it were going on base, we could further simplify the code.

Sacha0 commented 7 years ago

For a potential stopgap solution, please see https://github.com/JuliaLang/julia/pull/17623#discussion_r92890116. Best!

If some feel strongly about the stricter iszero definition while others feel strongly about {map/broadcast over sparse matrices where the output element types don't provide zero}, we can accommodate both desires with a strict iszero definition and a _sparsebc_iszero which wraps iszero but provides the permissive fallback.

Sacha0 commented 7 years ago

Type inference seems to have improved (:tada:) to the point that half of the tests for this issue are ineffective:

julia> intoneorfloatzero(x) = x != 0.0 ? Int(1) : Float64(x)
intoneorfloatzero (generic function with 1 method)

julia> foo = map(intoneorfloatzero, speye(4))
4×4 SparseMatrixCSC{Union{Float64, Int64},Int64} with 4 stored entries:
  [1, 1]  =  1
  [2, 2]  =  1
  [3, 3]  =  1
  [4, 4]  =  1

julia> eltype(foo)
Union{Float64, Int64}

julia> zero(eltype(foo))
0

(Previously zero(eltype(foo)) would fail.) Additionally, I found four code paths not fixed by #19589 and missed by the existing tests. Stronger tests and fix inbound. Best!

amitmurthy commented 7 years ago

Closed by #20862 I believe.

andreasnoack commented 7 years ago

As mentioned in https://github.com/JuliaLang/julia/pull/22945#issuecomment-317923344, the consequence of the fix here might cause some trouble elsewhere. @amitmurthy would you be able to share some real world examples where map on a SparseMatrix{String,Int} would be useful?

amitmurthy commented 7 years ago

The specific requirement was for a SparseMatrix{Ref,Int} in light of how asyncmap (and hence pmap) is implemented. While still a TODO - https://github.com/JuliaLang/julia/blob/0a37b3d42a0b159aad0eebefbcf3280dda2966b4/base/asyncmap.jl#L263-L268 - it can be optimized if SparseMatrix{Ref,Int} continues to be supported.

Other than the above I don't have any other real-world examples for non-numeric sparse arrays.

Sacha0 commented 7 years ago

Above @tkelman mentions use cases for SparseMatrixCSCs of symbols and/or functions:

I've found sparse matrices of symbols or functions to be useful on occasion as a data structure, where it was useful to have them in CSC format, do concatenations or similar operations.