JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

sparse-constructor does not always purge zeros #9928

Closed. mauro3 closed this issue 8 years ago.

mauro3 commented 9 years ago

The sparse constructor should purge all zeros, but it does not when two duplicate entries combine to zero:

julia> sparse([1,1], [2,2], [1,-1])
1x2 sparse matrix with 1 Int64 entries:
        [1, 2]  =  0

The expected result would be:

julia> sparse([1,1], [2,2], [1,-1])
1x2 sparse matrix with 0 Int64 entries:
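For reference, duplicate (i, j) pairs are reduced with the constructor's combine function, which defaults to +, so the stored zero above comes from 1 + (-1). A sketch of the same call with the combine function and dimensions (inferred as maximum(I) and maximum(J) when omitted) spelled out explicitly:

sparse([1,1], [2,2], [1,-1], 1, 2, +)   # same call, combine given explicitly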
mlubin commented 9 years ago

If handling this case requires a lot of extra work and reallocations, I wouldn't say this is necessarily a bug.

tkelman commented 9 years ago

What do Matlab, Octave, or SciPy do here?

mauro3 commented 9 years ago

I think Matlab never stores explicit zeros; it certainly removes them in this case.

SciPy is in the other camp: it never removes zeros unless explicitly told to with c.eliminate_zeros().

Looks like Julia is in the middle.

StefanKarpinski commented 9 years ago

Is it just me, or does being in the middle seem like the worst of both worlds? Both of the other behaviors are easy to predict and understand. When you currently ask "does this zero end up stored?", the only answer we can give is "it depends".

tkelman commented 9 years ago

To some extent, yes. I think the idea at the moment is we are trying to have a high-level, user-friendly interface for sparse matrices that does remove zeros, and a low-level, high-performance way of working with sparse matrices for experts and library authors that doesn't. If we're going to try to pick a direction and move more consistently one way vs the other, I'd very much prefer going more in the SciPy direction than the Matlab direction.

mauro3 commented 9 years ago

Yes, I agree, the SciPy direction would be better. It would also align with Julia's philosophy of not being too magical.

A bit off topic, but here is what happens in SciPy if one assigns to an unallocated spot:

In [34]: c[1,1] = 4
/usr/lib/python3.4/site-packages/scipy/sparse/compressed.py:730: SparseEfficiencyWarning: Changing the sparsity structure of a csc_matrix is expensive. lil_matrix is more efficient.
  SparseEfficiencyWarning)
mlubin commented 9 years ago

I'm not really bothered by the behavior because zeros are harmless, but I also prefer the SciPy approach.

ViralBShah commented 9 years ago

They are not completely harmless. Stored zeros will make some of our solvers trip, IIRC. We do need to remove these zeros in the default user-facing APIs. If someone wants to use stored zeros, I think it is reasonable to expect that they will work with the CSC data structure.

tkelman commented 9 years ago

I don't think removing them automatically is the right answer here. If there are a handful of places where they cause problems, then those call sites should explicitly canonicalize and remove stored zeros on entry, and these limitations should be documented.
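As a sketch, such explicit canonicalization at a call site could look like the following (hypothetical wrapper name; it assumes a dropzeros-style helper that removes stored zeros, which comes up later in this thread):

# Hypothetical solver boundary: canonicalize once on entry, so internal
# kernels may assume there are no stored zeros; the cost is paid only by
# callers of this documented-as-limited entry point.
function solve_canonical(A::SparseMatrixCSC, b::Vector)
    Ac = dropzeros(A)   # explicit removal (reallocates the structure)
    return Ac \ b
end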

timholy commented 9 years ago

Also agreed that SciPy-like is the way to go here, and +1 for @tkelman's proposal.

ViralBShah commented 9 years ago

What do stored zeros mean here? In the example presented by @mauro3, did the user intend to have a stored zero, or expect that the zero be removed?

The current sparse matrix design removes zeros that may appear as the result of an operation, much like Matlab and Octave. There are cases like this one where the stored zero is unintentional, and it should be considered a bug in the current design.

If we choose to go the SciPy route, then we could decide not to check for zeros that appear in the course of an operation, and let them become stored zeros. In that case, sparse() and setindex! might then allow a user to provide zeros in the input and have them stored. However, this is a design decision, and it requires the entire sparse matrix implementation to conform to it. It would eliminate zero checking in many places.

I am guessing that many users are familiar with the Matlab/Octave design, but developers like the SciPy choice, as it is much more flexible. I personally would be OK if we all want to change to the SciPy style. However, calling the current behaviour OK because we allow for stored zeros is confusing, as @StefanKarpinski notes. The behaviour should be well defined, consistent across all operations, and predictable.

zouhairm commented 9 years ago

As an end user, I want to throw my 2 cents in: it is definitely confusing that I can't figure out whether to expect the nzvals to always be non-zero or not.

As pointed out in my email to the julia-users Google group, this behavior is particularly confusing to an end user (especially since the help for nnz and countnz implies that they are equivalent for sparse matrices):

help?> nnz
Base.nnz(A)

   Returns the number of stored (filled) elements in a sparse matrix.

help?> countnz
Base.countnz(A)

   Counts the number of nonzero values in array A (dense or sparse). Note
   that this is not a constant-time operation. For sparse matrices, one
   should usually use "nnz", which returns the number of stored values.

A = spdiagm(ones(5))
println(nnz(A), countnz(A))  # prints 55
A[1,1] = 0
println(nnz(A), countnz(A))  # prints 44
A.nzval[1] = 0
println(nnz(A), countnz(A))  # prints 43

tkelman commented 9 years ago

Part of the problem is that when you say A[1,1] = 0 with a sparse A, you could mean a few different things. If A[1,1] is a stored nonzero, removing it from the sparse matrix requires an expensive reallocation of the entire data structure. If this sparse matrix is, say, the Jacobian in a nonlinear optimization problem where you're reusing the same structure repeatedly with different nonzero values, you could get a value that happens to equal 0 but only for a single iteration, and you don't want to reallocate the entire data structure for that.

We might need different ways to say "0", where one means stored and one means non-stored, or different behaviors for different methods of assignment, which is what we have now. The documentation should certainly be clearer that a "stored nonzero" can have a value equal to 0.
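A small sketch of the two behaviors we have now, using the same direct field access as in the example above:

A = speye(3)          # 3 stored entries
A[1,1] = 0            # setindex!: the entry is removed (a non-stored zero)
nnz(A)                # 2
A.nzval[1] = 0        # direct field write: the entry remains (a stored zero)
nnz(A), countnz(A)    # (2, 1)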

mlubin commented 9 years ago

@zouhairm, everyone wants something different from sparse matrices, so it would help to hear a bit about what you're trying to do with them. Currently, sparse matrices will only have stored zeros if the user explicitly puts them there.

zouhairm commented 9 years ago

Sure.

I'm using sparse matrices to store the transition probabilities T in an MDP. The state space is huge, but only a few transitions are possible. When it comes to solving the MDP, there's a matrix-vector multiplication T*v to be done (v is not sparse). This is done one row at a time for algorithmic reasons (it speeds up convergence).

In the past, I have simply used dot() and passed it the appropriate row of T and v, but I have an application where the values in v are repeated (i.e. v = [v1, v1, v1, v1, ...]), and rather than storing all of v, I want to store just the subvector v1 and still carry out the dot product. So I just want to iterate over the relevant columns of T and index into v1 appropriately to carry out the dot product.

So if there are zeros stored in the matrix, it doesn't make a difference, since they won't change the result. But the fact that findn needs to check that all entries are != 0 in order to get the non-zero indices seems inefficient to me... I'd be fine if findn returned indices to the "structural nonzeros", but of course I can see how this would be an issue for other users who expect the entries to be non-zero.

mlubin commented 9 years ago

Could you post some pseudocode on how you plan on extracting a row using findn? There's a fast way which doesn't involve findn and a number of potentially very slow ways. In either case, the overhead of checking for zeros in findn is negligible compared with making sure that the rest of the operation is implemented efficiently.

stevengj commented 9 years ago

See also the discussion of nnz in #6769.

zouhairm commented 9 years ago

@mlubin: I realized that Julia uses CSC instead of CSR, so finding the rows for a given column is easier. So I'm just going to store the transpose of the matrix I care about and use this instead:

#Returns row indices for non-zero entries in a given column
function findn_rows{Tv,Ti}(S::SparseMatrixCSC{Tv,Ti}, colIdx::Integer)
    return S.rowval[S.colptr[colIdx] : (S.colptr[colIdx+1]-1)]
end
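For context, a minimal sketch of carrying out the dot product this way (hypothetical col_dot; it assumes v is the subvector v1 tiled a whole number of times, so v[i] == v1[mod1(i, length(v1))]). Stored zeros are harmless here, since they only add 0 to the sum:

#Dot product of one column of S with the implicitly repeated vector v1
function col_dot{Tv,Ti}(S::SparseMatrixCSC{Tv,Ti}, colIdx::Integer, v1::Vector)
    n1 = length(v1)
    s = zero(Tv)
    for k = S.colptr[colIdx] : (S.colptr[colIdx+1]-1)
        s += S.nzval[k] * v1[mod1(S.rowval[k], n1)]  # map full index into v1
    end
    return s
end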
mlubin commented 9 years ago

@zouhairm, yes, that seems sensible. Be sure to avoid accessing elements as S[i,j] inside a loop.

ViralBShah commented 9 years ago

To fix this, we just need a pass in the sparse constructor to remove any stored zeros that occur from combination. This is pretty straightforward, but if one is feeling lazy, one could even just translate the cs_fkeep routine from CSparse.
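A rough sketch of such a pass (hypothetical name, in the spirit of cs_fkeep): compact rowval/nzval in place, keeping only the entries for which a predicate f(i, j, v) returns true. Dropping stored zeros is then fkeep_sketch!(A, (i, j, v) -> v != 0).

function fkeep_sketch!(A::SparseMatrixCSC, f)
    k = 1                                  # next write position
    for j = 1:size(A, 2)
        p, q = A.colptr[j], A.colptr[j+1] - 1
        A.colptr[j] = k                    # column j now starts at k
        for r = p:q
            if f(A.rowval[r], j, A.nzval[r])
                A.rowval[k] = A.rowval[r]  # safe: k <= r always
                A.nzval[k] = A.nzval[r]
                k += 1
            end
        end
    end
    A.colptr[end] = k
    resize!(A.rowval, k - 1)
    resize!(A.nzval, k - 1)
    return A
end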

KristofferC commented 9 years ago

Don't we already have fkeep? https://github.com/JuliaLang/julia/blob/3aed3f5489ab3d12155ba112644b3d86d7b42658/base/sparse/csparse.jl#L337

We also have dropzeros which takes away all stored zeros.
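For example, continuing zouhairm's snippet above (a sketch; the exact spelling and mutating variant of the era may differ):

A = spdiagm(ones(5))
A.nzval[1] = 0                # one stored zero
nnz(A), countnz(A)            # (5, 4)
nnz(dropzeros(A))             # 4: the stored zero is gone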

ViralBShah commented 9 years ago

Thanks. I was doing this at an airport with my laptop running out of charge. For now, I'll incorporate this, and the larger discussion on stored zeros can continue elsewhere later.

tkelman commented 8 years ago

closed by #15242