Closed: eldadHaber closed this issue 10 years ago
Sorry about this. You can add the following to your code:

```julia
const nnz = countnz
Base.nonzeros(s::SparseMatrixCSC) = nonzeros(s.nzval)
```

and then you won't need to change every use. Wrapping the second definition in `if !method_exists(nonzeros, (SparseMatrixCSC,)) ... end` will let the code work with both 0.2 and 0.3.
Personally I dislike the decision to allow explicit zeros in sparse matrices, but I think we should bring back the definition of `nonzeros` for sparse.
I find myself wanting to use `nnz` every once in a while. While we can retain stored zeros, it would be nice to have `nnz` and `nonzeros`.
My problem is not that changes are made (I'm also not a fan of this change, but that is not a big deal); my problem is backward compatibility. We now have a relatively large investment in a Julia library for geophysical inverse problems. If today a decision is made to retire one command, tomorrow it may be a different one. We cannot commit to building code when basic commands keep changing.
My request for the future: please keep backward compatibility in mind.
Thanks
Until Julia reaches version 1.0, I don't see the issue with the API evolving. Yes, this can be painful, but trying to be backward compatible across all the 0.x releases will not lead to a consistent 1.0 release.
We all know that the documentation can be improved, but this is a community project. There is nobody to blame for missing or wrong documentation; the project lives on participation.
Independent of this specific issue, I think there will be more breaking changes on the way to 1.0. If that is not OK for you (which is absolutely understandable), you might be better off with a programming environment that is more stable (e.g. Matlab).
Please continue to file issues like this for specific changes that cause problems. There is always a chance something got removed by mistake, or that we will change our minds.
`nonzeros` seems to have been botched a bit --- in all other cases (including `nnz`) we add deprecation warnings that let code continue to work. Also, adding your own aliases and work-arounds like those I posted above has no overhead, so in some cases that will be the right approach.
I think we just changed the method signature for `nonzeros` so that it only applies to `StridedArray` in c330b1d64efa566911cdd20921ca9419e58bacf9, hence the lack of a deprecation warning. The existing method was inefficient for sparse matrices, but it could be implemented efficiently by adding the definition above.
Yes, but there still should have been a deprecation warning or error for the removed functionality. It was an oversight.
And it highlights the not-very-good test coverage for sparse matrices. When this goes back in, it should have a test so that accidental removal without a deprecation gets caught in the future.
Well, the removal wasn't accidental. The tests were updated, but not the deprecations. Better test coverage would of course help in general with this kind of thing though.
`nnz` does have a deprecation. That is how I keep getting reminded that I like the name. Perhaps `nonzeros` was missed. Anyway, I will revisit this whole matter soon and iron out the issues.
@tkelman You are certainly right about the lack of tests and coverage for sparse matrices. Hoping to significantly improve sparse matrix support over the next few weeks.
Cc: @mlubin @iainnz
I don't think we should restore `nnz`, since it's so tempting to use yet would be the wrong choice for the majority of code that works with sparse matrices. As @tkelman mentioned in #6769, the representation details of sparse matrices are very important for writing efficient code. Users should have to explicitly choose whether they want the storage size of the matrix or the exact number of nonzero elements, which requires looping through the matrix.
Sorry @eldadHaber for the churn. The deprecation of `nnz` resulted from a change in the structure of `SparseMatrixCSC` to allow explicitly stored zero entries in the matrix, which is useful in a number of applications. This means that the exact number of nonzeros can no longer be computed in constant time. We believed that an explicit warning would be better than silently changing an O(1) operation into an O(number of stored entries) operation in users' code.
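The semantic split being described maps one-to-one onto SciPy's sparse API (SciPy comes up later in this thread), which can serve as a neutral illustration: `A.nnz` is the storage size (this thread's `nfilled`), while `A.count_nonzero()` scans the stored values (this thread's `countnz`). A minimal sketch, assuming NumPy and SciPy are available:

```python
import numpy as np
from scipy.sparse import csc_matrix

# 3x3 CSC matrix with one explicitly stored zero in its value array,
# analogous to a SparseMatrixCSC with a stored zero in nzval.
data = np.array([1.0, 0.0, 2.0])   # the 0.0 is explicitly stored
indices = np.array([0, 1, 2])      # row index of each stored value
indptr = np.array([0, 1, 2, 3])    # column start offsets
A = csc_matrix((data, indices, indptr), shape=(3, 3))

storage = A.nnz              # number of stored entries (O(1) field lookup)
exact = A.count_nonzero()    # exact nonzero count (scans the values)
print(storage, exact)        # 3 2
```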
Rereading the comments in #6769, I'll defer to @JeffBezanson if the consensus is that it's better to silently change the performance of `nnz`.
My view is based on looking at the other `nnz` (now `countnz`) methods, which clearly count elements such that `x != 0`. Of course representation details are important for performance, but `nfilled` is not a drop-in replacement for `nnz` --- it doesn't mean the same thing.
Right, it's not a drop-in replacement, so the deprecation calls `countnz`. My point is that the name `nnz` is misleading for sparse matrices, because almost anyone reading the code who hasn't read the Julia documentation would assume that it is a constant-time operation on sparse matrices. The `countnz` name is much more explicit and properly describes what the function is actually doing. It's an application of the principle of least surprise.
I consider the meaning of the operation primary, not its performance.
I guess I (admittedly not being a sparse expert) just don't see the value of trying to make people realize that `nnz` is O(n). If they need that function, there's nothing they can do about it, so what's the point? Is it the case that knowing the actual number of `!= 0` elements is not useful, and 90% of the time people actually want the storage size? I suspect not, but if so then I might change my mind.
In my experience when writing linear algebra operations with sparse matrices, `nfilled` is the quantity of interest, not `nnz`. Compare the number of times `nfilled` is used in `base/sparse/*`, which is 28 by my count, with the number of times `countnz` is used, twice. Maybe @eldadHaber can comment on how he uses `nnz`?
I have a somewhat different perspective on `nnz`. First, for real and complex matrices, it does not make sense to me to ask the question `nonzeros(A) != 0`. As we teach in any numerical analysis course, zero is not really 0. For real matrices we should be reporting `abs(nonzeros(A)) .<= tol`. So I fail to see what the fuss about this command is. When I use `nnz(A)`, I want to know roughly what storage is needed for some variable. The number of "true" nonzeros should be asked for with a tol.
It seems that this perspective falls under the `nfilled` use case, where you were using `nnz` to access the number of elements stored in the matrix, nonzero or otherwise, as opposed to counting the occurrences of exact nonzeros.
I am trying to say that counting nonzeros (for real and complex matrices) does not make sense anyway. Sure, we can have this functionality, but it is misleading. If A has real values, most around 1, and 10% of the values are smaller than 1e-16, then you still report those values as nonzeros.
@eldadHaber tolerancing and small numerical values are a separate question that applies to any operation involving floating-point numbers. Sparse algorithms are written in such a way that explicitly stored zero values don't change the output result (assuming you don't run into any overflow or NaN).
I want the storage size when I'm writing a sparse algorithm --- when explicit zeros need to be there, it's almost always for matrix structural reasons. For example, Pardiso requires storage to already exist for elements on the diagonal; ideally these would be stored zeros when the language allows for them, but the Matlab interface to Pardiso requires hacky, inaccurate workarounds like adding `eps * speye(n,n)`. Prior to 2009 Petsc did something similar (see https://bitbucket.org/petsc/petsc/src/cb6910d74d53/bin/matlab/PetscBinaryWrite.m#cl-31) when saving sparse matrices from Matlab into its binary format, though slightly more reliably, by using a small magic number and setting any element matching that magic number back to an explicit 0 before saving the binary file.
The Matlab interface to Ipopt is another case of this. The linear solver used within Ipopt, like the majority of sparse direct algorithms, separates the structure-based symbolic factorization allocation step from the value-based numerical factorization step. You need to evaluate the sparse Jacobian and Hessian matrices at every iteration, with the same sparsity structure but varying nonzero values. What happens if one of the "structurally nonzero" elements happens to have a value of exactly zero at some intermediate iteration? If your language automatically removes zeros from sparse matrices, you have to account for this by making a copy of the entire sparse matrix data structure at every iteration to put the nonzeros in the proper place. Bad times.
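The pattern described above (fixed symbolic structure, varying numeric values) can be sketched with SciPy's CSC arrays; this is an illustration of the idea, not Ipopt's or Pardiso's actual interface, and the matrix values are made up:

```python
import numpy as np
from scipy.sparse import csc_matrix

# A Jacobian with a fixed (made-up) sparsity pattern.
rows = np.array([0, 1, 1, 2])
cols = np.array([0, 0, 1, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])
A = csc_matrix((vals, (rows, cols)), shape=(3, 3))

# Record the structure a symbolic factorization would be keyed to.
saved_indices = A.indices.copy()
saved_indptr = A.indptr.copy()

# Next iterate: the (1, 0) entry happens to evaluate to exactly zero.
# Overwrite values only; the stored slot is kept.
A.data[:] = [1.0, 0.0, 3.0, 4.0]

# The structure is untouched, so no copy or re-analysis is needed.
structure_unchanged = (np.array_equal(A.indices, saved_indices)
                       and np.array_equal(A.indptr, saved_indptr))
print(structure_unchanged, A.nnz, A.count_nonzero())   # True 4 3
```

If zeros were instead dropped eagerly on assignment, the index arrays would change and the symbolic factorization would have to be redone (or the whole matrix copied) at every iteration.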
Thanks, that perspective is very helpful. In that case maybe we should use the general `count` function instead of either `nnz` or `countnz`. Or `sum(abs(A) .>= tol)` and the like.
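The `sum(abs(A) .>= tol)` idea amounts to tolerancing over the stored values only. A sketch in Python with SciPy, where `A.data` plays the role of `nzval`; the matrix and tolerance are invented for illustration:

```python
import numpy as np
from scipy.sparse import csc_matrix

A = csc_matrix(np.array([[1.0,   0.0],
                         [1e-17, 2.0]]))   # 1e-17: stored, but numerically zero
tol = 1e-12

# Count stored entries that are significant at the given tolerance,
# analogous to sum(abs(nonzeros(A)) .>= tol) in the comment above.
significant = int(np.sum(np.abs(A.data) >= tol))
print(A.nnz, significant)   # 3 2
```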
There was a lot of discussion on this in #5538, and I agree and have been arguing that `countnz` is a minority use case. That said, I don't think it's appropriate to have `nnz` return the value of `nfilled`, because, as @JeffBezanson mentioned, it's not the semantic meaning of the operation, which itself is general and is defined on other data structures. `countnz` could certainly be extended to use a numerical tolerance, but that's a separate issue from the function names.
I think that the tolerance should be a user choice. I'm trying to construct an example to demonstrate some of these issues.
It would make sense to add a `tol` argument to `countnz`.
What about `nonzeros`? It seems one might want it to just return the `nzval` array, and not filter out stored zeros. In that case it could use a rename as well.
@eldadHaber, just so you're aware: the "fuss" over this command arose because we, unlike Matlab, now allow you to store zeros in a sparse matrix. Check out the implications (the following code is Matlab):

```matlab
>> A = sprand(10^4, 10^4, 0.1);
>> [i,j] = ind2sub(size(A), find(A, 1));
>> tic; A(i,j) = 0.1; toc
Elapsed time is 0.000443 seconds.
>> tic; A(i,j) = 0.0; toc
Elapsed time is 0.037815 seconds.
```
Notice the second timing is nearly a hundredfold longer than the first: if you set a nonzero entry to zero, Matlab actually copies the entire data for the matrix, deleting that one entry from memory. That's really expensive. Matlab does this because it doesn't allow any stored value of a sparse matrix to be identically zero.
In contrast, we allow stored zeros. That will* make this kind of thing infinitely more efficient. But that means there's a difference between the amount of storage required and how many entries are not identically zero. Given that Matlab only has the `nnz` command, we realized we had better differentiate these two meanings.
Breaking old code is a no-no, though, and usually we are reasonably good about avoiding it thanks to the deprecation framework. It seems to have failed in this case.
*Currently it looks like the sparse-matrix `setindex!` needs updating so that it stops caring about whether the supplied value is 0.
Looks like elementwise comparisons on sparse matrices currently output dense `BitArray`s, which might not always be the best choice. So at the moment you'll want to use `A.nzval` (or `nonzeros` when that gets put back in, whatever its meaning) for any tolerancing operations where the implicit zero elements return false. Possibly worth specializing elementwise comparisons on sparse matrices to return sparse Boolean matrix results, which happen to be mostly full of trues if the user did an operation that gives a mostly-dense result?
`findnz` should be used for filtering out the nonzeros, so I'd agree that `nonzeros` should be renamed and return the `nzval` array.
@timholy I really like your example, but I think it's really a question of the application.
I can claim that for a sparse matrix that contains 0 values I would like to get a new sparse matrix, and this can be useful in linear solvers and in faster mat-vec products.
I think your argument is valid as well, and I like the idea of letting the user decide whether they want to generate a new "more economical" matrix (with these zeros removed) or use the same sparsity structure with some of the entries being 0. I can certainly see applications for both in finite volume/element discretizations of PDEs.
As long as we document the two AND do not retire functions that have been used before without a really good reason (where is nonzeros ...), I'm happy with either option.
There are two separate things going on simultaneously here, as already discussed. For general users, we try to maintain sparse matrices without stored zeros in them. But, for developers of libraries and sophisticated usage, we do allow stored zeros.
If we introduce a flag saying whether a sparse matrix contains stored zeros, we can effectively reintroduce `nnz`, which for all general usage will be fast enough. Only a few operations should need to deal with this flag. I feel that `nonzeros` should just give you a copy of the `nzval` vector without doing a zero check, since general user-generated sparse matrices should not contain zeros in them.
I think `nnz` is not a great name for a function. We also have `count_zeros` in Base. Having `countnz` and `count_zeros` cries out for some name unification. I know `count_zeros` is currently only available for counting the bits in an integer, but being a generic function it could be implemented for any collection.
Introducing a flag increases the complexity of the data structure and the cognitive load on anyone using it, since they essentially now need to deal with two different data structures whose methods have different behavior and performance. This doesn't seem worth it just to be able to use the name `nnz`, which couldn't even replace `nfilled` in Base because functions in Base need to accept matrices with stored zeros. So we're just talking about user code, where it's pretty easy to define `nnz = nfilled` if they really like the name.
@ViralBShah I'm not sure I buy that --- stored zeros are either allowed or not. The idea of "structurally nonzero" entries is interesting, though. Since assigning zero removes the entry by default, how do we deal with that use case?
I've never liked the names `count_zeros` and `count_ones`; they are quite non-standard for those bit operations. The bit operations should be distinct from any collection operations. Better names would be `bitcount` or `popcount`. I'm not sure there is a standard name for counting zero bits, though.
I actually think that nnz is a good command. It is used both in Matlab and in Python's SciPy, so it's only natural to use it here. We should be less concerned about being different; it makes the language harder for newcomers.
I think that we should have a command `nnz(A, tol)`, where `nnz(A, tol) = length(find(abs(nonzeros(A)) .> tol))`.
So we have `nnz(A) = nnz(A, 0)`, the number of nonzeros stored in the array, and if we are interested in the real number of nonzeros then we use a tol.
This makes the usage simpler, similar to other languages, and accommodates the idea of storing 0's in the matrix.
By the way, similar logic works for incomplete Cholesky.
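A sketch of what the proposed interface might look like, over SciPy's CSC type with `A.data` standing in for `nonzeros(A)`; the name `nnz_tol`, the `tol=None` default, and the strict `> tol` reading of the proposal are all my assumptions, not an agreed design:

```python
import numpy as np
from scipy.sparse import csc_matrix

def nnz_tol(A, tol=None):
    """Hypothetical nnz(A, tol) from the proposal above.

    tol is None -> the number of stored entries (the storage size).
    tol given   -> the number of stored entries with magnitude above tol,
                   i.e. the "real" number of nonzeros at that tolerance.
    """
    if tol is None:
        return A.nnz
    return int(np.count_nonzero(np.abs(A.data) > tol))

A = csc_matrix(np.array([[1.0,   0.0],
                         [1e-17, 2.0]]))
print(nnz_tol(A))          # 3: stored entries
print(nnz_tol(A, 1e-12))   # 2: entries significant at tol
```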
Sparse matrices can store any type of object, including types which don't necessarily have a reasonable definition of `abs`, so baking the notion of a tolerance into the definition of `nnz` loses a lot of generality. This was also discussed in #5538, where `nfilled` was chosen partially because it is sufficiently generic and allows for future extensions to real-valued sparse matrices with nonzero "default" values.
Couldn't `nnz` then be restricted to `AbstractArray{Number}`?
@JeffBezanson The situation is not as black and white with stored zeros. In the entire sparse matrix library in Base, we filter out zeros on input as well as zeros that are generated in operations. This is the usual expected behaviour for most users.
```julia
julia> s = sparse([1, 2], [1, 2], [1.0, 0.0])
2x2 sparse matrix with 1 Float64 entries:
	[1, 1]  =  1.0
```
The only way to get stored zeros is to use the `SparseMatrixCSC` constructors. Basically, this is only for folks who know the underlying details of the data structure and know what they are doing. This is typically for cases where the nonzero matrix structure is known upfront and fill-in may happen in a later operation. Given that we do support stored zeros and want to make them first class, we should probably also have a better interface for creating matrices with stored zeros. For example, `sparse` could have a dispatch for `stored_zeros=true`.
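For comparison, SciPy draws the same line between the two construction paths: building directly from index/value arrays keeps explicit zeros, and the input filtering that `sparse` performs corresponds to an explicit `eliminate_zeros()` call there. A sketch, assuming SciPy:

```python
import numpy as np
from scipy.sparse import csc_matrix

# Direct construction from coordinates keeps the explicit 0.0, like
# building a SparseMatrixCSC from raw colptr/rowval/nzval arrays.
rows = np.array([0, 1])
cols = np.array([0, 1])
A = csc_matrix((np.array([1.0, 0.0]), (rows, cols)), shape=(2, 2))

stored_before = A.nnz   # 2: the stored zero counts toward storage
A.eliminate_zeros()     # the zero filtering sparse() applies on input
stored_after = A.nnz    # 1

print(stored_before, stored_after)   # 2 1
```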
@mlubin I disagree that we should be asking users to define `nnz = nfilled` if they like the name. This is not a case of one user preferring one name, but of a large number of users who are used to a certain set of names and behaviours, and this would ease the transition.
I do agree that the cognitive overload increases with a flag, and that should be avoided. It was just a thought.
Rethinking all of this, what if `nfilled` were renamed to `nnz` and we made it clear that it is the number of structural nonzeros, as suggested above? `nonzeros` would then return all the structural nonzeros. In my opinion, one name is not necessarily better or clearer than the other. Users who do not care about stored zeros will never run into the issue. For those used to Matlab and Octave, the transition will be easier with familiar names. Those who do need stored zeros will know that these are structural nonzeros.
@eldadHaber As discussed here, `nnz` is typically about structural nonzeros.
@mlubin @kmsquire We have a bunch of sparse capabilities that assume a sparse matrix of numbers. It would be good to restrict certain things like linear algebra to numbers only, where the operations do not generalize further. What we really need is an N-d `coo` format, which would be far more general for storing objects in a sparse array container.
I really like your proposal to bring back nnz and nonzeros.
I am one of the users who like the idea of a nonzero pattern (I use it in FEM discretizations of PDEs), but I think you are right that it is for more sophisticated users. There is much to be said for compatibility with Matlab, SciPy and Octave.
Compatibility with Matlab (and SciPy) has not really been a strong reason for naming things when you look at other issues. The deprecation of `max` in favor of `maximum` is a recent example that in my opinion has a much larger impact than the deprecation of `nnz` (and I am not happy about `maximum`, but I understand the reasoning behind it).
One further thing is that multiple dispatch quite naturally enforces different naming in various situations, because the type that an operation is exposed to does not have to be encoded into the function name. I contributed some code to the Gtk.jl project, and 90% of the time I overloaded existing methods (e.g. `push!` for adding something to a GtkListStore). I could have tried to use naming similar to the Python Gtk wrapper (e.g. `add` instead of `push!`), but this would not have been the Julian way (and Jameson would of course not have accepted the patch).
I actually don't have a strong feeling in the case of `nnz`. I just want to make clear that, from what I have seen, Matlab compatibility has not been a very important consideration in the past. @StefanKarpinski might correct me if I am wrong here. I just think it would be fair to Eldad that he not have too high expectations of Matlab compatibility.
PS. Searching this issue tracker for `meshgrid` also turns up a discussion of Matlab compatibility. Reference semantics on arrays vs copy-on-write in Matlab is also a major difference.
You are absolutely right: compatibility with Matlab or any other system is not an important driver for choosing names in Julia. However, we retain compatible names as long as they are meaningful and not kludgy. We often change names when something is confusing or poorly named.
On balance, IMO, going with the structural-nonzero rationalization, `nnz` and `nonzeros` are good names and behave as expected for all.
Yes, but you can't see `nnz` in isolation. It is not optimal that we have `nnz`, `count_zeros` and `count_ones` as generic functions in Base.
`nnz` and `count_ones`/`count_zeros` are similar-sounding operations, but I do not believe that these names need unification. The `count_ones`/`count_zeros` names are not great, and `bitcount` was what struck me as a better name.
I'm okay with renaming `nfilled` to `nnz` now that the `countnz` functionality has been separated out.
Here's me trying to ruin the party as usual: if `nonzeros` means structural nonzeros, then do we have `nonzeros(A::Array) = A`?
I'd argue that `nnz` shouldn't be defined for generic arrays; `countnz` should be used instead. `nonzeros` is less clear...
Someone has decided to change nnz and to remove the nonzeros command.
I find this unacceptable! It is bad practice not to allow for backward compatibility. Many of our codes use these commands, and we have to go and replace them in all our previous versions.
We are investing a considerable amount of time trying to develop codes in Julia for geophysical inversion and truly believe that it is the way to go. If new Julia versions are not going to be backward compatible, we will need to rethink our development.
Please think about this whenever you want to retire a command.