JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.6k stars 5.48k forks source link

`filter!` corrupts `SparseVector` #53920

Open CameronBieganek opened 6 months ago

CameronBieganek commented 6 months ago

If you try to use filter! on a SparseVector, it fails with an internal method error for deleteat!. And then your sparse vector is corrupted:

julia> using SparseArrays

julia> x = sparsevec(1:2:7, 1:4, 8)
8-element SparseVector{Int64, Int64} with 4 stored entries:
  [1]  =  1
  [3]  =  2
  [5]  =  3
  [7]  =  4

julia> filter!(iseven, x)
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::UnitRange{Int64})

Closest candidates are:
  deleteat!(::Vector, ::AbstractUnitRange{<:Integer})
   @ Base array.jl:1525
  deleteat!(::Vector, ::AbstractVector)
   @ Base array.jl:1567
  deleteat!(::Vector, ::Any)
   @ Base array.jl:1566
  ...

Stacktrace:
 [1] filter!(f::typeof(iseven), a::SparseVector{Int64, Int64})
   @ Base .\array.jl:2662
 [2] top-level scope
   @ REPL[23]:1

julia> x
8-element SparseVector{Int64, Int64} with 6 stored entries:
  [1]  =  0
  [2]  =  2
  [3]  =  0
  [4]  =  0
  [5]  =  4
  [7]  =  4
jakobnissen commented 6 months ago

Is this really a bug? This issue can happen whenever you have a generic function that temporarily causes its arguments to be in an invalid state - because it's always a possibility that the later operations that restore the correct state to the argument throws an error. So there are only two possibilities, really:

  1. Never write generic methods that can cause an argument to be in an invalid state temporarily
  2. Accept that after an exception is thrown in a function, you can have an invalid state

The first option is quite extreme and has far-reaching implications.

CameronBieganek commented 6 months ago

I don't know if there is a generally accepted meaning for filtering a sparse vector/array, but the two options that seem reasonable are 1) zero out the entries for which the predicate returns false, or 2) return a dense array with the entries for which the predicate returns true. It turns out that even filter does not do either of the two reasonable things:

julia> x = sparsevec(1:2:7, 1:4, 8)
8-element SparseVector{Int64, Int64} with 4 stored entries:
  [1]  =  1
  [3]  =  2
  [5]  =  3
  [7]  =  4

julia> filter(iseven, x)
6-element SparseVector{Int64, Int64} with 2 stored entries:
  [2]  =  2
  [5]  =  4

Note that the indices get messed up and the sparse vector changes from length 8 to length 6.

generic function

The root issue here is the underspecification of interfaces in Julia. The filter and filter! functions are not as generic as they claim to be:

julia> x = sparsevec(1:2:7, 1:4, 8);

julia> @which filter(iseven, x)
filter(f, a::AbstractArray)
     @ Base array.jl:2617

julia> @which filter!(iseven, x)
filter!(f, a::AbstractVector)
     @ Base array.jl:2651

They do not work for arbitrary instances of AbstractVector. (This is how I came across this issue---I've been exploring interface questions.)

Perhaps it would make sense to add ::SparseVector methods for filter and filter!? That doesn't solve the interface issue, but at least it covers one more AbstractVector case.

CameronBieganek commented 6 months ago

Is this really a bug? This issue can happen whenever you have a generic function that temporarily causes its arguments to be in an invalid state - because it's always a possibility that the later operations that restore the correct state to the argument throws an error.

To more specifically respond to your question, I would expect the following to work, but it does not:

julia> x = sparsevec(1:2:7, 1:4, 8)
8-element SparseVector{Int64, Int64} with 4 stored entries:
  [1]  =  1
  [3]  =  2
  [5]  =  3
  [7]  =  4

julia> x[3]
2

julia> try
           filter!(iseven, x)
       catch
       end

julia> x[3]
0

But perhaps that expectation is unreasonable. 🤷‍♂️ Maybe I just have to accept the fact that Julia is a procedural language with lots of side effects.

jakobnissen commented 6 months ago

Regarding filter(f, ::SparseVector), I don't understand why you would expect it to give you a dense vector, or zero out some entries. That's not how filter works, and it's not how it's documented to work. It's documented to return a copy of the collection without elements for which f returns false, and that's exactly what it does. From what I can tell, filter behaves correctly.

Regarding the try-catch thing, that's not possible to have that work in general. For that to work, there would need to be some enforcement that instances of some abstract type implements some given methods. There has been some discussion about this, but it's not part of Julia yet.

adienes commented 6 months ago

But perhaps that expectation is unreasonable. 🤷‍♂️ Maybe I just have to accept the fact that Julia is a procedural language with lots of side effects.

I mean

isn't that the whole point of distinguishing filter from filter! ? you can get the behavior you want by just

x2 = copy(x)
filter!(iseven, x2)
x = x2
CameronBieganek commented 6 months ago

It's documented to return a copy of the collection without elements for which f returns false, and that's exactly what it does.

No, it mangles the keys:

julia> x = sparsevec(1:2:7, 1:4, 8)
8-element SparseVector{Int64, Int64} with 4 stored entries:
  [1]  =  1
  [3]  =  2
  [5]  =  3
  [7]  =  4

julia> filter(iseven, x)
6-element SparseVector{Int64, Int64} with 2 stored entries:
  [2]  =  2
  [5]  =  4

Compare that to filter on a Dict, which preserves the keys:

julia> filter(p -> iseven(p[2]), Dict(:a => 1, :b => 2))
Dict{Symbol, Int64} with 1 entry:
  :b => 2

I know it's a little different because dictionaries iterate pairs and sparse vectors just iterate elements, but in my opinion a sane filter on a sparse vector should preserve the keys.

The filter docstring says

Return a copy of collection a, removing elements for which f is false.

But that's a fundamentally undefined sentence in Julia, because there is no definition of collection. The object returned by filter(iseven, x) is a pretty useless "copy" of the sparse vector x, since the indices of the values have changed.

I thought I was an experienced Julia programmer, but apparently I have not been admitted to the enlightened circle of Julia developers who understand the true meaning of Julia's vague docstrings.

you can get the behavior you want by just

x2 = copy(x)
filter!(iseven, x2)
x = x2

That obviously doesn't work:

julia> x = sparsevec(1:2:7, 1:4, 8)
8-element SparseVector{Int64, Int64} with 4 stored entries:
  [1]  =  1
  [3]  =  2
  [5]  =  3
  [7]  =  4

julia> x2 = copy(x); filter!(iseven, x2)
ERROR: MethodError: no method matching deleteat!(::SparseVector{Int64, Int64}, ::UnitRange{Int64})

Closest candidates are:
  deleteat!(::Vector, ::AbstractUnitRange{<:Integer})
   @ Base array.jl:1525
  deleteat!(::Vector, ::AbstractVector)
   @ Base array.jl:1567
  deleteat!(::Vector, ::Any)
   @ Base array.jl:1566
  ...

Stacktrace:
 [1] filter!(f::typeof(iseven), a::SparseVector{Int64, Int64})
   @ Base .\array.jl:2662
 [2] top-level scope
   @ REPL[29]:1

julia> x = x2; x
8-element SparseVector{Int64, Int64} with 6 stored entries:
  [1]  =  0
  [2]  =  2
  [3]  =  0
  [4]  =  0
  [5]  =  4
  [7]  =  4
jakobnissen commented 6 months ago

The indices of the values change, because the removed elements are removed, not set to zero. If you remove e.g. the element at index 2 from a vector, then what was previously at element 3 will move down to be in position 2. This is the same as an ordinary Vector. A sparse vector is not like a Dict - it's rather an alternative implementation of something like Vector. If the indices were preserved, then no elements would have been removed, which would have been wrong according to the docstring. In other words, the elements not printed when you display the sparse vector are not missing elements. They exist, but are zero. This is different from a Dict.

adienes commented 6 months ago

That obviously doesn't work:

I stand corrected -- I should have tried it before posting

but apparently I have not been admitted to the enlightened circle of Julia developers who understand the true meaning of Julia's vague docstrings.

I apologize for having implied any such thing

w.r.t. to the example at hand

julia> x = sparsevec(1:2:7, 1:4, 8)
8-element SparseVector{Int64, Int64} with 4 stored entries:
  [1]  =  1
  [3]  =  2
  [5]  =  3
  [7]  =  4

julia> filter(iseven, x)
6-element SparseVector{Int64, Int64} with 2 stored entries:
  [2]  =  2
  [5]  =  4

I agree this is pretty confusing, and I had to do a double take. although after staring at it a bit it does seem correct. what was tripping me up was forgetting that of course in this case the iseven predicate includes the zeros. The behavior seems a little bit more clear when I make the predicate exclude zeros

julia> filter(v -> iseven(v) && !iszero(v), x)
2-element SparseVector{Int64, Int64} with 2 stored entries:
  [1]  =  2
  [2]  =  4
mbauman commented 6 months ago

I agree this is a bug — it's not just sparse, this method will corrupt all vectors that are mutable but not mutably-sized. And SparseVectors do not have a mutable length, so they cannot implement this even if they wanted to (they're structs).

There are three classes of functions:

I may have missed a few, but this isn't an abstract "yeah, anything can fail in the middle" sort of thing. We know that many AbstractVectors are mutable but not mutable-length. I think all these methods should check for applicability of what they expect to use to resize! before doing any preparatory work.

CameronBieganek commented 6 months ago

The indices of the values change, because the removed elements are removed, not set to zero. If you remove e.g. the element at index 2 from a vector, then what was previously at element 3 will move down to be in position 2. This is the same as an ordinary Vector.

Ok, I can agree with this logic, which explains the filter behavior, but I still think that the filter! issue is due to the fact that the AbstractVector interface is too coarse to represent the kinds of vectors for which the current implementation actually works. (What I mean is not that the AbstractVector interface needs more mandatory methods added to it, but that in an ideal world we would have a sub-interface, with additional methods in it, that describes the expected interface for vector-like arguments to filter!.)

I apologize for having implied any such thing

Sorry, I was being a bit hyperbolic, and I was not implicating you. 😉