JuliaSparse / SparseArrays.jl

SparseArrays.jl is a Julia stdlib
https://sparsearrays.juliasparse.org/

Regression with arithmetic operations on sparse matrices #57

Open jebej opened 7 years ago

jebej commented 7 years ago

Possibly related to JuliaLang/julia#21370:

On 0.5.1:

julia> A = sprand(10,10,0.2);

julia> @benchmark 3.0*A
BenchmarkTools.Trial:
  memory estimate:  672 bytes
  allocs estimate:  4
  --------------
  minimum time:     139.775 ns (0.00% GC)
  median time:      153.360 ns (0.00% GC)
  mean time:        174.615 ns (9.75% GC)
  maximum time:     1.132 μs (79.94% GC)
  --------------
  samples:          10000
  evals/sample:     870

julia> @benchmark 3.0+A
BenchmarkTools.Trial:
  memory estimate:  1.75 KiB
  allocs estimate:  2
  --------------
  minimum time:     237.948 ns (0.00% GC)
  median time:      267.440 ns (0.00% GC)
  mean time:        303.627 ns (8.86% GC)
  maximum time:     1.451 μs (66.51% GC)
  --------------
  samples:          10000
  evals/sample:     464

julia> @benchmark A/3.0
BenchmarkTools.Trial:
  memory estimate:  672 bytes
  allocs estimate:  4
  --------------
  minimum time:     187.120 ns (0.00% GC)
  median time:      197.874 ns (0.00% GC)
  mean time:        220.143 ns (8.00% GC)
  maximum time:     1.349 μs (81.50% GC)
  --------------
  samples:          10000
  evals/sample:     723

On 0.6:

julia> A = sprand(10,10,0.2);

julia> @benchmark 3.0*A
BenchmarkTools.Trial:
  memory estimate:  720 bytes
  allocs estimate:  7
  --------------
  minimum time:     242.706 ns (0.00% GC)
  median time:      261.936 ns (0.00% GC)
  mean time:        296.619 ns (9.39% GC)
  maximum time:     3.182 μs (80.41% GC)
  --------------
  samples:          10000
  evals/sample:     469

julia> @benchmark 3.0+A
BenchmarkTools.Trial:
  memory estimate:  2.02 KiB
  allocs estimate:  7
  --------------
  minimum time:     618.483 ns (0.00% GC)
  median time:      659.125 ns (0.00% GC)
  mean time:        727.738 ns (7.16% GC)
  maximum time:     5.575 μs (81.81% GC)
  --------------
  samples:          10000
  evals/sample:     176

julia> @benchmark A/3.0
BenchmarkTools.Trial:
  memory estimate:  720 bytes
  allocs estimate:  7
  --------------
  minimum time:     281.785 ns (0.00% GC)
  median time:      303.701 ns (0.00% GC)
  mean time:        339.489 ns (8.10% GC)
  maximum time:     4.909 μs (86.33% GC)
  --------------
  samples:          10000
  evals/sample:     298
Julia Version 0.6.0-pre.beta.253
Commit 6e70552* (2017-04-22 13:14 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, sandybridge)
ViralBShah commented 7 years ago

We have had several reports of sparse matrix regressions. Probably worth taking a good look at what is going on and expanding the sparse matrix benchmarks in test/perf.

tkelman commented 7 years ago

test/perf is not especially useful; it very rarely gets run. BaseBenchmarks.jl is where we should add things so that nanosoldier tracks them.

Sacha0 commented 7 years ago

Results are similar when interpolating A into the benchmarks:

julia> versioninfo()
Julia Version 0.5.1
Commit 6445c82* (2017-03-05 13:25 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, ivybridge)

julia> A = sprand(10, 10, 0.2);

julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     849
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  736.00 bytes
  allocs estimate:  4
  minimum time:     146.00 ns (0.00% GC)
  median time:      166.00 ns (0.00% GC)
  mean time:        215.95 ns (15.66% GC)
  maximum time:     2.27 μs (88.80% GC)

julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     282
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  1.75 kb
  allocs estimate:  2
  minimum time:     290.00 ns (0.00% GC)
  median time:      319.00 ns (0.00% GC)
  mean time:        390.04 ns (10.10% GC)
  maximum time:     4.27 μs (0.00% GC)

julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     685
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  736.00 bytes
  allocs estimate:  4
  minimum time:     186.00 ns (0.00% GC)
  median time:      206.00 ns (0.00% GC)
  mean time:        260.94 ns (13.07% GC)
  maximum time:     2.91 μs (85.06% GC)

and

julia> versioninfo()
Julia Version 0.6.0-pre.beta.305
Commit f56147d* (2017-04-24 18:38 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin15.6.0)
  CPU: Intel(R) Core(TM) i7-3520M CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, ivybridge)

julia> A = sprand(10, 10, 0.2);

julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
  memory estimate:  784 bytes
  allocs estimate:  7
  --------------
  minimum time:     238.269 ns (0.00% GC)
  median time:      258.608 ns (0.00% GC)
  mean time:        329.116 ns (12.51% GC)
  maximum time:     4.580 μs (86.26% GC)
  --------------
  samples:          10000
  evals/sample:     438

julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
  memory estimate:  2.02 KiB
  allocs estimate:  7
  --------------
  minimum time:     644.115 ns (0.00% GC)
  median time:      707.595 ns (0.00% GC)
  mean time:        858.236 ns (8.50% GC)
  maximum time:     8.213 μs (80.75% GC)
  --------------
  samples:          10000
  evals/sample:     174

julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
  memory estimate:  784 bytes
  allocs estimate:  7
  --------------
  minimum time:     256.019 ns (0.00% GC)
  median time:      281.244 ns (0.00% GC)
  mean time:        355.654 ns (11.75% GC)
  maximum time:     5.543 μs (89.03% GC)
  --------------
  samples:          10000
  evals/sample:     362

Also note that discrepancies persist in the multiplication and division cases for more reasonably sized (though still small) sparse matrices. On 0.5.1:

julia> A = sprand(1000, 1000, 0.01);

julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  167.58 kb
  allocs estimate:  6
  minimum time:     16.95 μs (0.00% GC)
  median time:      25.26 μs (0.00% GC)
  mean time:        50.63 μs (29.92% GC)
  maximum time:     3.43 ms (77.29% GC)

julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
  samples:          795
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  15.26 mb
  allocs estimate:  4
  minimum time:     4.10 ms (0.00% GC)
  median time:      6.38 ms (28.68% GC)
  mean time:        6.28 ms (27.32% GC)
  maximum time:     9.35 ms (42.09% GC)

julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
  samples:          10000
  evals/sample:     1
  time tolerance:   5.00%
  memory tolerance: 1.00%
  memory estimate:  167.58 kb
  allocs estimate:  6
  minimum time:     28.98 μs (0.00% GC)
  median time:      36.61 μs (0.00% GC)
  mean time:        61.14 μs (22.96% GC)
  maximum time:     2.96 ms (75.51% GC)

and master

julia> A = sprand(1000, 1000, 0.01);

julia> @benchmark 3.0 * $A
BenchmarkTools.Trial:
  memory estimate:  167.75 KiB
  allocs estimate:  9
  --------------
  minimum time:     28.489 μs (0.00% GC)
  median time:      37.700 μs (0.00% GC)
  mean time:        63.382 μs (23.70% GC)
  maximum time:     3.182 ms (97.13% GC)
  --------------
  samples:          10000
  evals/sample:     1

julia> @benchmark 3.0 + $A
BenchmarkTools.Trial:
  memory estimate:  15.27 MiB
  allocs estimate:  9
  --------------
  minimum time:     3.022 ms (0.00% GC)
  median time:      5.242 ms (33.51% GC)
  mean time:        5.098 ms (31.28% GC)
  maximum time:     76.732 ms (94.91% GC)
  --------------
  samples:          978
  evals/sample:     1

julia> @benchmark $A / 3.0
BenchmarkTools.Trial:
  memory estimate:  167.75 KiB
  allocs estimate:  9
  --------------
  minimum time:     40.635 μs (0.00% GC)
  median time:      50.794 μs (0.00% GC)
  mean time:        78.106 μs (19.19% GC)
  maximum time:     3.396 ms (95.61% GC)
  --------------
  samples:          10000
  evals/sample:     1
Sacha0 commented 7 years ago

The overarching reason for these performance changes: In 0.5, arithmetic operations between sparse matrices and numbers translate to equivalent dot operations, e.g. 3.0 + A translates to 3.0 .+ A. Each such dot operation has a dedicated specialization, e.g. 3.0 .+ A is implemented as 3.0 .+ full(A). In 0.6, however, all arithmetic operations between sparse matrices and numbers translate to equivalent broadcast calls and thereby hit generic sparse broadcast. Consequently both semantics and performance have changed.

Note that the former specializations were extremely simple and not universally correct. For example, the specializations that the multiplication and division examples above hit just applied the corresponding operations to the sparse matrix's nonzero vector, yielding incorrect results in corner cases such as the scalar being Inf, NaN, 0, et cetera. Similarly, as noted above the addition example formerly fell back to dense methods.
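
A minimal sketch of that corner case (hypothetical helper name, assuming current Julia with using SparseArrays): applying the scalar only to the stored values leaves structural zeros untouched, which gives the wrong answer when the scalar is NaN or Inf.

using SparseArrays

# Mirrors the shape of the 0.5-era specializations: touch only nzval.
naive_scale(A, n) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), n .* A.nzval)

A = sprand(4, 4, 0.25)
nnz(naive_scale(A, NaN))        # only the originally stored entries become NaN
count(isnan, NaN .* Matrix(A))  # 16: elementwise semantics make every entry NaN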

Given the semantic changes, I'm not certain the performance comparison is particularly meaningful. (That said, some tweaks to generic sparse broadcast I have in mind might improve performance --- though by a couple tens of percent at best I wager). Best!

jebej commented 7 years ago

Couldn't we have something like this for multiplication and division:

julia> sp_mul_number(A,n) = SparseMatrixCSC(A.m,A.n,A.colptr,A.rowval,n*A.nzval)
sp_mul_number (generic function with 1 method)

julia> @benchmark 3.26*$A
BenchmarkTools.Trial:
  memory estimate:  3.67 KiB
  allocs estimate:  7
  --------------
  minimum time:     551.559 ns (0.00% GC)
  median time:      614.414 ns (0.00% GC)
  mean time:        685.501 ns (8.68% GC)
  maximum time:     3.800 μs (68.65% GC)
  --------------
  samples:          10000
  evals/sample:     186

julia> @benchmark sp_mul_number($A,3.26)
BenchmarkTools.Trial:
  memory estimate:  48 bytes
  allocs estimate:  1
  --------------
  minimum time:     51.266 ns (0.00% GC)
  median time:      51.561 ns (0.00% GC)
  mean time:        53.618 ns (2.20% GC)
  maximum time:     776.661 ns (83.50% GC)
  --------------
  samples:          10000
  evals/sample:     992

julia> isequal(3.26*A,sp_mul_number(A,3.26))
true
Sacha0 commented 7 years ago

That definition is similar in spirit to the above-mentioned specializations existing in 0.5:

(.*)(A::SparseMatrixCSC, B::Number) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), A.nzval .* B)
(.*)(A::Number, B::SparseMatrixCSC) = SparseMatrixCSC(B.m, B.n, copy(B.colptr), copy(B.rowval), A .* B.nzval)
(./)(A::SparseMatrixCSC, B::Number) = SparseMatrixCSC(A.m, A.n, copy(A.colptr), copy(A.rowval), A.nzval ./ B)

(You need to copy the colptr and rowval arrays to avoid incorrect aliasing.) Unfortunately such definitions suffer from the semantic issues also mentioned above (e.g. failure in corner cases). Best!
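
A sketch of the aliasing hazard those copies avoid (hypothetical helper, assuming current Julia with using SparseArrays): without the copies, the result shares its structure arrays with A, so inserting an entry into one corrupts the other.

using SparseArrays

alias_scale(A, n) = SparseMatrixCSC(A.m, A.n, A.colptr, A.rowval, n .* A.nzval)  # no copies

A = sparse([1], [1], [1.0], 2, 2)  # a single stored entry
B = alias_scale(A, 2.0)
B[2, 2] = 5.0                      # inserts into the shared rowval and updates the shared colptr
nnz(A), length(A.nzval)            # (2, 1): A's structure no longer matches its values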

jebej commented 7 years ago

Right, sorry I hadn't understood your reply.

What would be the correct way to deal with the number being infinite or NaN (etc)? What is broadcast now doing?

I'm assuming that simply coding in the behaviour for these special cases would not work.

Sacha0 commented 7 years ago

Right, sorry I hadn't understood your reply.

No worries! :)

What would be the correct way to deal with the number being infinite or NaN (etc)? What is broadcast now doing?

broadcast returns the result of performing the scalar arithmetic operation elementwise.
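
A minimal illustration of those semantics (assuming current Julia with using SparseArrays): broadcast applies the operation to structural zeros as well, and stores whatever that produces.

using SparseArrays

A = spzeros(2, 2)
nnz(broadcast(+, 3.0, A))  # 4: 3.0 + 0.0 is nonzero, so every entry is stored
nnz(broadcast(*, 3.0, A))  # 0: 3.0 * 0.0 is zero, so the result stays empty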

I'm assuming that simply coding in the behaviour for these special cases would not work.

Writing specializations of one form or another for each such operation is of course possible. But the net result would be a collection of ad hoc methods closely resembling both each other and sparse broadcast (avoiding which was part of the purpose of the changes between 0.5 and 0.6 relevant to this issue). And how much performance improvement those specializations might yield (while retaining the correct semantics) is not clear. Chances are effort would be better spent refining sparse broadcast.

For example, improving the mechanism that handles scalars in sparse broadcast could yield ~20% runtime reduction on these benchmarks:

julia> @benchmark broadcast(x -> 3.0 * x, $A)
BenchmarkTools.Trial:
  memory estimate:  896 bytes
  allocs estimate:  4
  --------------
  minimum time:     225.231 ns (0.00% GC)
  median time:      239.024 ns (0.00% GC)
  mean time:        297.230 ns (11.20% GC)
  maximum time:     3.230 μs (84.41% GC)
  --------------
  samples:          10000
  evals/sample:     455

julia> @benchmark broadcast(*, 3.0, $A)
BenchmarkTools.Trial:
  memory estimate:  944 bytes
  allocs estimate:  7
  --------------
  minimum time:     279.579 ns (0.00% GC)
  median time:      295.338 ns (0.00% GC)
  mean time:        374.913 ns (11.91% GC)
  maximum time:     5.898 μs (85.46% GC)
  --------------
  samples:          10000
  evals/sample:     318

julia> @benchmark broadcast(x -> x / 3.0, $A)
BenchmarkTools.Trial:
  memory estimate:  896 bytes
  allocs estimate:  4
  --------------
  minimum time:     263.210 ns (0.00% GC)
  median time:      280.946 ns (0.00% GC)
  mean time:        342.693 ns (9.78% GC)
  maximum time:     3.941 μs (74.48% GC)
  --------------
  samples:          10000
  evals/sample:     372

julia> @benchmark broadcast(/, $A, 3.0)
BenchmarkTools.Trial:
  memory estimate:  944 bytes
  allocs estimate:  7
  --------------
  minimum time:     300.787 ns (0.00% GC)
  median time:      320.508 ns (0.00% GC)
  mean time:        400.012 ns (10.91% GC)
  maximum time:     6.291 μs (86.89% GC)
  --------------
  samples:          10000
  evals/sample:     258

Best!

andreasnoack commented 7 years ago

@Sacha0 Would it be possible to get back to 0.5 performance if structural zeros didn't follow IEEE in broadcast?

jebej commented 7 years ago

@Sacha0 how is scaling the vector of nonzero values different than applying the same operation elementwise?

If multiplying by zero, all the nonzero values become 0 and the resulting sparse matrix is empty. If multiplying by infinity, nonzero values become infinite. The only case where I can see a difference would be NaN, where the whole matrix should become filled with NaNs. Is that correct?

martinholters commented 7 years ago

If multiplying by zero, all the nonzero values become 0 and the resulting sparse matrix is empty.

Not quite: Inf and NaN entries become NaN.

The only case where I can see a difference would be NaN, where the whole matrix should become filled with NaNs.

Also when multiplying by Inf, zero becomes NaN, so the whole matrix will be filled with a mixture of Infs and NaNs.
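
A short illustration of the scalar facts behind these corrections (plain IEEE arithmetic, nothing sparse-specific):

0.0 * Inf  # NaN, so multiplying A by zero does not simply empty the matrix
Inf * 0.0  # NaN, so multiplying A by Inf turns structural zeros into NaN
NaN * 0.0  # NaN, so multiplying A by NaN fills the whole matrix with NaN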

Sacha0 commented 7 years ago

Would it be possible to get back to 0.5 performance if structural zeros didn't follow IEEE in broadcast?

Without also nixing the other semantic changes? My (approximately meaningless) guess is no: Outperforming [uniformly scaling a vector] while [doing more than uniformly scaling a vector] seems tricky :wink:.

It might be worth noting that, insofar as I am aware, these performance regressions are tightly confined, specifically to operations that have (over-)aggressive specializations in 0.5 (e.g. * and / between pairs of numbers and sparse matrices); outside such cases, 0.6's performance should meet or best 0.5's performance.

Additionally, using dot ops directly (rather than via * and /) in the benchmarks above for * and / closes most of the gap. To illustrate, on 0.5

julia> median(@benchmark 3.0 * $A)
BenchmarkTools.TrialEstimate:
  time:             161.932 ns
  gctime:           0.000 ns (0.00%)
  memory:           640 bytes
  allocs:           4

while on 0.6

julia> median(@benchmark 3.0 .* $A)
BenchmarkTools.TrialEstimate:
  time:             189.452 ns
  gctime:           0.000 ns (0.00%)
  memory:           640 bytes
  allocs:           4

julia> median(@benchmark 3.0 * $A)
BenchmarkTools.TrialEstimate:
  time:             233.724 ns
  gctime:           0.000 ns (0.00%)
  memory:           688 bytes
  allocs:           7

Furthermore, if you make the operation even marginally more complex, 0.6 should win. To illustrate, on 0.5

julia> median(@benchmark 3.0 .* $A .* 3.0)
BenchmarkTools.TrialEstimate:
  time:             336.352 ns
  gctime:           0.000 ns (0.00%)
  memory:           1.25 KiB
  allocs:           8

while on 0.6

julia> median(@benchmark 3.0 .* $A .* 3.0)
BenchmarkTools.TrialEstimate:
  time:             192.012 ns
  gctime:           0.000 ns (0.00%)
  memory:           640 bytes
  allocs:           4

And there still exist opportunities to whittle away at any such gaps without changing semantics. Best!

andreasnoack commented 7 years ago

using dot ops directly (rather than via * and /) in the benchmarks above for * and / closes most of the gap

This is because of the literal. With a variable, it doesn't matter. I'm not sure why the two argument broadcast with a scalar would be slower than the one argument broadcast.

Sacha0 commented 7 years ago

This is because of the literal.

Yes, exactly :). Sparse broadcast presently handles scalars by: (1) capturing scalars in a closure (rather, a series of nested closures) while collecting the non-scalar arguments; and (2) calling sparse broadcast to apply that closure to the collected non-scalar arguments. For reasons that remain somewhat mysterious to me, introducing such closures degrades performance somewhat. Inlining the numeric literal avoids going through the closure-generating code, so using dot ops achieves better performance as shown above. (The relevant code is here, and the ins and outs of incorporating scalars via this approach were discussed at length in the pull request and linked prior threads.)
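
A rough sketch of that mechanism (illustrative names only, not the actual Base internals), assuming current Julia with using SparseArrays:

using SparseArrays

A = sprand(10, 10, 0.2)
f = *             # the broadcasted operation
s = 3.0           # the scalar argument
g = x -> f(s, x)  # scalar captured in a closure
broadcast(g, A)   # what broadcast(f, s, A) effectively reduces to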

If you could shed light on the closure-associated performance issues, I would much appreciate your insight!

With a variable, it doesn't matter.

Though not in the case of s .* A or s * A with s = 3.0, it does matter in even marginally more complex cases. For example, on 0.5

julia> @benchmark $s .* $A .* $s
BenchmarkTools.Trial:
  memory estimate:  1.38 KiB
  allocs estimate:  8
  --------------
  minimum time:     313.444 ns (0.00% GC)
  median time:      339.560 ns (0.00% GC)
  mean time:        429.646 ns (14.17% GC)
  maximum time:     5.812 μs (89.75% GC)
  --------------
  samples:          10000
  evals/sample:     252

julia> @benchmark $s * $A * $s
BenchmarkTools.Trial:
  memory estimate:  1.38 KiB
  allocs estimate:  8
  --------------
  minimum time:     319.040 ns (0.00% GC)
  median time:      342.202 ns (0.00% GC)
  mean time:        432.209 ns (14.11% GC)
  maximum time:     6.319 μs (89.28% GC)
  --------------
  samples:          10000
  evals/sample:     247

while on 0.6

julia> @benchmark $s .* $A .* $s
BenchmarkTools.Trial:
  memory estimate:  672 bytes
  allocs estimate:  7
  --------------
  minimum time:     266.410 ns (0.00% GC)
  median time:      284.621 ns (0.00% GC)
  mean time:        346.397 ns (10.20% GC)
  maximum time:     5.074 μs (88.84% GC)
  --------------
  samples:          10000
  evals/sample:     383

julia> @benchmark $s * $A * $s
BenchmarkTools.Trial:
  memory estimate:  1.28 KiB
  allocs estimate:  14
  --------------
  minimum time:     435.910 ns (0.00% GC)
  median time:      467.676 ns (0.00% GC)
  mean time:        576.297 ns (11.24% GC)
  maximum time:     9.801 μs (80.05% GC)
  --------------
  samples:          10000
  evals/sample:     199

I'm not sure why the two argument broadcast with a scalar would be slower than the one argument broadcast.

Were the closure mechanism above zero-cost, then the former and latter should be comparable (given the former wraps the scalar into a closure and then calls the latter.) Any insight would be much appreciated :). Best!

mfalt commented 7 years ago

I am assuming that this is related (otherwise I can open a new issue), but it seems like structural zeros are not preserved in some cases, for example in scale2! of the two functions below.

function scale1!(A,b)
    A .= A .* (1.0 ./ b)
end

function scale2!(A,b)
    tmp = 1.0 ./ b
    A .= A .* tmp
end
julia> A0 = sprandn(100,100000,0.001);

julia> b = randn(100);

julia> A = copy(A0);

julia> @time nnz(scale1!(A,b))
  0.433009 seconds (225.66 k allocations: 167.108 MiB, 16.55% gc time)
10000000 #Note full matrix!!!

julia> A = copy(A0);

julia> @benchmark scale1!(A,b)
BenchmarkTools.Trial: 
  memory estimate:  6.44 KiB
  allocs estimate:  11
  --------------
  minimum time:     55.652 ms (0.00% GC)
  median time:      56.000 ms (0.00% GC)
  mean time:        56.305 ms (0.00% GC)
  maximum time:     61.173 ms (0.00% GC)
  --------------
  samples:          89
  evals/sample:     1

julia> A = copy(A0);

julia> @time nnz(scale2!(A,b))
  0.034621 seconds (15.75 k allocations: 729.528 KiB)
10028

julia> A = copy(A0);

julia> @benchmark scale2!(A,b)
BenchmarkTools.Trial: 
  memory estimate:  8.34 KiB
  allocs estimate:  36
  --------------
  minimum time:     13.837 ms (0.00% GC)
  median time:      13.986 ms (0.00% GC)
  mean time:        14.674 ms (0.00% GC)
  maximum time:     21.433 ms (0.00% GC)
  --------------
  samples:          341
  evals/sample:     1

I was not able to understand why @benchmark reports good values when the code is slow in practice. The issue is not related to the assignment to A; a non-in-place version has the same drawback, but the performance loss is not as significant because of the extra allocations.

A related problem, maybe even worse (if possible), is that

function scale1!(A,b)
    A .*= (1.0 ./ b)
end

will result in a full NaN matrix!

Ran on

Julia Version 0.6.0-pre.beta.397
Commit 991b581* (2017-04-28 08:52 UTC)
Platform Info:
  OS: Linux (x86_64-redhat-linux)
  CPU: Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, haswell)

Edit: Verified on the latest version of Julia

Julia Version 0.7.0-DEV.2
Commit b6e4e70* (2017-05-02 15:35 UTC)

with the same behavior, although slightly less significant time-wise.

Sacha0 commented 7 years ago

it seems like structural zeros are not preserved in some cases, for example in scale2! of the two functions below.

Did you mean scale1!? (scale2! preserves structural zeros, whereas scale1! does not.) Simpler illustration of the difference in behavior:

julia> A = sprandn(10, 10, 0.1); nnz(A)
9

julia> b = randn(10);

julia> nnz(A ./ b) # like scale1!
100

julia> nnz(A .* b) # like scale2!
9

Explanation: / is not strictly zero-preserving, i.e. 0/0 == NaN != 0. Generic sparse broadcast yields a dense matrix in sparse form for functions that are not strictly zero-preserving. On the other hand, * is strictly zero-preserving, i.e. 0*0 == 0. Generic sparse broadcast yields a truly sparse result for functions that are strictly zero-preserving.
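
The distinction can be probed by evaluating the function on zero inputs (a sketch of the idea, not the actual broadcast internals):

iszero(0.0 * 0.0)  # true:  broadcast(*, A, b) can keep a truly sparse result
iszero(0.0 / 0.0)  # false: broadcast(/, A, b) stores every entry of the result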

In essence you've hit the issues I mention in the second section of https://github.com/JuliaLang/julia/pull/18590#issuecomment-249401774 and again in https://github.com/JuliaLang/julia/pull/18590#issuecomment-249447622.

I was not able to understand why @benchmark reports good values when the code is slow in practice.

Not certain I follow --- could you clarify? :)

will result in a full NaN matrix!

Yes, the issue with e.g. A ./= b is that the input arguments alias the output argument, and sparse broadcast does not yet guard against madness ensuing in that case. (Opened an issue to track that behavior: JuliaLang/julia#21693.)

Best!

mfalt commented 7 years ago

Did you mean scale1!? (scale2! preserves structural zeros, whereas scale1! does not.)

Yes, sorry.

Simpler illustration of the difference in behavior:

Sure, but part of the point was to also show that the expression A .* (1.0 ./ b) will not give the same result as using a temp variable, even though it should in some sense be equivalent.

Explanation: / is not strictly zero-preserving, i.e. 0/0 == NaN != 0. Generic sparse broadcast yields a dense matrix in sparse form for functions that are not strictly zero-preserving. On the other hand, * is strictly zero-preserving, i.e. 0*0 == 0. Generic sparse broadcast yields a truly sparse result for functions that are strictly zero-preserving.

Thanks for the explanation; I was able to figure out that this is probably what's being done, but the code is not trivial to read. My understanding is that the whole resulting matrix is NaN filled and that everything that is not a result of / between structural zeros is calculated and updated. Is there a good reason why we are allocating and filling the whole result with NaN instead of only the parts corresponding to structural zeros (in this case structural zeros in b)? I understand the reasoning behind assuming that most elements in sparse arrays are zeros, for efficiency reasons, but then I don't think that a dense b should be converted to sparse and run through this method.

mfalt commented 7 years ago

I was not able to understand why @benchmark reports good values when the code is slow in practice.

Not certain I follow --- could you clarify? :)

Sure, what I meant is that @benchmark estimates approximately a factor 4 slowdown and a reduction of memory usage for scale1! compared to scale2!, but @time is able to measure the actual slowdown of a factor 10+ and the increase in memory allocation of a factor 200+.

martinholters commented 7 years ago

Might be related to benchmarking at global scope, which generally is to be avoided.

Sacha0 commented 7 years ago

My understanding is that the whole resulting matrix is NaN filled and that everything that is not a result of / between structural zeros is calculated and updated.

This understanding is correct :).

Is there a good reason why we are allocating and filling the whole result with NaN instead of only the parts corresponding to structural zeros (in this case structural zeros in b)? I understand the reasoning behind assuming that most elements in sparse arrays are zeros, for efficiency reasons, but then I don't think that a dense b should be converted to be sparse and run through this method.

The tl;dr is that (1) these methods were primarily designed for combinations involving sparse/structured vectors/matrices and scalars, (2) the mechanisms allowing combinations including dense vectors/matrices were something of an afterthought, and (3) consequently these methods are suboptimal for combinations including dense vectors/matrices. For such combinations, substantial room for improvement exists :).

Sure, what I meant is that @benchmark estimates approximately a factor 4 slowdown and a reduction of memory usage for scale1! compared to scale2!, but @time is able to measure the actual slowdown of a factor 10+ and the increase in memory allocation of a factor 200+.

Benchmarking is tricky :). Note that (1) as used above, @time also captures compilation time and memory consumption; (2) as martinholters mentioned, benchmarking at global scope is problematic; (3) the operations (and so timings) above are suspect due to the input-output aliasing; and (4) on the first call of scale?!(A, b) after the A = copy(A0) calls above, A gets reallocated to accommodate worst-case storage needs, but such reallocation does not necessarily occur on subsequent calls (A having previously been expanded), so @time captures that reallocation cost whereas @benchmark does not necessarily.
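
A sketch of points (1) and (2) (hypothetical session, with scale2!, A0, and b as defined earlier in the thread, and using BenchmarkTools):

using BenchmarkTools

A = copy(A0)
@time scale2!(A, b)         # first call: includes compilation time and allocations
@time scale2!(A, b)         # later calls: closer to the steady-state cost
@benchmark scale2!($A, $b)  # interpolating globals avoids the global-scope overhead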

Best!

mfalt commented 7 years ago

For such combinations, substantial room for improvement exists :).

I see. Hopefully I will be able to find some time to contribute more to Julia in the near future; I have several issues related to sparse matrices and factorization that I will try to take a stab at when I find the time, and your explanations are very useful for helping me understand the language better.

Benchmarking is tricky :)

Yes, I actually took several steps to create more reliable results, which I didn't show. I really think that the actual performance is closer to the one reported by @time (although I might be wrong).

(1) as used above, @time also captures compilation time and memory consumption;

I did run @time a couple of times to avoid this, and posted the result after a few runs.

(2) as martinholters mentioned, benchmarking at global scope is problematic; (3) the operations (and so timings) above are suspect due to the input-output aliasing;

Could you elaborate? I did put the main code in functions; I thought this would remove most of the problems.

(4) on the first call

See (1)

Note also that for some reason @benchmark ran about 4 times as many samples on scale2!; I'm guessing this is closely related to the extra time it takes? Anyway, I think the focus should be on how we can make the code better, and not on how the benchmark was performed :)

Sacha0 commented 7 years ago

hopefully I will be able to find some time to contribute more to Julia in the near future; I have several issues related to sparse matrices and factorization that I will try to take a stab at when I find the time

That would be great! :)

Could you elaborate? I did put the main code in functions; I thought this would remove most of the problems.

Ref. the relevant section of the BenchmarkTools.jl manual.

Best!

jebej commented 7 years ago

Just got hit by the literal vs. variable discrepancy. This issue, along with JuliaLang/julia#21693, makes me think that for now dot ops should really just throw an error instead of silently giving the wrong result...

julia> X = sprand(10,0.1)
10-element SparseVector{Float64,Int64} with 2 stored entries:
  [4 ]  =  0.492868
  [8 ]  =  0.0962979

julia> a=0
0

julia> X./0
10-element SparseVector{Float64,Int64} with 10 stored entries:
  [1 ]  =  NaN
  [2 ]  =  NaN
  [3 ]  =  NaN
  [4 ]  =  Inf
  [5 ]  =  NaN
  [6 ]  =  NaN
  [7 ]  =  NaN
  [8 ]  =  Inf
  [9 ]  =  NaN
  [10]  =  NaN

julia> X./a
10-element SparseVector{Float64,Int64} with 2 stored entries:
  [4 ]  =  Inf
  [8 ]  =  Inf
jebej commented 7 years ago

Argh, actually this is another issue, since the generic sparse broadcast was not done for SparseVector, and the definition still only applies the operation to nonzeros.

Sacha0 commented 7 years ago

Argh, actually this is another issue, since the generic sparse broadcast was not done for SparseVector, and the definition still only applies the operation to nonzeros.

Generic sparse broadcast handles SparseVectors. Rather, the X ./ a case in your example hits the dot-operator deprecation here, which in turn calls broadcast(/, X, a). broadcast(/, X, a) hits this set of broadcast specializations instead of generic sparse broadcast, yielding the observed behavior.

That set of methods looks like it should be removed in the interest of correctness. I will eke out time for a pull request removing those methods this weekend if possible.

Thanks for catching those stragglers @jebej! :)

jebej commented 7 years ago

Hm how do you call the generic sparse broadcast with sparse vectors?

edit: I can manually call broadcast_c to bypass the specializations.

Sacha0 commented 7 years ago

Hm how do you call the generic sparse broadcast with sparse vectors?

broadcast over combinations of sparse/structured vectors/matrices, scalars, and Vectors/Matrices should hit generic sparse broadcast (where no specializations exist for the particular combination of arguments). Please see NEWS.md for more.
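
A minimal check of that, assuming a current Julia in which those SparseVector specializations have been removed (using SparseArrays):

using SparseArrays

X = sparsevec([4, 8], [0.5, 0.1], 10)
nnz(X ./ 0)  # 10: structural zeros map to 0/0 (NaN) and are stored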

I had figured that there was no broadcast for SparseVector because the methods here act only on the nonzeros. broadcast itself is hijacked to call those nonzero-only methods.

As above (https://github.com/JuliaLang/julia/issues/21515#issuecomment-313869996), those specializations are incorrect and should be removed. Best!