All of the time is spent in the decompression.
https://github.com/JuliaDiff/FiniteDiff.jl/blob/v2.8.0/src/iteration_utils.jl#L9-L17
This is the step that walks through the compressed result and places each value into the column of the Jacobian that holds the nonzero for that row. Sparse matrix indexing is slow, so this decompression step is slow. But maybe someone has a better idea for how to iterate through this? @YingboMa @chriselrod
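For context, here is a minimal sketch of what that decompression loop amounts to (hypothetical names `decompress_naive!`, `Jc`, and `colorvec`, not the actual FiniteDiff internals):

```julia
using SparseArrays

# Hypothetical sketch of the decompression step: copy each entry of the
# compressed Jacobian `Jc` (one column per color) into the corresponding
# structural nonzero of `J`. Every `J[row, col] = ...` assignment on a
# SparseMatrixCSC performs a binary search through the column's stored
# row indices, which is what makes this loop slow.
function decompress_naive!(J::SparseMatrixCSC, Jc::AbstractMatrix, colorvec)
    rows, cols, _ = findnz(J)
    for k in eachindex(rows)
        row, col = rows[k], cols[k]
        J[row, col] = Jc[row, colorvec[col]]
    end
    return J
end
```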
Partial fix in PR https://github.com/JuliaDiff/SparseDiffTools.jl/pull/146
This adds a fast path that avoids slow sparse indexing when `J` and `sparsity` are both `SparseMatrixCSC` with the same sparsity pattern.
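Roughly, the idea is that when the structures match, one can walk the CSC buffers in order and write straight into `nonzeros(J)` with no per-entry index search. A sketch with hypothetical names, not the exact PR code:

```julia
using SparseArrays

# Hypothetical sketch of the fast path: since `J` shares its
# SparseMatrixCSC structure with the sparsity pattern, we can iterate
# `rowvals`/`nzrange` column by column and write sequentially into
# `nonzeros(J)`, avoiding the binary search that `J[row, col]` indexing
# performs on every assignment.
function decompress_fast!(J::SparseMatrixCSC, Jc::AbstractMatrix, colorvec)
    rows = rowvals(J)
    vals = nonzeros(J)
    @inbounds for col in 1:size(J, 2)
        for k in nzrange(J, col)
            vals[k] = Jc[rows[k], colorvec[col]]
        end
    end
    return J
end
```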
I now get the following ratios between the value and AD versions: 3.7 for n = 2, 4.8 for n = 1000, and 7.8 for n = 1e6 (these numbers vary somewhat between runs, presumably due to differences in the random matrices).
```
n = 2
Calling value version:
18.500 ns (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
11.600 μs (80 allocations: 4.20 KiB)
Calling sparse AD with cache
68.135 ns (0 allocations: 0 bytes)

n = 1000
Calling value version:
2.522 μs (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
231.202 μs (2570 allocations: 223.58 KiB)
Calling sparse AD with cache
12.100 μs (0 allocations: 0 bytes)

n = 1000000
Calling value version:
2.736 ms (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
285.880 ms (2999576 allocations: 218.15 MiB)
Calling sparse AD with cache
21.310 ms (0 allocations: 0 bytes)
```
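As a quick sanity check, the quoted ratios are just the cached AD time divided by the value time from the output above:

```julia
# Ratios of (cached sparse AD time) / (value version time), per n.
value_time = [18.500e-9, 2.522e-6, 2.736e-3]   # n = 2, 1000, 1000000
cached_ad  = [68.135e-9, 12.100e-6, 21.310e-3]
println(round.(cached_ad ./ value_time, digits = 1))  # [3.7, 4.8, 7.8]
```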
Fixed by PR #146. Thanks!
As mentioned in issue #136, I am observing a larger-than-expected performance gap between the AD and non-AD runtimes. Here is an MWE with a simple sparsity pattern that isolates the behavior and performs some sanity checks:
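(The full script is not reproduced here; the following is a minimal sketch of such a setup, assuming the SparseDiffTools `forwarddiff_color_jacobian!`/`ForwardColorJacCache` API and a hypothetical function name `f!` — not the exact code from the issue.)

```julia
using SparseArrays, SparseDiffTools, BenchmarkTools

# Cyclic diff: y[i] = x[i+1] - x[i], wrapping around at the end.
function f!(y, x)
    n = length(x)
    @inbounds for i in 1:n-1
        y[i] = x[i+1] - x[i]
    end
    y[end] = x[1] - x[end]
    return nothing
end

n = 1000
x = rand(n)
y = similar(x)

# Sparsity pattern: diagonal + superdiagonal plus the (n, 1) corner entry.
sparsity = spdiagm(0 => ones(n), 1 => ones(n - 1))
sparsity[n, 1] = 1.0
J = copy(sparsity)
colors = [isodd(j) ? 1 : 2 for j in 1:n]  # alternating 1,2 coloring (valid for even n)

@btime f!($y, $x)  # value version

@btime forwarddiff_color_jacobian!($J, f!, $x;
    colorvec = $colors, sparsity = $sparsity)  # sparse AD without cache

cache = ForwardColorJacCache(f!, x; colorvec = colors, sparsity = sparsity)
@btime forwarddiff_color_jacobian!($J, f!, $x, $cache)  # sparse AD with cache
```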
The function is a cyclic version of `diff`, the coloring is just 1, 2 alternating, and the Jacobian is a general sparse matrix. Filling the Jacobian should ideally cost somewhere around 2-4x the value function call from the two AD passes, since each Dual contains two floats, if we disregard cache effects and the cost of dealing with the sparse data structure. Output for n = 2, 1000 and 1,000,000:

We see that the cache seems to work: it gives an 8x or so improvement in runtime and eliminates the allocations while yielding the same output, which is good! The ratio between the value and the AD versions is 3.72 for n = 2, 17.3 for n = 1000 and 23.3 for n = 1e6, so there may be something here that is not linear with respect to the input size... or I have a bug somewhere in my setup. My `versioninfo()`:

Relevant packages: