All of the time is spent in the decompression.
https://github.com/JuliaDiff/FiniteDiff.jl/blob/v2.8.0/src/iteration_utils.jl#L9-L17
This is the step that walks through the compressed result and places each value into the column of the Jacobian that holds the nonzero for that row. Sparse matrix indexing is slow, so this decompression step is slow. But maybe someone has a better idea for how to iterate through this? @YingboMa @chriselrod
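For context, here is a minimal sketch of what that decompression loop amounts to (hypothetical names `decompress_naive!`, `Jc`, and `colorvec`, not the actual FiniteDiff internals):

```julia
using SparseArrays

# Hypothetical sketch of the decompression step: copy each entry of the
# compressed Jacobian `Jc` (one column per color) into the corresponding
# structural nonzero of `J`. Every `J[row, col] = ...` assignment on a
# SparseMatrixCSC performs a binary search through the column's stored
# row indices, which is what makes this loop slow.
function decompress_naive!(J::SparseMatrixCSC, Jc::AbstractMatrix, colorvec)
    rows, cols, _ = findnz(J)
    for k in eachindex(rows)
        row, col = rows[k], cols[k]
        J[row, col] = Jc[row, colorvec[col]]
    end
    return J
end
```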
Partial fix in PR https://github.com/JuliaDiff/SparseDiffTools.jl/pull/146
This adds a fast path that avoids slow sparse indexing when `J` and `sparsity` are both `SparseMatrixCSC` with the same sparsity pattern.
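Roughly, the idea is that when the structures match, one can walk the CSC buffers in order and write straight into `nonzeros(J)` with no per-entry index search. A sketch with hypothetical names, not the exact PR code:

```julia
using SparseArrays

# Hypothetical sketch of the fast path: since `J` shares its
# SparseMatrixCSC structure with the sparsity pattern, we can iterate
# `rowvals`/`nzrange` column by column and write sequentially into
# `nonzeros(J)`, avoiding the binary search that `J[row, col]` indexing
# performs on every assignment.
function decompress_fast!(J::SparseMatrixCSC, Jc::AbstractMatrix, colorvec)
    rows = rowvals(J)
    vals = nonzeros(J)
    @inbounds for col in 1:size(J, 2)
        for k in nzrange(J, col)
            vals[k] = Jc[rows[k], colorvec[col]]
        end
    end
    return J
end
```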
I now get the following ratios between the value and AD versions: 3.7 for n = 2, 4.8 for n = 1000, and 7.8 for n = 1e6 (these numbers vary somewhat between runs, presumably due to differences in the random matrices).
```
n = 2
Calling value version:
18.500 ns (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
11.600 μs (80 allocations: 4.20 KiB)
Calling sparse AD with cache
68.135 ns (0 allocations: 0 bytes)

n = 1000
Calling value version:
2.522 μs (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
231.202 μs (2570 allocations: 223.58 KiB)
Calling sparse AD with cache
12.100 μs (0 allocations: 0 bytes)

n = 1000000
Calling value version:
2.736 ms (0 allocations: 0 bytes)
Explored path: SparsityDetection.Path(Bool[], 1)
Calling sparse AD without cache
285.880 ms (2999576 allocations: 218.15 MiB)
Calling sparse AD with cache
21.310 ms (0 allocations: 0 bytes)
```
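As a quick sanity check, the quoted ratios are just the cached AD time divided by the value time from the output above:

```julia
# Ratios of (cached sparse AD time) / (value version time), per n.
value_time = [18.500e-9, 2.522e-6, 2.736e-3]   # n = 2, 1000, 1000000
cached_ad  = [68.135e-9, 12.100e-6, 21.310e-3]
println(round.(cached_ad ./ value_time, digits = 1))  # [3.7, 4.8, 7.8]
```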
Fixed by PR #146. Thanks!
As mentioned in issue #136, I am observing a larger-than-expected performance gap between the AD and non-AD runtimes. Here is an MWE with a simple sparsity pattern that isolates the behavior and performs some sanity checks:
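(The full script is not reproduced here; the following is a minimal sketch of such a setup, assuming the SparseDiffTools `forwarddiff_color_jacobian!`/`ForwardColorJacCache` API and a hypothetical function name `f!` — not the exact code from the issue.)

```julia
using SparseArrays, SparseDiffTools, BenchmarkTools

# Cyclic diff: y[i] = x[i+1] - x[i], wrapping around at the end.
function f!(y, x)
    n = length(x)
    @inbounds for i in 1:n-1
        y[i] = x[i+1] - x[i]
    end
    y[end] = x[1] - x[end]
    return nothing
end

n = 1000
x = rand(n)
y = similar(x)

# Sparsity pattern: diagonal + superdiagonal plus the (n, 1) corner entry.
sparsity = spdiagm(0 => ones(n), 1 => ones(n - 1))
sparsity[n, 1] = 1.0
J = copy(sparsity)
colors = [isodd(j) ? 1 : 2 for j in 1:n]  # alternating 1,2 coloring (valid for even n)

@btime f!($y, $x)  # value version

@btime forwarddiff_color_jacobian!($J, f!, $x;
    colorvec = $colors, sparsity = $sparsity)  # sparse AD without cache

cache = ForwardColorJacCache(f!, x; colorvec = colors, sparsity = sparsity)
@btime forwarddiff_color_jacobian!($J, f!, $x, $cache)  # sparse AD with cache
```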
The function is a cyclic version of `diff`, the coloring is just 1, 2 alternating, and the Jacobian is a general sparse matrix. Filling the Jacobian should ideally cost somewhere around 2-4x the value function call from the two AD passes, since each Dual contains two floats, if we disregard cache effects and the cost of dealing with the sparse data structure. Output for n = 2, 1000 and 1,000,000:

We see that the cache seems to work: it gives an 8x or so improvement in runtime and eliminates the allocations while yielding the same output, which is good! The ratio between the value and the AD versions is 3.72 for n = 2, 17.3 for n = 1000 and 23.3 for n = 1e6, so there may be something here that is not linear with respect to the input size... or I have a bug somewhere in my setup. My `versioninfo()`:

Relevant packages: