Closed: ViralBShah closed this issue 11 years ago.
I am having trouble replicating this.
julia> s=sparse(ones(1000,1000))
1000x1000 sparse matrix with 1000000 nonzeros:
[1 , 1] = 1.0
[2 , 1] = 1.0
[3 , 1] = 1.0
[4 , 1] = 1.0
[... I omitted the rest of the output... ]
julia> s*s;
no method *(SparseMatrixCSC{Float64,Int64},SparseMatrixCSC{Float64,Int64})
in method_missing at base.jl:60
I have run make with the latest commit. Any ideas what I might be doing wrong?
You need to do load("linalg_sparse.jl") as well.
-viral
On 24-Jun-2012, at 5:03 PM, Chris DuBois wrote: [quote of the previous message trimmed]
SparseAccumulator overhead is most likely quite high. The code should be reimplemented along the lines of this CXSparse implementation:
http://www.cise.ufl.edu/research/sparse/CXSparse/CXSparse/Source/cs_multiply.c
I am no expert in this area, but I posted Julia code for the naive sparse matrix multiplication described in Gustavson 1978. Maybe it can serve as a useful starting point for someone more versed in making Julia code fast. As it stands, the method is a bit slower than Julia's current implementation; see the example at the bottom of the gist.
The code you mention seems to be LGPL'd, so any derivative work would also have to be LGPL'd, correct? The same source code is also printed in Davis's 2006 book, though. The approach appears very similar to Gustavson's method, though I didn't check this thoroughly.
This is essentially what CSparse also does. I will try it out.
On lines 32 and 34, you are recomputing VA[jp] on every iteration. The compiler is not yet smart enough to hoist it out of the loop. If you do so manually, you get the same speed as the existing Julia code. I would have expected this version to be faster, since it does less work and is simpler code, but there seems to be little difference.
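For readers following along, the manual hoisting being suggested looks like this. A Python illustration with made-up names (`VA`, `cols`, `jp` are stand-ins, not the gist's actual variables); the point is simply that the loop-invariant load is read once instead of on every iteration:

```python
def scaled_gather(VA, cols, jp):
    # Unhoisted: VA[jp] is re-read on every trip through the loop,
    # which an insufficiently smart compiler will not optimize away.
    out = []
    for k in cols:
        out.append(VA[jp] * k)
    return out

def scaled_gather_hoisted(VA, cols, jp):
    # Hoisted: the loop-invariant load is pulled out of the loop once.
    v = VA[jp]
    return [v * k for k in cols]
```

Both produce identical results; only the number of loads differs.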
@doobwa Could you license this under the MIT license? With a bit of cleanup, this can replace the existing code, even though the speed is the same.
@JeffBezanson I believe this is the same algorithm that Matlab implements. We are off by a factor of 2 compared to a C implementation. Any ideas on what we can do to match performance here, or should we call a library function from CXSparse to get good performance?
function matmul{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
    mA, nA = size(A)
    mB, nB = size(B)
    if nA != mB; error("mismatched dimensions"); end
    colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
    colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
    nnzC = length(nzvalA) + length(nzvalB)
    colptrC = Array(Ti, nB+1)
    rowvalC = Array(Ti, nnzC)
    nzvalC = Array(Tv, nnzC)
    ip = 1
    xb = zeros(Ti, mA)
    x = zeros(Tv, mA)
    for i in 1:nB
        colptrC[i] = ip
        for jp in colptrB[i]:(colptrB[i+1] - 1)
            j = rowvalB[jp]
            nzB = nzvalB[jp]
            for kp in colptrA[j]:(colptrA[j+1] - 1)
                k = rowvalA[kp]
                nzA = nzvalA[kp]
                if xb[k] != i
                    rowvalC[ip] = k
                    ip += 1
                    xb[k] = i
                    x[k] = nzA * nzB
                else
                    x[k] += nzA * nzB
                end
            end
        end
        for vp in colptrC[i]:(ip - 1)
            nzvalC[vp] = x[rowvalC[vp]]
        end
    end
    colptrC[nB+1] = ip
    return SparseMatrixCSC(mA,nB,colptrC,rowvalC,nzvalC)
end
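For readers outside Julia, here is a minimal Python sketch of the same Gustavson-style column-by-column CSC product (all names are illustrative; the input is a CSC triplet of column pointers, row indices, and values with 0-based indexing). Python lists are grown on the fly, which sidesteps the output-size estimation problem that comes up later in this thread:

```python
def csc_matmul(mA, nA, Acp, Ari, Anz, mB, nB, Bcp, Bri, Bnz):
    """Gustavson-style CSC product C = A*B (0-based indices).

    (Acp, Ari, Anz) are A's column pointers, row indices, and values;
    likewise for B. A sketch for illustration, not the Julia source."""
    assert nA == mB, "mismatched dimensions"
    Ccp = [0] * (nB + 1)
    Cri, Cnz = [], []
    xb = [-1] * mA          # last output column in which each row was touched
    x = [0.0] * mA          # dense accumulator for the current column
    for i in range(nB):
        start = len(Cri)
        for jp in range(Bcp[i], Bcp[i + 1]):
            j, nzB = Bri[jp], Bnz[jp]
            for kp in range(Acp[j], Acp[j + 1]):
                k, nzA = Ari[kp], Anz[kp]
                if xb[k] != i:              # first touch of row k this column
                    xb[k] = i
                    x[k] = nzA * nzB
                    Cri.append(k)
                else:                       # accumulate into existing entry
                    x[k] += nzA * nzB
        for vp in range(start, len(Cri)):   # scatter accumulator into C
            Cnz.append(x[Cri[vp]])
        Ccp[i + 1] = len(Cri)
    return Ccp, Cri, Cnz
```

As the thread notes later, the row indices within each output column are not guaranteed to come out sorted.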
Also, I can get a 30% speedup here if I enable all the optimization passes in codegen.cpp. It helps the code above, but not the matrix multiply that is currently implemented.
@doobwa Your implementation reads like it is for row-wise storage. I would have expected the outer loop to go over columns of B and the inner loops over columns of A. I have corrected that in my implementation above. Performance is the same for the test cases presented in this issue, since the access patterns and work do not change.
@ViralBShah I have updated the gist with your suggested changes and the MIT license. My comments in the original version were not correct (when I said row I meant column!). The order of those loops doesn't really matter, so I have kept them as they are in the original pseudocode.
I also added a small bug fix regarding the rowval and nzval values returned (which incidentally also occurs in the current sparse matrix multiplication as well -- I can submit a pull request if you are interested).
With your changes I am getting some marginal speedups over the current implementation.
Yes, please do issue a pull request.
-viral
On 26-Jun-2012, at 9:20 PM, Chris DuBois wrote: [quote of the previous message trimmed]
That's right, the changes only give marginal speedups. If you enable the optimization passes that are currently commented out in codegen.cpp, you get more aggressive optimization and some noticeable speedups. We probably need more compiler work for this code to perform better, but we have to wait for @JeffBezanson to chime in.
-viral
Stupid lack of undo and tiny phone buttons. Sorry @doobwa, didn't mean to kill that, was just going to add a #977 so Github would cross-reference.
In some cases (that are somewhat reasonable for my particular application) I am getting sizeable speedups compared to the old version, which is encouraging, though I don't have a C or Matlab implementation to compare against.
N = 100000
n = 10000  # number of nonzero elements
ix = int(floor(N*N * rand(n)) + 1)  # pick random linear indices into the matrix
I = [mod(i-1,N) + 1 for i in ix]    # row (1-based)
J = [div(i-1,N) + 1 for i in ix]    # col (1-based)
V = int(ones(length(I)))            # all 1's
s = sparse(int32(vec(I)), int32(vec(J)), vec(V), N, N)
@time a = s*s;
@time b = matmul(s,s);

julia> @time a=s*s;
elapsed time: 0.03193998336791992 seconds

julia> @time b=matmul(s,s);
elapsed time: 0.0016660690307617188 seconds
@doobwa Did you try non-square matrices?
julia> matmul(sparse(rand(100,10)),sparse(rand(10,100)))
in matmul: arrayref: index out of range
in matmul at matmul.jl:29
We need the ability to grow the allocated region.
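The fix that was eventually adopted (the resize! calls in the code further down) amounts to geometric growth of the output buffers. A Python sketch of the amortized-doubling idea, with an illustrative helper name:

```python
def ensure_capacity(buf, needed, fill=0):
    """Grow buf geometrically so that at least `needed` slots exist.

    Growing by max(current size, shortfall) keeps the total copying
    cost linear in the final size: amortized O(1) per appended entry,
    instead of O(n) per entry with fixed-increment growth."""
    if needed > len(buf):
        grow = max(len(buf), needed - len(buf))
        buf.extend([fill] * grow)
    return buf
```

With this policy a buffer of 4 slots asked for 6 jumps straight to 8, so repeated small requests do not trigger repeated reallocations.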
I added the ability to grow, and merged your patches for del manually. Even after that, I was getting errors with the code in the gist. With my version of the switched loops, it works fine. I am checking my version of the column-wise iteration into master for now. Can you check it out?
The runtime is now down to 4.9 seconds from 6 seconds earlier. Disabling emit_bounds_check and enabling all LLVM passes gets it further down to 3.9 seconds. Adding FPM->add(createBBVectorizePass()) at codegen.cpp:2144 takes the runtime further down to 3.6 seconds.
However, it is still about two times slower than the C implementation that is in Matlab.
This is now taking 6.3 seconds again. Seems like we have regressed a bit on performance.
We really need to build up a large benchmark suite and test performance with each new build on all of the tests.
That is exactly what I have been thinking. We need a new issue for continuous performance benchmarking. As has been discussed elsewhere, we actually do have some machines at MIT that can do this reliably. Even if not on every build, with a cron job, it should be easy to do this every day.
Looking through the code here, I realized that the work has increased. There is now a double transpose at the end to ensure that all row indices are in sorted order, and we now use 64-bit indices by default for sparse matrices on x86_64.
@ViralBShah see if that helps
Now again at 3.3 seconds. Still slower than Matlab, but I am hoping that @inbounds may help close the gap some more.
The @inbounds macro takes this to 2.7 seconds, and I have some potential code improvements in mind that may completely close this gap.
When the issue of pointer being reloaded at each iteration is addressed (as discussed in #3268), the performance of this will be further improved.
@JeffBezanson Over the last couple of weeks this has slowed down again, and it is now taking almost 4.0 seconds.
I also observed something strange: using Ct.' was significantly faster for me than calling transpose!. I don't see how that could be, but I don't fully understand the code.
For a minute I thought this was my recent transpose!, so I investigated. In the course of my investigation, I noticed something really strange: the floating-point operations in (*) are vastly slower than the integer operations. First, I changed the function to show the types, to split components onto separate lines, and to make sure there weren't any ordering effects (i.e., profiler oddities, which I'm always paranoid about but which don't seem to materialize here):
function (*){Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
    mA, nA = size(A)
    mB, nB = size(B)
    if nA != mB; error("mismatched dimensions"); end
    colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
    @show typeof(rowvalA)
    @show size(rowvalA)
    @show typeof(nzvalA)
    @show size(nzvalA)
    colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
    # TODO: Need better estimation of result space
    nnzC = min(mA*nB, length(nzvalA) + length(nzvalB))
    colptrC = Array(Ti, nB+1)
    rowvalC = Array(Ti, nnzC)
    nzvalC = Array(Tv, nnzC)
    @inbounds begin
        ip = 1
        xb = zeros(Ti, mA)
        x = zeros(Tv, mA)
        for i in 1:nB
            if ip + mA - 1 > nnzC
                resize!(rowvalC, nnzC + max(nnzC,mA))
                resize!(nzvalC, nnzC + max(nnzC,mA))
                nnzC = length(nzvalC)
            end
            colptrC[i] = ip
            for jp in colptrB[i]:(colptrB[i+1] - 1)
                j = rowvalB[jp]      # line 94
                nzB = nzvalB[jp]     # line 95. Vastly slower than 94
                for kp in colptrA[j]:(colptrA[j+1] - 1)
                    tmp = nzvalA[kp] # line 97. Vastly slower than 100
                    nzC = tmp * nzB  # line 98
                    # nzC = nzvalA[kp] * nzB
                    k = rowvalA[kp]  # line 100
                    if xb[k] != i
                        rowvalC[ip] = k
                        ip += 1
                        xb[k] = i
                        x[k] = nzC   # line 105
                    else
                        x[k] += nzC  # line 107
                    end
                end
            end
            for vp in colptrC[i]:(ip - 1)
                nzvalC[vp] = x[rowvalC[vp]]
            end
        end
        colptrC[nB+1] = ip
    end
    splice!(rowvalC, colptrC[end]:length(rowvalC))
    splice!(nzvalC, colptrC[end]:length(nzvalC))
    # The Gustavson algorithm does not guarantee the product to have sorted row indices.
    return ((SparseMatrixCSC(mA, nB, colptrC, rowvalC, nzvalC).').')
end
Now:
julia> @sprofile s*s;
typeof(rowvalA) => Array{Int64,1}
size(rowvalA) => (1000000,)
typeof(nzvalA) => Array{Float64,1}
size(nzvalA) => (1000000,)
julia> sprofile_tree(true)
6479 julia; ???; line: 0
6479 /lib/x86_64-linux-gnu/libc.so.6; __libc_start_main; line: 237
6479 julia; ???; line: 0
6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; julia_trampoline; line: 131
6479 julia; ???; line: 0
6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
6479 ???; ???; line: 0
6479 client.jl; _start; line: 369
6479 client.jl; run_repl; line: 166
6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
6479 ???; ???; line: 0
6479 client.jl; eval_user_input; line: 91
6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
6479 /home/tim/.julia/Profile/src/sprofile.jl; anonymous; line: 309
6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
1 linalg/sparse.jl; *; line: 93
6 linalg/sparse.jl; *; line: 94
341 linalg/sparse.jl; *; line: 95
8 linalg/sparse.jl; *; line: 96
1107 linalg/sparse.jl; *; line: 97
1711 linalg/sparse.jl; *; line: 98
124 linalg/sparse.jl; *; line: 100
516 linalg/sparse.jl; *; line: 101
3 linalg/sparse.jl; *; line: 102
339 linalg/sparse.jl; *; line: 105
2244 linalg/sparse.jl; *; line: 107
2 linalg/sparse.jl; *; line: 111
14 linalg/sparse.jl; *; line: 112
63 linalg/sparse.jl; *; line: 122
10 sparse/csparse.jl; _rowcounts; line: 126
1 sparse/csparse.jl; transpose!; line: 145
2 sparse/csparse.jl; transpose!; line: 147
18 sparse/csparse.jl; transpose!; line: 148
32 sparse/csparse.jl; transpose!; line: 149
The transpose! function expects some setup work to have been done on colptr, which can be expensive, and without which the result is likely to be incorrect. This is not documented, and the function should perhaps have a different name.
Jeff - the slowdown existed prior to my commits today. My commits improve performance, but those changes are basically memory-related and unrelated to the actual multiplication.
After fbf3ed8c586f0b116c31d35d990e73ff249b84c1 the timing is now at 2.95 seconds.
Some distillation of this code should go into our performance suite.
sparse matmul is already in the perf suite. Codespeed should help identify these regressions. I just happened to stumble across this as I was updating it.
Julia: [Codespeed benchmark graph omitted]
Matlab: [benchmark graph omitted]