
sparse multiplication is slow #942

Closed ViralBShah closed 11 years ago

ViralBShah commented 12 years ago

Julia:

julia> s = sparse(ones(1000,1000));
julia> @time s*s
elapsed time: 6.074394941329956 seconds

Matlab:

>> s = sparse(ones(1000,1000));
>> tic; s*s; toc;
Elapsed time is 2.198466 seconds.
doobwa commented 12 years ago

I am having trouble replicating this.

julia> s=sparse(ones(1000,1000))
1000x1000 sparse matrix with 1000000 nonzeros:
[1   ,    1]  =  1.0
[2   ,    1]  =  1.0
[3   ,    1]  =  1.0
[4   ,    1]  =  1.0
[... I omitted the rest of the output... ]
julia> s*s;
no method *(SparseMatrixCSC{Float64,Int64},SparseMatrixCSC{Float64,Int64})
 in method_missing at base.jl:60

I have run make with the latest commit. Any ideas what I might be doing wrong?

ViralBShah commented 12 years ago

You need to do load("linalg_sparse.jl") as well.

-viral
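
A minimal sketch of the fix in a fresh session (load was the 2012-era mechanism for pulling extra source files into scope, as in the comment above):

julia> load("linalg_sparse.jl")   # bring the sparse linear-algebra methods into scope

julia> s * s;                     # *(::SparseMatrixCSC, ::SparseMatrixCSC) now dispatches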


ViralBShah commented 12 years ago

SparseAccumulator overhead is most likely quite high. The code should be reimplemented along the lines of this CXSparse implementation:

http://www.cise.ufl.edu/research/sparse/CXSparse/CXSparse/Source/cs_multiply.c

doobwa commented 12 years ago

I am no expert in this area, but I posted Julia code for the naive matrix multiplication described in Gustavson (1978). Maybe it can serve as a useful starting point for someone more versed in making Julia code fast. As it stands, the method is a bit slower than Julia's current implementation; see the example at the bottom of the gist.

The code you mention seems to be LGPL'd, so any derivative work would also have to be LGPL'd, correct? The same source code is also printed in Davis's 2006 book, though. The approach appears very similar to Gustavson's method, though I didn't check this thoroughly.

ViralBShah commented 12 years ago

This is essentially what CSparse also does. I will try it out.

ViralBShah commented 12 years ago

On lines 32 and 34, you are recomputing VA[jp] every time. The compiler is not yet smart enough to pull it out of the loop. If you do so, you get the same speed as the existing Julia code. I would have expected this to be faster, since it does less work and is simpler code - but there seems to be little difference.
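
A minimal sketch of the hoisting, assuming the gist's inner loop multiplies each entry of A's column by VA[jp] (names as in the gist):

# Before: VA[jp] is re-read from the array on every inner iteration.
for kp in colptrA[j]:(colptrA[j+1] - 1)
    x[rowvalA[kp]] += nzvalA[kp] * VA[jp]
end

# After: hoist the loop-invariant load into a local, once per jp.
nzB = VA[jp]
for kp in colptrA[j]:(colptrA[j+1] - 1)
    x[rowvalA[kp]] += nzvalA[kp] * nzB
end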

ViralBShah commented 12 years ago

@doobwa Could you license this under the MIT license? With a bit of cleanup, this can replace the existing code, even though the speed is the same.

ViralBShah commented 12 years ago

@JeffBezanson I believe this is the same algorithm that Matlab implements. We are off by a factor of 2 compared to a C implementation. Any ideas on what we can do to match performance here, or should we call a library function from CXSparse to get good performance?

function matmul{Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
  mA, nA = size(A)
  mB, nB = size(B)
  if nA != mB; error("mismatched dimensions"); end
  colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
  colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
  # Note: this nnz estimate can be too small (see the non-square failure below).
  nnzC = length(nzvalA) + length(nzvalB)
  colptrC = Array(Ti, nB+1)
  rowvalC = Array(Ti, nnzC)
  nzvalC = Array(Tv, nnzC)
  ip = 1
  xb = zeros(Ti, mA)  # xb[k] == i marks row k as already seen in column i
  x  = zeros(Tv, mA)  # dense accumulator for the current column of C
  for i in 1:nB
      colptrC[i] = ip
      for jp in colptrB[i]:(colptrB[i+1] - 1)
          j = rowvalB[jp]
          nzB = nzvalB[jp]
          for kp in colptrA[j]:(colptrA[j+1] - 1)
              k = rowvalA[kp]
              nzA = nzvalA[kp]
              if xb[k] != i
                  rowvalC[ip] = k
                  ip += 1
                  xb[k] = i
                  x[k] = nzA * nzB
              else
                  x[k] += nzA * nzB
              end
          end
      end
      # Gather the accumulated values for column i into nzvalC.
      for vp in colptrC[i]:(ip - 1)
          nzvalC[vp] = x[rowvalC[vp]]
      end
  end
  colptrC[nB+1] = ip

  return SparseMatrixCSC(mA,nB,colptrC,rowvalC,nzvalC)
end

ViralBShah commented 12 years ago

Also, I can get a 30% speedup here if I enable all the optimization passes in codegen.cpp. It helps the code above, but not the matrix multiply that is currently implemented.

ViralBShah commented 12 years ago

@doobwa Your implementation reads like it is written for row-wise storage. I would have expected the outer loop to run over columns of B, with the inner loops running over columns of A. I have corrected that in my implementation above. Performance is the same for the test cases presented in this issue, since the access patterns and amount of work do not change.
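
For reference, a minimal sketch of why the column-wise loop order fits the storage: SparseMatrixCSC stores each column contiguously, so iterating a column touches rowval and nzval sequentially.

# Iterating column j of a SparseMatrixCSC visits memory in order:
for p in A.colptr[j]:(A.colptr[j+1] - 1)
    row = A.rowval[p]  # row index of the p-th stored entry
    val = A.nzval[p]   # its value
    # ... use (row, j, val)
end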

doobwa commented 12 years ago

@ViralBShah I have updated the gist with your suggested changes and the MIT license. My comments in the original version were not correct (when I said row I meant column!). The order of those loops doesn't really matter, so I have kept them as they are in the original pseudocode.

I also added a small bug fix regarding the rowval and nzval values returned (which incidentally also occurs in the current sparse matrix multiplication -- I can submit a pull request if you are interested).

With your changes I am getting some marginal speedups over the current implementation.

ViralBShah commented 12 years ago

Yes, please do issue a pull request.

-viral


ViralBShah commented 12 years ago

That's right, the changes only give marginal speedups. If you enable all the other passes in codegen.cpp, you get more aggressive optimization and some noticeable speedups. We probably need more compiler work for this code to perform better - but we have to wait for @JeffBezanson to chime in.

-viral


pao commented 12 years ago

Stupid lack of undo and tiny phone buttons. Sorry @doobwa, didn't mean to kill that; I was just going to add a reference to #977 so GitHub would cross-reference.

doobwa commented 12 years ago

In some cases (that are somewhat reasonable for my particular application) I am getting sizeable speedups compared to the old version, which is encouraging, though I don't have a C or Matlab implementation to compare against.

N = 100000
n = 10000  # number of nonzero elements
ix = int(floor(N*N * rand(n)) + 1)  # pick random linear indices into the N-by-N matrix
I = [mod(i-1,N) + 1 for i = ix]  # row (1-based; mod(i,N) would yield an invalid 0 index)
J = [div(i-1,N) + 1 for i = ix]  # col (1-based)
V = int(ones(length(I)))  # all 1's

s = sparse(int32(vec(I)),int32(vec(J)),vec(V),N,N)

@time a=s*s;
@time b=matmul(s,s);

julia> @time a=s*s;
elapsed time: 0.03193998336791992 seconds

julia> @time b=matmul(s,s);
elapsed time: 0.0016660690307617188 seconds

ViralBShah commented 12 years ago

@doobwa Did you try non-square matrices?

julia> matmul(sparse(rand(100,10)),sparse(rand(10,100)))
in matmul: arrayref: index out of range
 in matmul at matmul.jl:29
ViralBShah commented 12 years ago

We need the ability to grow the allocated region.
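
A minimal sketch of such a growth check, mirroring the pattern that appears in the instrumented version further down (with mA rows, one column of C can add at most mA new entries):

# Before scattering column i, ensure room for up to mA new entries;
# at least double the buffers so growth stays amortized.
if ip + mA - 1 > nnzC
    resize!(rowvalC, nnzC + max(nnzC, mA))
    resize!(nzvalC, nnzC + max(nnzC, mA))
    nnzC = length(nzvalC)
end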

ViralBShah commented 12 years ago

I added the ability to grow, and merged your patches for del manually. Even then, I was getting errors with the code in the gist. With my version of the switched loops, it works fine. I am checking my version of the column-wise iteration into master for now. Can you check it out?

ViralBShah commented 12 years ago

The runtime is now down to 4.9 seconds from the earlier 6 seconds. Disabling emit_bounds_check and enabling all LLVM passes gets it further down to 3.9 seconds. Adding FPM->add(createBBVectorizePass()) at codegen.cpp:2144 takes the runtime down to 3.6 seconds.

However, it is still about two times slower than the C implementation that is in Matlab.

ViralBShah commented 11 years ago

This is now taking 6.3 seconds again. Seems like we have regressed a bit on performance.

johnmyleswhite commented 11 years ago

We really need to build up a large benchmark suite and test performance with each new build on all of the tests.

ViralBShah commented 11 years ago

That is exactly what I have been thinking. We need a new issue for continuous performance benchmarking. As has been discussed elsewhere, we actually do have some machines at MIT that can do this reliably. Even if not on every build, with a cron job, it should be easy to do this every day.

ViralBShah commented 11 years ago

Looking through the code here, I realized that the amount of work has increased: there is now a double transpose at the end to ensure that all row indices are in sorted order, and sparse matrices now use 64-bit indices by default on x86_64.
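
For reference, a sketch of the double-transpose trick: transposing a CSC matrix emits each output column's entries in ascending row order, so two transposes return the same matrix with sorted row indices.

# C from the Gustavson algorithm may have unsorted row indices within
# each column; two transposes sort them, at the cost of two full
# passes over the matrix.
Csorted = (C.').'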

JeffBezanson commented 11 years ago

@ViralBShah see if that helps

ViralBShah commented 11 years ago

Now again at 3.3 seconds. Still slower than Matlab, but I am hoping that @inbounds may help close the gap some more.

ViralBShah commented 11 years ago

The @inbounds macro takes this to 2.7 seconds, and I have some potential code improvements in mind that may completely close the gap.
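
A minimal sketch of the change, applied to the scatter loop from the implementation above:

# @inbounds asserts that all indexing in the block is in range,
# letting codegen omit the bounds checks.
@inbounds for vp in colptrC[i]:(ip - 1)
    nzvalC[vp] = x[rowvalC[vp]]
end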

lindahua commented 11 years ago

When the issue of the array pointer being reloaded at each iteration is addressed (as discussed in #3268), the performance of this will improve further.

ViralBShah commented 11 years ago

@JeffBezanson Over the last couple of weeks, this has slowed down again; it is now taking almost 4.0 seconds.

JeffBezanson commented 11 years ago

I also observed something strange: using Ct.' was significantly faster for me than calling transpose!. I don't see how that could be, but I don't fully understand the code.

timholy commented 11 years ago

For a minute I thought this was due to my recent transpose!, so I investigated. In the course of my investigation, I noticed something really strange: the floating-point operations in (*) are vastly slower than the integer operations.

First, I changed the function to see the types, to split components up onto separate lines, and to make sure there weren't any order effects (i.e., profiler oddities, which I'm always paranoid about but don't seem to materialize):

function (*){Tv,Ti}(A::SparseMatrixCSC{Tv,Ti}, B::SparseMatrixCSC{Tv,Ti})
    mA, nA = size(A)
    mB, nB = size(B)
    if nA != mB; error("mismatched dimensions"); end

    colptrA = A.colptr; rowvalA = A.rowval; nzvalA = A.nzval
    @show typeof(rowvalA)
    @show size(rowvalA)
    @show typeof(nzvalA)
    @show size(nzvalA)
    colptrB = B.colptr; rowvalB = B.rowval; nzvalB = B.nzval
    # TODO: Need better estimation of result space
    nnzC = min(mA*nB, length(nzvalA) + length(nzvalB))
    colptrC = Array(Ti, nB+1)
    rowvalC = Array(Ti, nnzC)
    nzvalC = Array(Tv, nnzC)

    @inbounds begin
        ip = 1
        xb = zeros(Ti, mA)
        x  = zeros(Tv, mA)
        for i in 1:nB
            if ip + mA - 1 > nnzC
                resize!(rowvalC, nnzC + max(nnzC,mA))
                resize!(nzvalC, nnzC + max(nnzC,mA))
                nnzC = length(nzvalC)
            end
            colptrC[i] = ip
            for jp in colptrB[i]:(colptrB[i+1] - 1)
                j = rowvalB[jp]                                     # line 94
                nzB = nzvalB[jp]                                 # line 95. Vastly slower than 94
                for kp in colptrA[j]:(colptrA[j+1] - 1)
                    tmp = nzvalA[kp]                            # line 97. Vastly slower than 100
                    nzC = tmp * nzB                             # line 98
#                     nzC = nzvalA[kp] * nzB
                    k = rowvalA[kp]                               # line 100
                    if xb[k] != i
                        rowvalC[ip] = k
                        ip += 1
                        xb[k] = i
                        x[k] = nzC                                   # line 105
                    else
                        x[k] += nzC                                 # line 107
                    end
                end
            end
            for vp in colptrC[i]:(ip - 1)
                nzvalC[vp] = x[rowvalC[vp]]
            end
        end
        colptrC[nB+1] = ip
    end

    splice!(rowvalC, colptrC[end]:length(rowvalC))
    splice!(nzvalC, colptrC[end]:length(nzvalC))

    # The Gustavson algorithm does not guarantee the product to have sorted row indices.
    return ((SparseMatrixCSC(mA, nB, colptrC, rowvalC, nzvalC).').')
end

Now:

julia> @sprofile s*s;
typeof(rowvalA) => Array{Int64,1}
size(rowvalA) => (1000000,)
typeof(nzvalA) => Array{Float64,1}
size(nzvalA) => (1000000,)

julia> sprofile_tree(true)
6479 julia; ???; line: 0
 6479 /lib/x86_64-linux-gnu/libc.so.6; __libc_start_main; line: 237
  6479 julia; ???; line: 0
   6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; julia_trampoline; line: 131
    6479 julia; ???; line: 0
     6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
      6479 ???; ???; line: 0
       6479 client.jl; _start; line: 369
        6479 client.jl; run_repl; line: 166
         6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
          6479 ???; ???; line: 0
           6479 client.jl; eval_user_input; line: 91
            6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
             6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
              6479 /home/tim/.julia/Profile/src/sprofile.jl; anonymous; line: 309
               6479 /home/tim/src/julia/usr/bin/../lib/libjulia-release.so; ???; line: 0
                1    linalg/sparse.jl; *; line: 93
                6    linalg/sparse.jl; *; line: 94
                341  linalg/sparse.jl; *; line: 95
                8    linalg/sparse.jl; *; line: 96
                1107 linalg/sparse.jl; *; line: 97
                1711 linalg/sparse.jl; *; line: 98
                124  linalg/sparse.jl; *; line: 100
                516  linalg/sparse.jl; *; line: 101
                3    linalg/sparse.jl; *; line: 102
                339  linalg/sparse.jl; *; line: 105
                2244 linalg/sparse.jl; *; line: 107
                2    linalg/sparse.jl; *; line: 111
                14   linalg/sparse.jl; *; line: 112
                63   linalg/sparse.jl; *; line: 122
                 10 sparse/csparse.jl; _rowcounts; line: 126
                 1  sparse/csparse.jl; transpose!; line: 145
                 2  sparse/csparse.jl; transpose!; line: 147
                 18 sparse/csparse.jl; transpose!; line: 148
                 32 sparse/csparse.jl; transpose!; line: 149
ViralBShah commented 11 years ago

transpose! expects the caller to have done some setup work on colptr, which can be expensive and without which the result is likely to be incorrect. This is not documented, and the function should perhaps have a different name.
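
For illustration, a self-contained sketch of the two phases; the function name below is hypothetical, not the Base API. The point is that the counting pass that builds colptr is separate from the scatter pass that transpose! performs.

function transpose_sketch(S::SparseMatrixCSC)
    m, n = size(S)
    # Phase 1 (the setup transpose! assumes has been done): count the
    # nonzeros per row of S, i.e. per column of the transpose, then
    # prefix-sum the counts into column pointers.
    colptrT = zeros(Int, m + 1)
    colptrT[1] = 1
    for r in S.rowval
        colptrT[r + 1] += 1
    end
    for i in 2:(m + 1)
        colptrT[i] += colptrT[i - 1]
    end
    # Phase 2 (the scatter): walk S column by column, dropping each
    # entry into the next free slot of its destination column.
    rowvalT = Array(Int, length(S.rowval))
    nzvalT  = similar(S.nzval)
    w = copy(colptrT)  # next free slot per output column
    for j in 1:n, p in S.colptr[j]:(S.colptr[j+1] - 1)
        q = w[S.rowval[p]]
        rowvalT[q] = j
        nzvalT[q]  = S.nzval[p]
        w[S.rowval[p]] += 1
    end
    SparseMatrixCSC(n, m, colptrT, rowvalT, nzvalT)
end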

ViralBShah commented 11 years ago

Jeff - the slowdown existed prior to my commits today. My commits improve performance, but they are basically memory-related and unrelated to the actual multiplication.

ViralBShah commented 11 years ago

After fbf3ed8c586f0b116c31d35d990e73ff249b84c1 the timing is now at 2.95 seconds.

StefanKarpinski commented 11 years ago

Some distillation of this code should go into our performance suite.

ViralBShah commented 11 years ago

Sparse matmul is already in the perf suite, and Codespeed should help identify these regressions. I just happened to stumble across this one as I was updating the suite.