AshtonSBradley opened this issue 7 years ago
Neither of the two calculations should be doing extra conversions or multiplying by zero. The difference is whether BLAS is used.
We might want to change the default behavior to convert the arguments to the same type. That might also save us some compile time (and therefore test time), because `_generic_matmatmul!` is quite slow to compile.
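A minimal sketch of that idea (a hypothetical helper, not the actual Base code):

```julia
# Promote both arguments to a common element type before multiplying, so the
# product dispatches to a BLAS gemm instead of the generic fallback.
function promoting_mul(A::AbstractMatrix, B::AbstractMatrix)
    T = promote_type(eltype(A), eltype(B))
    return convert(AbstractMatrix{T}, A) * convert(AbstractMatrix{T}, B)
end
```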
Not sure I understand the solution here. Do you mean BLAS vs LAPACK? How do I force one over the other?
We have generic Julia code for matrix multiplication, not just BLAS. (LAPACK leverages BLAS to do more complex things like inversions and decompositions.)
BLAS supports only a small set of types: Float32/Float64 and the Complex versions thereof. I'm not sure whether there is a BLAS function for real*complex, but in your example I understand that the generic Julia code is used.
Nonetheless, using the striding mechanism we might be able to use BLAS calls with no extra allocations, I hope. Seems worthwhile and might be what MATLAB does.
@andreasnoack avoiding extra allocations can be important for large matrices, so temporary conversions might not be optimal.
I think Matlab stores arrays of complex numbers as a struct-of-arrays (i.e., two full-size real-valued arrays), whereas julia (and C) use an array-of-structs (a single array of complex-valued objects). So in Matlab this particular optimization is trivial. (Both strategies have their advantages.)
From a performance perspective, the overhead of converting the type would be trivial for anything except the smallest matrices (O(N^2) for the conversion vs O(N^3) to do the multiplication). If we're worried about memory consumption, we could use a hybrid approach and do the conversion on tiles.
Of course, best would be to get BLAS-level performance from Julia code and skip calling BLAS altogether. For some algorithms we're there, but not for matmul.
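The asymptotics are easy to check empirically (a small sketch; timings are machine-dependent):

```julia
using BenchmarkTools
N = 1024
A = randn(N, N) + im*randn(N, N)
B = randn(N, N)
@btime complex($B);       # the O(N^2) conversion alone
@btime complex($B) * $A;  # conversion plus O(N^3) zgemm; the gemm dominates
```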
> Nonetheless, using the striding mechanism we might be able to use BLAS calls with no extra allocations, I hope. Seems worthwhile and might be what MATLAB does.
I don't think this is the case, because BLAS-3 requires the first stride to be contiguous, and splitting the computation into a sequence of BLAS-2 calls (which allow non-unit strides) will most likely kill performance.
I don't think XBLAS implementations are well maintained, so copying to a complex matrix might be our best option here.
OK, it will be interesting to see how this develops. I don't know what was happening with Matlab when I ran the above tests, but revisiting it carefully, it takes only 55ms for the Matlab code posted above, so Julia seems to be slower than Matlab by a factor of more than 3 for B*A with B real and A complex, in the sense declared above.
Breaking up the calculation into real products and casting the result as complex brings the factor closer to 2.
This is certainly a common/important enough case that we should have an optimized method for it that does whatever we determine to be the most performant thing.
Update on timings for Julia 0.6 for real*complex matmul: Matlab 75ms, Julia 133ms.
Does anyone know what kind of speedup I should expect if I compile against MKL? This http://imgur.com/a/rBOo8 suggests much less than a factor of 2.
If I may make a suggestion, the documentation for `mul!` and its relatives should describe this issue. It is certainly not obvious that the order of the complex and real matrices can have an order-of-magnitude impact on performance. In fact, the library code might also try to correct for this with the transpose of the product of transposes, as suggested by @yanncalec (#30811), but this might have performance costs for large matrices.
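For reference, a sketch of that transpose trick (a hypothetical helper, with the transposes materialized up front; correctness follows from transpose(transpose(A)*transpose(B)) == B*A, and `transpose` rather than `adjoint` avoids unwanted conjugation):

```julia
# Rewrite real*complex as the transpose of complex*real. Whether this pays off
# depends on how fast the complex*real product is in a given Julia version and
# on the cost of materializing the transposed copies for large matrices.
tt_mul(B::Matrix{<:Real}, A::Matrix{<:Complex}) =
    copy(transpose(copy(transpose(A)) * copy(transpose(B))))
```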
It has been a while, and a few things may have improved (also my knowledge) going into 1.0.
I think I can summarize the situation now as follows. Matlab:

```matlab
Nrows = 1024;
Ncols = 1024;
A = rand(Nrows,Ncols) + i*rand(Nrows,Ncols);
B = rand(Nrows,Ncols);
tic;
for j = 1:50
    B*A;
end
toc/50
```

gives 53.6ms. In Julia, if I naively try the same operation I get:
```julia
julia> using BenchmarkTools

julia> Nrows = Ncols = 1024;

julia> A = randn(Nrows,Ncols) + im*randn(Nrows,Ncols);

julia> C = randn(Nrows,Ncols);

julia> @btime C*A;
  711.881 ms (8 allocations: 16.00 MiB)
```
But if I do the type conversion first (since the result must be complex), then
```julia
julia> B = randn(Nrows,Ncols) |> complex;

julia> @btime B*A;
  41.860 ms (2 allocations: 16.00 MiB)
```
(No MKL or anything fancy here.) So I think this is a non-issue once the user knows to convert the real matrix to complex up front. Safe to close this?
Performance confirmed (matrices four times bigger):

```julia
julia> @btime C*A;
  70.609 s (8 allocations: 247.08 MiB)

julia> @btime B*A;
  7.598 s (2 allocations: 247.08 MiB)
```
However, even with this we are still not in a good place. What if the matrix `B` also appears in another product, say `D*B`? If `B` was previously converted to a complex matrix, some performance is sacrificed: that product would have been much faster (about twice as fast on my machine) if `B` were still real.
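If memory allows, the simple way out is to keep both representations around (a sketch, with `D` as in the example above):

```julia
Bc = complex(B)  # one-off O(N^2) copy of B
Bc * A           # complex*complex, fast zgemm
D * B            # B stays real, so this product keeps its speed
```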
I just bumped into this. Any chance of a fix sometime soon?
FYI, if someone overloaded the multiplication to

```julia
function *(A :: Array{Real}, Z::Array{Complex{T}}) where T
    zfl = reinterpret(T, Z)
    zre = @view zfl[1:2:end-1]
    zim = @view zfl[2:2:end]
    return zre*A .+ zim*A.*im
end
```

it would be a speed improvement.
We need a PR if there is performance to be gained.
> ```julia
> function *(A :: Array{Real}, Z::Array{Complex{T}}) where T
>     zfl = reinterpret(T, Z)
>     zre = @view zfl[1:2:end-1]
>     zim = @view zfl[2:2:end]
>     return zre*A .+ zim*A.*im
> end
> ```
Isn't this the product `Z*A` instead of `A*Z`?
Yeah. That said, it can pretty easily be made to do the other.
Does it rely on implementation details? Will `reinterpret` always produce the correct matrices?
It relies on arrays of complex numbers ending up in this form, but I'm pretty sure the memory layout of arrays and complex numbers is basically public behavior, given how many people rely on it.
Yes, we can count on the representation being stable.
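For example, the interleaved layout is easy to see directly:

```julia
Z = [1.0 + 2.0im  3.0 + 4.0im]  # 1×2 complex matrix
reinterpret(Float64, Z)          # 2×2 real matrix [1.0 3.0; 2.0 4.0]:
                                 # re/im parts interleave along dimension 1
```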
Excellent, looks like that might be a solution then.
OK, interesting @oscardssmith. I couldn't get that code to work due to size issues. The following does run, but is much slower. Have I messed this up somehow?
```julia
function azmul(A::Matrix{Float64}, Z::Matrix{Complex{T}}) where T
    zfl = reinterpret(T, Z)
    zre = @view zfl[1:2:end-1, :]
    zim = @view zfl[2:2:end, :]
    return A*zre .+ (A*zim).*im
end

N = 1000
A = randn(N, N)
Z = randn(N, N) + im*randn(N, N)
@time A*Z;          # 0.066214 seconds
@time azmul(A, Z);  # 1.946057 seconds
```
@AshtonSBradley Is it possible that in the expression `A*zim.*im` you create another complex matrix `zim.*im`, which then multiplies a real matrix? Would these parentheses help: `(A*zim).*im`?
@PetrKryslUCSD unfortunately no, that doesn't fix it, but I will add the syntax to the example
How about `Complex.(A*zre, A*zim)`?
no gain there either @andyferris
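One plausible explanation, consistent with the stride discussion earlier in this thread: the every-other-row views have a non-unit first stride, so `A*zre` and `A*zim` cannot be forwarded to BLAS and fall back to the generic kernel. That is easy to check:

```julia
Z = randn(1000, 1000) + im*randn(1000, 1000)
zfl = reinterpret(Float64, Z)    # 2000×1000 real view of the complex data
zre = @view zfl[1:2:end-1, :]
strides(zre)                     # (2, 2000): first stride is 2, not 1,
                                 # so gemm cannot be called on this view
```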
> I think Matlab stores arrays of complex numbers as a struct-of-arrays (i.e., two full-size real-valued arrays), whereas julia (and C) use an array-of-structs (a single array of complex-valued objects). So in Matlab this particular optimization is trivial. (Both strategies have their advantages.)
That's quite interesting. Is Matlab faster because of this (at least in this case, or for all complex math)? If so, should Julia support this layout too? I suppose this could be done in a package, but it's not straightforward, since you then need two arrays, or at least one array type that encapsulates both. Would it be viable to use a regular array where the first half stores the real parts and the rest the imaginary parts? [Then in the degenerate scalar case you get the current complex type, but I'm not sure that's helpful or exploitable.] What does e.g. NumPy do; does it support both ways (plausible, since it supports e.g. Fortran- and C-style array orders)? Does MKL? I think Matlab (and NumPy) use it.
[One advantage of our way is that you can enlarge arrays, appending at the end and I think even at the front, but that's not possible in a shared array except in O(n). With two arrays/not shared it's still possible in O(1), if the array can be enlarged without moving it.]
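A hand-rolled struct-of-arrays sketch of what such a package type might look like (names made up; StructArrays.jl explores this direction for general structs):

```julia
# Store the real and imaginary parts as two separate dense matrices. In this
# layout, real*complex needs only two real gemms and no interleaving tricks.
struct SoAComplex{T<:Real}
    re::Matrix{T}
    im::Matrix{T}
end

Base.:*(A::Matrix{T}, Z::SoAComplex{T}) where {T<:Real} =
    SoAComplex(A * Z.re, A * Z.im)
```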
Julia is capable of actual parallel processing.
> Julia is capable of actual parallel processing.
I'm not sure what you mean by answering my comment that way. If the other (Matlab) memory organization is inherently better, at least for some operations, then parallel processing doesn't help. Possibly, if Julia's is slower and you do other things in parallel, you could make up for it. However, matrix multiplication is O(n^3) in flops but only O(n^2) in memory traffic if the data you work on fits in cache (and memory traffic is what matters most, or at least a lot), so parallel processing may actually hurt, unless you mean implicitly working on the same problem in parallel.
In MATLAB, the product `B*A` (with `B` real and `A` complex) times at 114ms. If I compare this to Julia, the same product is much slower. However, if I break the product down into real and imaginary parts, the result is closer to MATLAB. If I instead declare `B::Array{Float64,2}`, the mean time extends to 1.6s, presumably due to type conversion.
If I understand this correctly, it seems that the extra multiply-by-zero operations are still being evaluated by Julia in `B*A` when `B` is real.
Is there a better way to declare that `B` is real, for which a method will give performance gains?
Is it possible to add another method to `*` for the case `(B::Array{Float64,2})*(A::Array{Complex{Float64},2})` that will get the full performance? Presumably it would then be faster than MATLAB...
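Such a method could look something like the following sketch (not an actual Base definition, and redefining `*` for Base types like this is type piracy; it is shown only to illustrate the idea):

```julia
# Split the complex matrix into its real and imaginary parts so that both
# multiplications are real*real and dispatch to BLAS dgemm.
function Base.:*(B::Array{Float64,2}, A::Array{Complex{Float64},2})
    return complex.(B * real(A), B * imag(A))
end
```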
Thanks in advance for any comments.