AshtonSBradley opened this issue 7 years ago
Neither of the two calculations should be doing extra conversions or multiplying by zero. The difference is whether BLAS is used.
We might want to change the default behavior to convert the arguments to the same type. That might also save us some compile time (and therefore test time), because `_generic_matmatmul!` is quite slow to compile.
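A minimal sketch of that idea (a hypothetical helper, not the actual Base code):

```julia
# Promote both arguments to a common element type before multiplying, so the
# product dispatches to a BLAS gemm instead of the generic fallback.
function promoting_mul(A::AbstractMatrix, B::AbstractMatrix)
    T = promote_type(eltype(A), eltype(B))
    return convert(AbstractMatrix{T}, A) * convert(AbstractMatrix{T}, B)
end
```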
Not sure I understand the solution here. Do you mean BLAS vs LAPACK? How do I force one over the other?
We have generic Julia code for matrix multiplication, not just BLAS. (LAPACK leverages BLAS to do more complex things like inversions and decompositions.)
BLAS supports only a small set of types: Float32/Float64 and the Complex versions thereof. I'm not sure whether there is a BLAS function for real*complex, but in your example I understand that the generic Julia code is used.
Nonetheless, using the striding mechanism we might be able to use BLAS calls with no extra allocations, I hope. Seems worthwhile and might be what MATLAB does.
@andreasnoack avoiding extra allocations can be important for large matrices, so temporary conversions might not be optimal.
I think Matlab stores arrays of complex numbers as a struct-of-arrays (i.e., two full-size real-valued arrays), whereas julia (and C) use an array-of-structs (a single array of complex-valued objects). So in Matlab this particular optimization is trivial. (Both strategies have their advantages.)
From a performance perspective, the overhead of converting the type would be trivial for anything except the smallest matrices (O(N^2) for the conversion vs O(N^3) to do the multiplication). If we're worried about memory consumption, we could use a hybrid approach and do the conversion on tiles.
Of course, best would be to get BLAS-level performance from Julia code and skip calling BLAS altogether. For some algorithms we're there, but not for matmul.
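The asymptotics are easy to check empirically (a small sketch; timings are machine-dependent):

```julia
using BenchmarkTools
N = 1024
A = randn(N, N) + im*randn(N, N)
B = randn(N, N)
@btime complex($B);       # the O(N^2) conversion alone
@btime complex($B) * $A;  # conversion plus O(N^3) zgemm; the gemm dominates
```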
> Nonetheless, using the striding mechanism we might be able to use BLAS calls with no extra allocations, I hope. Seems worthwhile and might be what MATLAB does.
I don't think this is the case, because BLAS-3 requires the first stride to be contiguous, and splitting the computation into a sequence of BLAS-2 calls (which allow non-unit strides) will most likely kill performance.
I don't think XBLAS implementations are well maintained, so copying to a complex matrix might be our best option here.
OK, it will be interesting to see how this develops. I don't know what was happening with Matlab when I ran the above tests, but revisiting it carefully, it takes only 55ms for the Matlab code posted above, so Julia seems to be slower than Matlab by a factor of more than 3 for B*A with B real and A complex, in the sense declared above.
Breaking up the calculation into real products and casting the result as complex brings the factor closer to 2.
This is certainly a common/important enough case that we should have an optimized method for it that does whatever we determine to be the most performant thing.
Update on timings for Julia 0.6 for real*complex matmul: Matlab 75ms, Julia 133ms.
Does anyone know what kind of speedup I should expect if I compile against MKL? This http://imgur.com/a/rBOo8 suggests much less than a factor of 2.
If I may make a suggestion, the documentation for `mul!` and its relatives should describe this issue. It is certainly not obvious that the order of the complex and real matrices can have an order-of-magnitude impact on performance. In fact, the library code might also try to correct for this with the transpose of the product of transposes, as suggested by @yanncalec (#30811), but this might have performance costs for large matrices.
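For reference, a sketch of that transpose trick (a hypothetical helper, with the transposes materialized up front; correctness follows from transpose(transpose(A)*transpose(B)) == B*A, and `transpose` rather than `adjoint` avoids unwanted conjugation):

```julia
# Rewrite real*complex as the transpose of complex*real. Whether this pays off
# depends on how fast the complex*real product is in a given Julia version and
# on the cost of materializing the transposed copies for large matrices.
tt_mul(B::Matrix{<:Real}, A::Matrix{<:Complex}) =
    copy(transpose(copy(transpose(A)) * copy(transpose(B))))
```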
It has been a while, and a few things may have improved (also my knowledge) going into 1.0.
I think I can summarize the situation now as follows. Matlab:

```matlab
Nrows = 1024;
Ncols = 1024;
A = rand(Nrows,Ncols) + i*rand(Nrows,Ncols);
B = rand(Nrows,Ncols);
tic;
for j = 1:50
    B*A;
end
toc/50
```

gives 53.6ms. In Julia, if I naively try the same operation I get:
```julia
julia> using BenchmarkTools

julia> Nrows = Ncols = 1024;

julia> A = randn(Nrows,Ncols) + im*randn(Nrows,Ncols);

julia> C = randn(Nrows,Ncols);

julia> @btime C*A;
  711.881 ms (8 allocations: 16.00 MiB)
```
But if I do the type conversion first (since the result must be complex), then
```julia
julia> B = randn(Nrows,Ncols) |> complex;

julia> @btime B*A;
  41.860 ms (2 allocations: 16.00 MiB)
```
(No MKL or anything fancy here.) So I think this is a non-issue once the user knows to convert the real matrix to complex up front. Safe to close this?
Performance confirmed (matrices four times bigger):

```julia
julia> @btime C*A;
  70.609 s (8 allocations: 247.08 MiB)

julia> @btime B*A;
  7.598 s (2 allocations: 247.08 MiB)
```
However, even with this we are still not in a good place. What if the matrix `B` also appears in another product, say `D*B`? If `B` was previously converted to a complex matrix, some performance is sacrificed: that product would have been much faster (about twice as fast on my machine) if `B` were still real.
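If memory allows, the simple way out is to keep both representations around (a sketch, with `D` as in the example above):

```julia
Bc = complex(B)  # one-off O(N^2) copy of B
Bc * A           # complex*complex, fast zgemm
D * B            # B stays real, so this product keeps its speed
```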
I just bumped into this. Any chance of a fix sometime soon?
FYI, if someone overloaded the multiplication to

```julia
function *(A :: Array{Real}, Z::Array{Complex{T}}) where T
    zfl = reinterpret(T, Z)
    zre = @view zfl[1:2:end-1]
    zim = @view zfl[2:2:end]
    return zre*A .+ zim*A.*im
end
```

it would be a speed improvement.
We need a PR if there is performance to be gained.
> ```julia
> function *(A :: Array{Real}, Z::Array{Complex{T}}) where T
>     zfl = reinterpret(T, Z)
>     zre = @view zfl[1:2:end-1]
>     zim = @view zfl[2:2:end]
>     return zre*A .+ zim*A.*im
> end
> ```
Isn't this the product `Z*A` instead of `A*Z`?
Yeah. That said, it can pretty easily be made to do the other.
Does it rely on implementation details? Will `reinterpret` always produce the correct matrices?
It relies on arrays of complex numbers ending up in this form, but I'm pretty sure the memory layout of arrays and complex numbers is basically public behavior, given how many people rely on it.
Yes, we can count on the representation being stable.
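For example, the interleaved layout is easy to see directly:

```julia
Z = [1.0 + 2.0im  3.0 + 4.0im]  # 1×2 complex matrix
reinterpret(Float64, Z)          # 2×2 real matrix [1.0 3.0; 2.0 4.0]:
                                 # re/im parts interleave along dimension 1
```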
Excellent, looks like that might be a solution then.
OK, interesting @oscardssmith. I couldn't get that code to work due to size issues. The following does run, but is much slower. Have I messed this up somehow?
```julia
function azmul(A::Matrix{Float64}, Z::Matrix{Complex{T}}) where T
    zfl = reinterpret(T, Z)
    zre = @view zfl[1:2:end-1, :]
    zim = @view zfl[2:2:end, :]
    return A*zre .+ (A*zim).*im
end

N = 1000
A = randn(N, N)
Z = randn(N, N) + im*randn(N, N)
@time A*Z;          # 0.066214 seconds
@time azmul(A, Z);  # 1.946057 seconds
```
@AshtonSBradley Is it possible that in the expression `A*zim.*im` you create another complex matrix `zim.*im`, which then multiplies a real matrix? Would these parentheses help: `(A*zim).*im`?
@PetrKryslUCSD unfortunately no, that doesn't fix it, but I will add the syntax to the example
How about `Complex.(A*zre, A*zim)`?
no gain there either @andyferris
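One plausible explanation, consistent with the stride discussion earlier in this thread: the every-other-row views have a non-unit first stride, so `A*zre` and `A*zim` cannot be forwarded to BLAS and fall back to the generic kernel. That is easy to check:

```julia
Z = randn(1000, 1000) + im*randn(1000, 1000)
zfl = reinterpret(Float64, Z)    # 2000×1000 real view of the complex data
zre = @view zfl[1:2:end-1, :]
strides(zre)                     # (2, 2000): first stride is 2, not 1,
                                 # so gemm cannot be called on this view
```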
> I think Matlab stores arrays of complex numbers as a struct-of-arrays (i.e., two full-size real-valued arrays), whereas julia (and C) use an array-of-structs (a single array of complex-valued objects). So in Matlab this particular optimization is trivial. (Both strategies have their advantages.)
That's quite interesting. Is Matlab faster because of this (at least in this case, or for all complex math)? If so, should Julia support this layout too? I suppose this could be done in a package, but it's not straightforward, since you then need two arrays, or at least one array type that encapsulates both. Would it be viable to use a regular array where the first half stores the real parts and the rest the imaginary parts? [Then in the degenerate scalar case you get the current complex type, but I'm not sure that's helpful or exploitable.] What does e.g. NumPy do; does it support both ways (plausible, since it supports e.g. Fortran- and C-style array orders)? Does MKL? I think Matlab (and NumPy) use it.
[One advantage of our way is that you can enlarge arrays, appending at the end and I think even at the front, but that's not possible in a shared array except in O(n). With two arrays/not shared it's still possible in O(1), if the array can be enlarged without moving it.]
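A hand-rolled struct-of-arrays sketch of what such a package type might look like (names made up; StructArrays.jl explores this direction for general structs):

```julia
# Store the real and imaginary parts as two separate dense matrices. In this
# layout, real*complex needs only two real gemms and no interleaving tricks.
struct SoAComplex{T<:Real}
    re::Matrix{T}
    im::Matrix{T}
end

Base.:*(A::Matrix{T}, Z::SoAComplex{T}) where {T<:Real} =
    SoAComplex(A * Z.re, A * Z.im)
```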
Julia is capable of actual parallel processing.
> Julia is capable of actual parallel processing.
I'm not sure what you mean by answering my comment that way. If the other (Matlab) memory organization is inherently better, at least for some operations, then parallel processing doesn't help. Possibly, if Julia's is slower and you do other things in parallel, you could make up for it. However, matrix multiplication is O(n^3) in flops but only O(n^2) in memory traffic if the data you work on fits in cache (and memory traffic is what matters most, or at least a lot), so parallel processing may actually hurt, unless you mean implicitly working on the same problem in parallel.
In MATLAB, the product `B*A` (with `B` real and `A` complex) times at 114ms. If I compare this to Julia, the same product is much slower. However, if I break the product down into real and imaginary parts, the result is closer to MATLAB. If I instead declare `B::Array{Float64,2}`, the mean time extends to 1.6s, presumably due to type conversion.
If I understand this correctly, it seems that the extra multiply-by-zero operations are still being evaluated by Julia in `B*A` when `B` is real.
Is there a better way to declare that `B` is real, for which a method will give performance gains?
Is it possible to add another method to `*` for the case `(B::Array{Float64,2})*(A::Array{Complex{Float64},2})` that will get the full performance? Presumably it would then be faster than MATLAB...
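Such a method could look something like the following sketch (not an actual Base definition, and redefining `*` for Base types like this is type piracy; it is shown only to illustrate the idea):

```julia
# Split the complex matrix into its real and imaginary parts so that both
# multiplications are real*real and dispatch to BLAS dgemm.
function Base.:*(B::Array{Float64,2}, A::Array{Complex{Float64},2})
    return complex.(B * real(A), B * imag(A))
end
```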
Thanks in advance for any comments.