JuliaLinearAlgebra / BLASBenchmarksCPU.jl


A way to disable MKL? #73

Closed freemin7 closed 2 years ago

freemin7 commented 2 years ago

Hey, I am running on an ARM system where MKL doesn't exist as such. BLASBenchmarksCPU automatically running MKL-related code at init is not elegant. What would be a good way to handle this?

idevcde commented 2 years ago

Hey, just wondering: does it work the way "Example 2" suggests, or does the problem occur when the package is starting? https://julialinearalgebra.github.io/BLASBenchmarksCPU.jl/stable/usage/ Is it possible to use BLASBenchmarksCPU on ARM?

freemin7 commented 2 years ago

No, the error happens before that: during using, in the package's init code.

ERROR: LoadError: UndefVarError: libmkl_rt not defined
Stacktrace:
 [1] getproperty(x::Module, f::Symbol)
   @ Base ./Base.jl:35
 [2] top-level scope
   @ ~/.julia/packages/BLASBenchmarksCPU/63VfB/src/BLASBenchmarksCPU.jl:48
 [3] include
   @ ./Base.jl:420 [inlined]
 [4] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt64}}, source::Nothing)
   @ Base ./loading.jl:1318
 [5] top-level scope
   @ none:1
 [6] eval
   @ ./boot.jl:373 [inlined]
 [7] eval(x::Expr)
   @ Base.MainInclude ./client.jl:453
 [8] top-level scope
   @ none:1
in expression starting at /lustre/home/guest19/.julia/packages/BLASBenchmarksCPU/63VfB/src/BLASBenchmarksCPU.jl:1
ERROR: Failed to precompile BLASBenchmarksCPU [5fdc822c-4560-4d20-af7e-e5ee461714d5] to /lustre/home/guest19/.julia/compiled/v1.7/BLASBenchmarksCPU/jl_3yOXOv.
Stacktrace:
 [1] error(s::String)
   @ Base ./error.jl:33
 [2] compilecache(pkg::Base.PkgId, path::String, internal_stderr::IO, internal_stdout::IO, ignore_loaded_modules::Bool)
   @ Base ./loading.jl:1466
 [3] compilecache(pkg::Base.PkgId, path::String)
   @ Base ./loading.jl:1410
 [4] _require(pkg::Base.PkgId)
   @ Base ./loading.jl:1120
 [5] require(uuidkey::Base.PkgId)
   @ Base ./loading.jl:1013
 [6] require(into::Module, mod::Symbol)
   @ Base ./loading.jl:997
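For reference, here is a sketch of the kind of platform guard that would avoid the top-level dereference, assuming the failing line at src/BLASBenchmarksCPU.jl:48 takes libmkl_rt from MKL_jll (an assumption on my part; JLL packages expose is_available() for exactly this kind of check):

using MKL_jll  # assumed; whichever module line 48 reads libmkl_rt from

# Only dereference libmkl_rt on platforms where the MKL artifact exists;
# the name libmkl is hypothetical, chosen for this sketch:
if MKL_jll.is_available()
    const libmkl = MKL_jll.libmkl_rt
else
    @info "MKL is not available on this platform; skipping the MKL benchmarks."
end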
chriselrod commented 2 years ago

LoopVectorization will still crash on the A64FX, so Octavian at least won't work.

idevcde commented 2 years ago

Thanks for supporting FUGAKU! :-)

chriselrod commented 2 years ago

I hope to see FUGAKU and more vector-CPU super computers succeed in the future!

The issue that LoopVectorization/Octavian has looks like it'd need some digging to isolate. I hope to replace the current code base, hopefully within the next few months, so I'd rather those issues be resolved or avoided at that point than spend more time on the current approach.

idevcde commented 2 years ago

Great to hear that! And thank you again for making the appropriate changes to BLASBenchmarksCPU.jl so it can be run on Arm! AFAIK, FUGAKU is open for trials throughout the whole year, with project screening per submission. Having the opportunity, can I ask some very basic questions about BLAS libraries?

So far I have mostly used the OpenBLAS shipped natively with Julia. Following your advice on Julia Discourse, I also used MKL.jl with Julia 1.7 (yeah, I know how it sounds; however, it was very good advice). When I used MKL instead of OpenBLAS with some Julia packages, I did not have to change anything in the packages' code. It was as easy as writing "using MKL", and I understand I was then doing calculations with MKL instead of OpenBLAS.

Is it the same with Octavian.jl? Or do I have to rewrite the code of the package in order to use Octavian.jl?

Also, I hear that on Arm the Arm Performance version of BLAS and BLIS in particular have good performance. If I would like to try either of them, how do I make it work with Julia on Arm? Should I build a _jll and pin that _jll in Julia's package mode? Is there any tutorial you are aware of, or any resource you could point me to on this topic?

And the last question: is it correct to assume that Octavian.jl, if/when it works on Arm / vector CPUs, could bring a significant performance increase?

freemin7 commented 2 years ago

I'll address the questions I can answer.

Is it the same with Octavian.jl? Or do I have to rewrite the code of the package in order to use Octavian.jl?

The only public function Octavian exports is matmul!(C, A, B[, α, β, max_threads]) (according to the docs), which means it doesn't change the definition of *(A::AbstractMatrix, B::AbstractMatrix) or similar calls; MKL.jl does. That change would only need to be made once, by one person. Without it, an algorithmic rewrite of your code would be necessary.
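Concretely, using Octavian directly looks like this (a minimal sketch; matmul! is the in-place function named above, while * keeps dispatching to the active BLAS):

using Octavian

A = rand(200, 300); B = rand(300, 100)
C = Matrix{Float64}(undef, 200, 100)
matmul!(C, A, B)   # C = A*B, computed by Octavian's kernels
C ≈ A * B          # true; * still goes through the default BLAS (e.g. OpenBLAS)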

How do I make Arm Performance Version of BLAS work with Julia on Arm?

If it doesn't exist already, you need to write an equivalent to MKL.jl or BLIS.jl.

Should I build _jll and pin this _jll in Julia package mode?

A _jll (which is a reproducible build recipe for the binaries, covering a selection of targets) is not enough on its own. You will need to write a Julia wrapper which provides definitions for *(A::AbstractMatrix, B::AbstractMatrix) and many other calls. Since the API is probably quite similar, you can adapt from MKL.jl, OpenBLAS, or BLIS.jl; see the sketch below.
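For concreteness, here is a sketch of what the core of such a wrapper could look like, assuming an LP64 (32-bit integer) BLAS build; the library name "libarmpl" and the module name are hypothetical, and a real wrapper would cover far more than dgemm:

module ArmPLSketch

const libarmpl = "libarmpl"  # hypothetical; use the actual shared-library name/path

# C .= alpha*A*B .+ beta*C via the Fortran dgemm symbol. An ILP64 build
# would use 64-bit integers in place of Int32 below.
function dgemm!(transA::Char, transB::Char, alpha::Float64,
                A::Matrix{Float64}, B::Matrix{Float64},
                beta::Float64, C::Matrix{Float64})
    m, k = size(A)
    n = size(B, 2)
    lda, ldb, ldc = max(1, stride(A, 2)), max(1, stride(B, 2)), max(1, stride(C, 2))
    ccall((:dgemm_, libarmpl), Cvoid,
          (Ref{UInt8}, Ref{UInt8}, Ref{Int32}, Ref{Int32}, Ref{Int32},
           Ref{Float64}, Ptr{Float64}, Ref{Int32},
           Ptr{Float64}, Ref{Int32},
           Ref{Float64}, Ptr{Float64}, Ref{Int32}),
          transA, transB, m, n, k,
          alpha, A, lda, B, ldb, beta, C, ldc)
    return C
end

# A real wrapper would add the remaining BLAS/LAPACK entry points and hook
# them into LinearAlgebra (*, mul!, ...), or register via libblastrampoline.

end # module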

How do I make BLIS work with Julia on Arm?

Try the package BLIS.jl, see if it works and passes its tests, and if not, make it work. (The details depend on the failure mode.)
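The try-it-and-see workflow is just the standard Pkg one (a sketch, assuming the wrapper is registered under the name BLIS):

using Pkg
Pkg.add("BLIS")    # install the wrapper package
Pkg.test("BLIS")   # run its test suite on this machine to see what, if anything, breaks
using BLIS         # if the tests pass, load it and benchmark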

Is there maybe any tutorial that you are aware of / maybe you could point me to any resource on this topic?

Look at the source code of similar packages, and watch "Developing Julia packages" if you haven't already, although its recommendation to use TravisCI is outdated.

And the last question, is it correct to assume that Octavian.jl if/when working on Arm / vector-CPUs could bring significant performance increase?

In Julia as it is now, probably not, as Julia likes to emit NEON vector instructions, which don't utilize the SVE vector registers. If Julia can be made to emit competitive SVE code then it is likely that Octavian with some tuning is competitive on A64FX.
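One quick way to check which vector ISA Julia emits on a given machine (a sketch; in the generated assembly, NEON code uses the v registers, SVE code the z registers):

julia> using InteractiveUtils

julia> mysum(x) = sum(x)

julia> @code_native mysum(rand(1024))   # on A64FX: look for "v0.2d" (NEON) vs "z" registers (SVE)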

chriselrod commented 2 years ago

On Julia 1.7+, the LinearAlgebra BLAS libraries use libblastrampoline:

julia> LinearAlgebra.BLAS.libblas
"libblastrampoline"

Which is what allows swapping BLAS implementations at runtime.
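For example (a sketch on x86-64, where MKL.jl is usable; any ARM BLAS wrapper would plug into the same mechanism):

julia> using LinearAlgebra

julia> BLAS.get_config()   # lists the backend libblastrampoline currently forwards to

julia> using MKL           # MKL.jl repoints the trampoline via BLAS.lbt_forward

julia> BLAS.get_config()   # now reports MKL instead of OpenBLAS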

For this to work with a library like Octavian, it'd have to provide all the appropriate ccalls.

gemm is the building block of many more complicated BLAS/LAPACK algorithms, so it'd be interesting to see what the performance of LAPACK would be if you swap out OpenBLAS's gemm for Octavian (or even raw LV), which does much better at small sizes.
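A quick way to probe that comparison on a given machine (a sketch using BenchmarkTools; the size n = 48 is an arbitrary "small" choice, and results will vary):

using BenchmarkTools, LinearAlgebra, Octavian

n = 48
A, B = rand(n, n), rand(n, n)
C = similar(A)
@btime mul!($C, $A, $B);     # whatever BLAS libblastrampoline points at
@btime matmul!($C, $A, $B);  # Octavian's LoopVectorization-based kernels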

Longer term, these can be implemented in Julia, but I will be prioritizing rewriting LV before working on LinearAlgebra, as the rewrite will (a) help compile times and (b) make implementing many algorithms much easier.

chriselrod commented 2 years ago

If Julia can be made to emit competitive SVE code then it is likely that Octavian with some tuning is competitive on A64FX.

This requires setting the minimum SVE vector-bits argument, but that seems to be causing crashes.

idevcde commented 2 years ago

Thanks a lot for all the information! As for the Arm Performance version of BLAS, I really don't think it's right for me to write the wrapper. As for BLIS, I see that it probably performs better on Neoverse N1 than the OpenBLAS I used for testing (please see https://github.com/flame/blis/blob/master/docs/Performance.md). I hope to have the opportunity to carry some of these tests further with BLIS in the near future. I think I can say that Julia on Neoverse N1 was already competitive with x86 and GPU using the standard OpenBLAS, or at least that is my current understanding. As for potentially competitive performance on the A64FX, I'm optimistic about doing some tests in the near future.

gemm is the building block of many more complicated BLAS/LAPACK algorithms, so it'd be interesting to see what the performance of LAPACK would be if you swap out OpenBLAS's gemm for Octavian (or even raw LV), which does much better at small sizes.

I do not have precise knowledge of the matrix sizes involved. The tests I was doing are related to AI training. I would be happy to carry out the tests suggested above in the near (or medium/long term) future when my time permits, even if the setup would not be appropriate for the above-mentioned AI problem. As for the technical requirements, I might need some advice then; I have to admit that even after re-reading the posts, I would not know how to do it.