Nemocas / Nemo.jl

Julia bindings for various mathematical libraries (including flint2)
http://nemocas.github.io/Nemo.jl/

Nemo Galois fields decrease performance compared to AbstractAlgebra #1693

Open Ktrompfl opened 4 months ago

Ktrompfl commented 4 months ago

Using Galois fields from Nemo instead of AbstractAlgebra heavily impacts the performance of basic arithmetic operations due to allocations for every single operation:

julia> using AbstractAlgebra, BenchmarkTools, Nemo

Welcome to Nemo version 0.43.1

Nemo comes with absolutely no warranty whatsoever

julia> @btime begin
           K = AbstractAlgebra.GF(2)
           a = one(K)

           for _ = 1:1_000_000
               a *= a
           end
       end

  247.656 μs (2 allocations: 40 bytes)

julia> @btime begin
           K = Nemo.GF(2)
           a = one(K)

           for _ = 1:1_000_000
               a *= a
           end
       end

  66.673 ms (1000006 allocations: 76.29 MiB)

julia> @btime begin
           K = AbstractAlgebra.GF(2)
           a = one(K)

           for _ = 1:1_000_000
               a += a
           end
       end

  78.806 ns (2 allocations: 40 bytes)

julia> @btime begin
           K = Nemo.GF(2)
           a = one(K)

           for _ = 1:1_000_000
               a += a
           end
       end

  68.983 ms (1000006 allocations: 76.29 MiB)

I assume this is a byproduct of calling native libraries, but still very unfortunate.
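
(A quick way to confirm the per-operation allocation in isolation; the $ interpolation is BenchmarkTools syntax to avoid timing global-variable access:)

julia> using BenchmarkTools, Nemo

julia> K = Nemo.GF(2); a = one(K);

julia> @btime $a * $a;   # reports an allocation per product, matching ~80 bytes/op (76.29 MiB / 10^6) above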

thofma commented 4 months ago

Yes and no. Try using Native.GF(2) (once Nemo is loaded).
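
(For anyone following along: Native is a submodule of Nemo, so after loading Nemo the fully qualified form below always works:)

julia> using Nemo

julia> K = Nemo.Native.GF(2)   # an fpField whose elements (fpFieldElem) are immutable and allocation-free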

Ktrompfl commented 4 months ago

Thanks for your response. Apart from being slightly faster on addition and not causing any allocations at all, Native.GF still appears to be an order of magnitude slower on multiplication when compared to AbstractAlgebra.GF:

julia> @btime begin
           K = Native.GF(2)
           a = one(K)

           for _ = 1:1_000_000
               a *= a
           end
       end

  5.737 ms (0 allocations: 0 bytes)

julia> @btime begin
           K = Native.GF(2)
           a = one(K)

           for _ = 1:1_000_000
               a += a
           end
       end

  44.278 ns (0 allocations: 0 bytes)

thofma commented 4 months ago

Hm, yes, it is quite bad, but a bit puzzling. @fieker: The AbstractAlgebra code is not that sophisticated in comparison to Nemo/flint, which goes through n_mulmod_preinv. Is this just the overhead of the function call or is there something else going on?

fieker commented 4 months ago

My guess is two things: Native.GF(2) will outperform AA on tasks taking place in C (in flint), whereas "naive" native Julia code will favour AA by a wide margin. Furthermore, each "+" in flint is a function call, while in AA it is handled directly by the compiler. I guess we can/should implement the Native.GF(p) arithmetic directly ourselves.

fieker commented 4 months ago

I withdraw my comments. fpFieldElem is using preinv and is an immutable Julia type. It is calling into C, however, so the call might be (at least it should be!) much more expensive than checking whether a*b is even or odd. Has anyone done the timings on 30-bit or 60-bit primes as well? The Native code is not at all optimized for small p.

thofma commented 4 months ago

I tried 30 bits and got the same timings. We could in fact just copy the code from AA to Nemo.
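
A rough sketch of what that could look like (hypothetical names, not the actual AbstractAlgebra source): an immutable element type with plain-Julia arithmetic that the compiler can inline, so there is no ccall and no allocation.

struct GFElemSketch
    d::UInt64                 # the residue; real code would keep the modulus in the parent
end

const P = UInt64(1099511627791)   # fixed modulus, just for the sketch

# widen to 128 bits, reduce mod P, narrow back; fully inlined by the compiler
Base.:*(a::GFElemSketch, b::GFElemSketch) =
    GFElemSketch((widemul(a.d, b.d) % P) % UInt64)

# safe while 2P fits in 64 bits, which holds for this 41-bit P
Base.:+(a::GFElemSketch, b::GFElemSketch) =
    (s = a.d + b.d; GFElemSketch(s >= P ? s - P : s))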

thofma commented 3 months ago

@albinahlback since you are interested in flint being fast, do you have an idea what could be going on? How long does

julia> @btime begin
           K = Native.GF(1099511627791)
           a = one(K)

           for _ = 1:1_000_000
               a *= a
           end
       end

  5.737 ms (0 allocations: 0 bytes)

take for you and how long does that take for you with an equivalent C program using nmod? It is clear that the ccall on the julia side is not for free, but I wonder how expensive it is.
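
One way to measure that from the Julia side (a sketch; it assumes Nemo.libflint is the flint library handle Nemo uses and that flint's ulong_extras routines n_preinvert_limb / n_mulmod2_preinv are available):

julia> using BenchmarkTools, Nemo

julia> const p = UInt(1099511627791);

julia> const pinv = ccall((:n_preinvert_limb, Nemo.libflint), UInt, (UInt,), p);

julia> flint_mul(a, b) = ccall((:n_mulmod2_preinv, Nemo.libflint), UInt,
                               (UInt, UInt, UInt, UInt), a, b, p, pinv);

julia> julia_mul(a, b) = (widemul(a, b) % p) % UInt;   # the pure-Julia analogue, fully inlined

julia> @btime flint_mul(a, a) setup = (a = UInt(3));

julia> @btime julia_mul(a, a) setup = (a = UInt(3));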

albinahlback commented 3 months ago

Do you know if AbstractAlgebra-stuff is being inlined or not?

albinahlback commented 3 months ago

If someone would be able to give me the generated assembly for each case, it would be pretty easy to see why the difference is so big.
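
For reference, Julia can print the generated code directly (standard InteractiveUtils tooling; it is already loaded in the REPL):

julia> using AbstractAlgebra, InteractiveUtils

julia> K = AbstractAlgebra.GF(1099511627791); a = one(K);

julia> @code_native debuginfo=:none a * a   # the Nemo.Native version can be inspected the same way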

thofma commented 3 months ago

Hm, I tried to produce the assembly code, but I noticed something strange. It seems that there is something dodgy going on due to problems in the benchmark setup.

@Ktrompfl can you reproduce the following?

function f_with_return(K)
    a = one(K)
    for _ = 1:1_000_000
        a *= a
    end
    return a
end

function f_without_return(K)
    a = one(K)
    for _ = 1:1_000_000
        a *= a
    end
end

K_AA = AbstractAlgebra.GF(1099511627791)
K_Nemo = Nemo.Native.GF(1099511627791)

I then get

julia> t_AA = @belapsed(f_with_return(K_AA)), @belapsed(f_without_return(K_AA))
(0.005825292, 0.000312042)

julia> t_Nemo = @belapsed(f_with_return(K_Nemo)), @belapsed(f_without_return(K_Nemo))
(0.012017084, 0.012015667)

So clearly the compiler is doing something "clever" in the AA case: f_without_return always returns nothing, and since the pure-Julia loop has no side effects, it is dead code and can be removed entirely.

Comparing the with-return timings, Nemo looks like it is 2x slower, which is not "too bad".
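
As a benchmarking-hygiene aside: if the result of a loop is deliberately discarded, Base.donotdelete (available since Julia 1.8) can keep the compiler from removing the work, e.g.:

function f_without_return_kept(K)
    a = one(K)
    for _ = 1:1_000_000
        a *= a
        Base.donotdelete(a)   # opaque "use" of a: blocks dead-code elimination
    end
    return nothing
end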

Ktrompfl commented 2 months ago

With the setup above I get

julia> t_AA = @belapsed(f_with_return(K_AA)), @belapsed(f_without_return(K_AA))
(0.003119424, 0.000221444)

julia> t_Nemo = @belapsed(f_with_return(K_Nemo)), @belapsed(f_without_return(K_Nemo))
(0.005218629, 0.005237672)

which should be fine, whereas with K_Nemo2 = Nemo.GF(1099511627791) I get

julia> t_Nemo2 = @belapsed(f_with_return(K_Nemo2)), @belapsed(f_without_return(K_Nemo2))
(0.05128647, 0.051381936)