bmad-sim / GTPSA.jl

Julia interface to the Generalised Truncated Power Series Algebra (GTPSA) library
https://bmad-sim.github.io/GTPSA.jl/
GNU General Public License v3.0

MutableArithmetics #113

Open nsajko opened 3 months ago

nsajko commented 3 months ago

https://github.com/jump-dev/MutableArithmetics.jl

This (hopefully) will make it very easy to handle temporaries and increase speed

After a while of trying to learn the poorly documented API for this interface, I found it only works for simple arithmetic (+, -, *, /) and no other functions, so I think we will have to write our own temporary management. One idea I have is a macro @usetemps which appends some symbol to each of the operators (similar to @.) and converts the result of the expression to a heap-allocated TPS or ComplexTPS. Those special operators with appended symbols return temporary types which have special behaviors / avoid heap allocation where possible. If a temporary is included in the evaluation, it can be reused or mutated.

Originally posted by @mattsignorelli in https://github.com/bmad-sim/GTPSA.jl/issues/46#issuecomment-1869816519

It's not true that MA "only works for simple arithmetic", MA works for arbitrary functions. If you have questions I'd be happy to help.

mattsignorelli commented 3 months ago

The function I'm looking for from MutableArithmetics is @rewrite, which, based on the discussion here, I didn't think worked for arbitrary functions: https://github.com/jump-dev/MutableArithmetics.jl/issues/253

From a developer in that discussion:

> > Does MutableArithmetics not work for general functions, only for the standard arithmetic operations (+,-,/,*)?
>
> Correct, the @rewrite macro works only for a limited subset of the standard arithmetic operations.

nsajko commented 3 months ago

I usually avoid @rewrite anyway, because it's too high-level. For optimal performance it's necessary to use operate!! etc. directly.

mattsignorelli commented 3 months ago

> I usually avoid @rewrite anyway, because it's too high-level. For optimal performance it's necessary to use operate!! etc. directly.

I see. Yes, the thought was to have a macro users could prepend to expressions containing the mutable struct, so that a pre-allocated buffer is used for the temporaries generated, while keeping the syntax simple and transparent to other types. The current @FastGTPSA macro I implemented does this, albeit with some weird tricks.
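A minimal sketch of the operator-rewriting idea described above (the macro name @fasttemps, the __fast_ prefix, and all other names here are invented for illustration and are not GTPSA.jl's actual implementation): walk the expression tree and swap each called function for a renamed counterpart, much like @. swaps calls for their broadcasted forms. The renamed methods can then dispatch to buffer-aware code for TPS arguments.

```julia
# Rewrite every call `f(args...)` in an expression to `__fast_f(args...)`,
# recursing into nested calls (similar in spirit to how `@.` works).
function rewrite_calls(ex)
    ex isa Expr || return ex
    args = map(rewrite_calls, ex.args)
    if ex.head == :call && args[1] isa Symbol
        args[1] = Symbol(:__fast_, args[1])
    end
    return Expr(ex.head, args...)
end

macro fasttemps(ex)
    esc(rewrite_calls(ex))
end

# The rewritten operators fall back to the ordinary operators for plain
# types; TPS-specific methods drawing from a temporary buffer would be
# added alongside these fallbacks.
var"__fast_+"(args...) = +(args...)
var"__fast_*"(args...) = *(args...)
```

For ordinary numbers `@fasttemps 1 + 2*3` behaves exactly like `1 + 2*3`, so the macro stays transparent to non-TPS types.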

At least at the time, I couldn't see any major benefits to satisfying the MutableArithmetics interface other than for @rewrite, and had difficulty understanding how to do so. What other benefits would satisfying the interface offer?

nsajko commented 3 months ago

Looking at the @FastGTPSA docs you linked (thanks!), it seems like it relies on a global TPS vector for avoiding allocations? I guess that makes thread-safety impossible or at least expensive, which seems like a pretty big drawback?

Another thing that seems like a drawback of the @FastGTPSA approach is that, as far as I understand, it can only work for hardcoded TPS types, because a global TPS vector is required for each TPS type supported by @FastGTPSA, right?

> At least at the time, I couldn't see any major benefits to satisfying the MutableArithmetics interface other than for @rewrite, and had difficulty understanding how to do so. What other benefits would satisfying the interface offer?

It'd allow users to, e.g., move allocations outside of loops or outright eliminate them in a generic and thread-safe manner, but at the cost of uglier and less convenient code.

Take this loop as an example:

```julia
for x ∈ vec
    s += f(x)
end
```

Assuming f needs to allocate its return value, and + perhaps also allocates, it might make sense to implement operate! for +, and operate_to! for f. Then these allocations could be eliminated in user code like so:

```julia
y = ... # preallocate
for x ∈ vec
    y = MA.operate_to!!(y, f, x)
    s = MA.operate!!(+, s, y)
end
```

Furthermore, if f allocates more than just its return value, it'd make sense to implement buffer_for and buffered_operate_to!, which would be used like this to move allocations from f to outside the loop:

```julia
y = ... # preallocate
buf = MA.buffer_for(f, eltype(vec))
for x ∈ vec
    y = MA.buffered_operate_to!!(buf, y, f, x)
    s = MA.operate!!(+, s, y)
end
```

Other MA functions to implement would be mutability and mutable_copy.

The purpose of using the !! functions instead of ! functions is making the code more generic: the !! functions are correct even when the MA API is not implemented.
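A self-contained sketch of the `!!` convention described above (the names `supports_inplace`, `operate2!!`, `operate2!`, and `Acc` are invented stand-ins, not MA's real internals): the `!!` entry point mutates its first argument only when the type has opted in, and otherwise falls back to the ordinary allocating operation, so generic code remains correct either way.

```julia
# Opt-in trait: by default a type does not support in-place operations.
supports_inplace(::Type) = false

# The `!!` variant: mutate when allowed, otherwise fall back to `op`.
operate2!!(op, a, b) = supports_inplace(typeof(a)) ? operate2!(op, a, b) : op(a, b)

# A toy mutable accumulator that opts in to the mutating path.
mutable struct Acc
    v::Float64
end
supports_inplace(::Type{Acc}) = true
operate2!(::typeof(+), a::Acc, b::Acc) = (a.v += b.v; a)
```

Calling `operate2!!(+, 1.0, 2.0)` simply returns `3.0` via the fallback, while `operate2!!(+, Acc(1.0), Acc(2.0))` updates the first argument in place and returns it.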

mattsignorelli commented 3 months ago

> Looking at the @FastGTPSA docs you linked (thanks!), it seems like it relies on a global TPS vector for avoiding allocations? I guess that makes thread-safety impossible or at least expensive, which seems like a pretty big drawback?

You are correct: with this approach internally, the macro in its current state is definitely not thread-safe. The current plan is to switch to a thread-safe memory pool of TPSs from which the temporaries can be drawn. This would make the macro thread-safe, but it is not implemented yet.
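One way such a pool could look (a hypothetical sketch with invented names, not the implementation GTPSA.jl later adopted): keep one stack of temporaries per thread, so acquiring and releasing a temporary never contends across threads and needs no lock.

```julia
# A per-thread pool of reusable temporaries of type T.
struct TempPool{T}
    stacks::Vector{Vector{T}}   # one free-list per thread
    make::Function              # allocates a fresh temporary when a stack is empty
end

TempPool{T}(make) where {T} =
    TempPool{T}([T[] for _ in 1:Threads.nthreads()], make)

# Take a temporary from the current thread's stack, allocating if empty.
function acquire!(pool::TempPool{T})::T where {T}
    stack = pool.stacks[Threads.threadid()]
    isempty(stack) ? pool.make() : pop!(stack)
end

# Return a temporary to the current thread's stack for reuse.
release!(pool::TempPool{T}, t::T) where {T} =
    push!(pool.stacks[Threads.threadid()], t)
```

In this sketch a temporary acquired on one thread must be released on the same thread; a shared, locked free-list would relax that restriction at the cost of contention.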

> Another thing that seems like a drawback of the @FastGTPSA approach is that, as far as I understand, it can only work for hardcoded TPS types, because a global TPS vector is required for each TPS type supported by @FastGTPSA, right?

Kind of. Right now there are only 2 types of TPSs (TPS and ComplexTPS, both representing Float64s), both of which are supported by @FastGTPSA and have corresponding pre-allocated buffers. The macro basically changes the expression to use operators/functions that do special things when encountering a TPS or ComplexTPS, or default to the regular operator/function if any other type is encountered.
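The dispatch trick described above can be sketched as follows (names are invented for illustration): the rewritten operator gets a special method for TPS arguments and a catch-all method that simply forwards to the ordinary operator, so multiple dispatch keeps the macro transparent to every other type.

```julia
# Stand-in for GTPSA's TPS type, just for this sketch.
struct FakeTPS
    v::Float64
end

# Catch-all: any non-TPS arguments use the regular operator.
__fast_add(a, b) = a + b

# Special method for TPS arguments; a real implementation would mutate
# a temporary drawn from the pre-allocated buffer instead of allocating.
__fast_add(a::FakeTPS, b::FakeTPS) = FakeTPS(a.v + b.v)
```

`__fast_add(1, 2)` behaves exactly like `1 + 2`, while TPS arguments hit the specialized method.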

I do feel that in the long term, MutableArithmetics would be a good interface to implement here. However I would really like to see some kind of equivalent @FastGTPSA in MA that draws from a pre-allocated, thread-safe buffer of the mutable structs for expression evaluation. Take a look at some of the expressions in benchmark/track.jl. By simply prepending the macro to the expressions, the code stays very simple to read, is type-generic, and still gains all of the benefits of reusing temporaries. I fear rewriting these complicated expressions using operate!! and operate_to!! would make the code much less readable, and thus harder to maintain/debug.

nsajko commented 3 months ago

A thread-safe memory pool will have its costs, but it seems like a reasonable solution. Implementing the MA interfaces would enable users to squeeze out every last ounce of performance, when necessary, using (hopefully) familiar APIs like MA.operate!!.

I guess it's true that @rewrite would be more useful if it supported more than just basic arithmetic, however I probably won't be tackling that issue as I'm not good with macros, and in fact I try to stay away from them.

mattsignorelli commented 3 months ago

For now, I think I will wait until MA has better documentation and an equivalent @FastGTPSA/better @rewrite to implement the interface. Many months ago, when I was making this decision, I read the README for MA and the paper it refers to and still had a lot of trouble understanding how to properly implement the interface. For this reason I fear MA.operate!! is not very familiar to outside users. I also feel that having to rewrite simple expressions using operate!! would only make the code less readable, and over-complicate things that are presently simple using @FastGTPSA (especially once the thread-safe pool is implemented).

I'm also not the best at writing macros; @FastGTPSA was a major pain...

mattsignorelli commented 1 month ago

@nsajko can you point me to any documentation, or describe the actual steps to implement MutableArithmetics? In fact, if you have the time and are familiar with it, a PR to MutableArithmetics updating the documentation with such a guide would be valuable.

nsajko commented 1 month ago

Does this example MA implementation help: kalmarek/Arblib.jl#178?

mattsignorelli commented 1 month ago

> Does this example MA implementation help: kalmarek/Arblib.jl#178?

Thanks!

I thought about this a lot, but decided it seems best to stick with and improve my current implementation:

I have made @FastGTPSA thread-safe, along with some significant improvements (including no more weird exported symbols, and it can now act on a block of code). E.g.:

```julia
julia> d = Descriptor(3, 7); x = vars(d);

julia> y = rand(3);

julia> @btime @FastGTPSA begin   # 2 allocations per TPS, no temporaries generated:
               t1 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
               t2 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
               z  = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im;
              end;
  11.251 μs (4 allocations: 3.88 KiB)
```

```julia
julia> t3 = ComplexTPS64(); t4 = ComplexTPS64(); @gensym w; # pre-allocated TPSs and some scalar number

julia> @btime @FastGTPSA! begin # zero allocations
               $t3 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
               $t4 = $x[1]^3*sin($x[2])/log(2+$x[3])-exp($x[1]*$x[2])*im;
               $w  = $y[1]^3*sin($y[2])/log(2+$y[3])-exp($y[1]*$y[2])*im;
              end;
  10.930 μs (0 allocations: 0 bytes)
```

This basically does everything I could do with MA (and more, actually, since @FastGTPSA works for any general function and not just the arithmetic operators), so at this point I don't see much benefit to satisfying the interface anymore.

On that note, I think my solution for @FastGTPSA might be of interest to others, since it supports all functions/operators, and by taking advantage of multiple dispatch it really is transparent to all non-TPS types while keeping the syntax completely simple.