atcollab / at

Accelerator Toolbox
Apache License 2.0

Add momap multirefpts #773

Closed oscarxblanco closed 3 months ago

oscarxblanco commented 4 months ago

This is a work-in-progress proposal to calculate the momentum aperture of a ring, where particles from all reference points are tracked at the same time while the energy offset step is reduced on every iteration.

It is expected to be faster due to the possibility of tracking all particles at once, and the reduced number of iterations on the stable energy boundary search.

It differs from the acceptance method, where every refpts is tracked through a separate call to the tracking functions.

swhite2401 commented 4 months ago

Hello @oscarxblanco, very interesting development; this was in fact on my todo list, to allow LT calculation with the GPU code from @JeanLucPons.

To be noted: in his improved implementation of at.c, @JeanLucPons has introduced tracking from a starting refpts, but since this is not yet distributed, your solution is quite nice.

First comment: could you separate the functionalities? In particular, the initial tracking that moves all particles to a single refpts is extremely useful. If it could be used separately, we could use its output to feed the functions already available in acceptance.

oscarxblanco commented 4 months ago

Citing the work done by @ZeusMarti here #92.

JeanLucPons commented 4 months ago

Hello,

As said by Simon, having a way to have all points "projected" (tracked) up to the end of the turn is a nice idea, and it may be integrated in the C code.

More details on the GPU code: starting refpts are handled by doing modulos on particle indices at the tracking level; the overhead is clearly not visible. However, to get the best GPU performance there are some constraints on filling GPU core usage that you don't have on CPUs. There is no re-arrangement of particles when some of them are lost, and an extra modulo is needed to get back the refpts of the elements where particles are lost (if needed). The re-arrangement of strides is a tricky task not yet done, and it can bring a benefit only when the number of strides exceeds the number of GPU cores (I mean actual GPU cores, not CUDA cores). At the moment, a GPU stride is automatically freed when all its particles are lost.

                      "    tracking_starts: numpy array of indices of elements where tracking should start.\n"
                      "       len(tracking_start) must divide the number of particle and it gives the stride size.\n"
                      "       The i-th particle of rin starts at elem tracking_start[i/stride].\n"
                      "       The behavior is similar to lattice.rotate(tracking_starts[i/stride]).\n"
                      "       The stride size should be multiple of 64 for best performance.\n"
oscarxblanco commented 4 months ago

Dear @JeanLucPons, in my opinion handling particle losses is a must in the GPU version as well.

I worked yesterday on the tracking of particles from multiple points to a single reference point. It seems to work, but I realized that there are two cases: 1) the information about the starting reference point needs to be preserved; 2) the information about the starting reference point may be discarded. I've been thinking about how to cover both in a simple way; at the moment I would need to create a flag called 'group' to reduce case 1) to case 2).

I will push a new version later today, once I am satisfied with the tests.

The version I'm writing works in python. As you said, it could be implemented directly in C, but right now I don't see the advantage for a CPU, because it is still a process where you need to iterate over the reference points, either in C or in python. I might be missing some memory allocation control where C could be an advantage.

swhite2401 commented 4 months ago

@oscarxblanco moving all the particles to a single point allows tracking them all together in a single ring.track() call; no iteration is needed after that. This is the ideal case for parallelization, whether it is GPU or CPU, python or C.

A simple solution concerning your 2 cases is to track the same number of particles per refpts; it is then easy to split the results at the end of the tracking.
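
A minimal sketch of that splitting (the shapes and the grouping order are assumptions for illustration, not the actual acceptance code):

import numpy as np

nparticles, nrefpts = 5, 3
# stand-in for the tracked output, with the particles of each refpts stored contiguously
rout = np.arange(6 * nparticles * nrefpts, dtype=float).reshape(6, nparticles * nrefpts)
# split back into one block of nparticles per reference point
per_refpts = rout.reshape(6, nrefpts, nparticles)
print(per_refpts[:, 0, :].shape)  # (6, nparticles): the particles that started at the first refpts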

oscarxblanco commented 4 months ago

@oscarxblanco moving all the particles to a single point allows tracking them all together in a single ring.track() call; no iteration is needed after that. This is the ideal case for parallelization, whether it is GPU or CPU, python or C.

I agree; all the comments I made to @JeanLucPons above refer to tracking multiple reference points to a single reference point in a function I called projectrefpts, which is still being debugged.

A simple solution concerning your 2 cases is to track the same number of particles per refpts; it is then easy to split the results at the end of the tracking.

Right, I think the splitting or grouping should be provided inside the function because it prevents user mistakes during the reshaping.

JeanLucPons commented 4 months ago

It would be nice to have the same interface for both the CPU and GPU tracking functions. If you have the startrefpts vector, then you can call projectrefpts() and then track as before. Then I can also remove the modulo in the GPU tracking code. Adding the limitation that Simon suggests is nice because it also helps cover the GPU constraints. It will just return a survived-turn map with a +/- 1 turn offset (I don't think it is a big issue), and the lost_at_element indices stay the same as in the input ring, which is intuitive. This is my humble opinion, but I may have missed something.

oscarxblanco commented 4 months ago

It would be nice to have the same interface for both the CPU and GPU tracking functions.

Dear @JeanLucPons, do you mean the whole momentum aperture function, or the projectrefpts function?

If you have the startrefpts vector, then you can call projectrefpts() and then track as before. Then I can also remove the modulo in the GPU tracking code.

It seems you refer to projectrefpts. By vector, do you mean enforcing a 1D numpy array type? I just tested, and it works with either a list or a 1D numpy array. Should I change something?

Adding the limitation that Simon suggests is nice because it also helps cover the GPU constraints.

I think you again refer to having a separate function to track multiple reference points to an end point, right? In that case I agree on the advantages, although I see no limitation; it is implemented in projectrefpts. Or perhaps you referred to @swhite2401's proposal to limit the function to tracking the same number of particles per reference point? If that is the case, it is already implemented.

It will just return a survived-turn map with a +/- 1 turn offset (I don't think it is a big issue), and the lost_at_element indices stay the same as in the input ring, which is intuitive. This is my humble opinion, but I may have missed something.

Thank you @JeanLucPons. So if I understand correctly, your request is already implemented! You could try it like this:

refpts = ring.uint32_refpts(at.All)
outparts, lostmask = at.momap_alternative.projectrefpts(ring, refpts, group=True)
aliveparts = outparts[:,~lostmask,0,0]
oscarxblanco commented 4 months ago

Dear @swhite2401, I'm having trouble with python 3.7 and flake8. The code inside momap_alternative.py was passed through black, and it runs well on my pc with python 3.11.2. From the test output I think flake8 version 5 was installed. I downgraded flake8, but I cannot reproduce the error reported above.

Do you have any suggestion on what should I test ?

swhite2401 commented 4 months ago

No idea... @lfarv, do you know what could be the issue?

lfarv commented 4 months ago

@oscarxblanco : the problem seems to be with python 3.7, and I get the following errors:

"Python version 3.7 does not support equality signs in f-strings" in momap_alternative_py, lines 162 and 226

Can you try after correcting that?
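
For reference, the `=` specifier in f-strings only exists from Python 3.8 onwards; a 3.7-compatible replacement is simply to spell out the name. The snippet below is illustrative, not the actual lines of momap_alternative.py:

et = 0.01
# print(f"{et=}")   # valid only in Python >= 3.8
print(f"et={et}")   # equivalent output, works in Python 3.7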

oscarxblanco commented 4 months ago

Great, that solved the building issue. Thank you @lfarv

lfarv commented 4 months ago

@oscarxblanco: By the way, in your new functions you declare "ring" as a list. It should rather be declared as Lattice, since you are using many Lattice methods (ring.uint32_refpts, ring.radiation_parameters, ring.find_orbit…).
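
A minimal sketch of the suggested change; the signature below is illustrative, only the annotation of ring matters here:

from at import Lattice
import numpy

# ring annotated as Lattice instead of list, matching the methods called on it
# (ring.uint32_refpts, ring.radiation_parameters, ring.find_orbit, ...)
def momaperture_project2start(ring: Lattice, **kwargs) -> numpy.ndarray:
    ...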

oscarxblanco commented 4 months ago

@lfarv , noted. It will be solved in the next push.

JeanLucPons commented 4 months ago

It would be nice to have the same interface for both the CPU and GPU tracking functions.

Dear @JeanLucPons , do you mean the whole momentum aperture function? or the projectrefpts function ?

Dear @oscarxblanco, yes, I am speaking about the projectrefpts function. IMHO it would be nice to have this feature available directly in the Lattice.track function, allowing parallel tracking of particles starting at different refpts independently of the type of calculation you want to make. If startrefpts is not set, nothing changes.

oscarxblanco commented 4 months ago

Dear @JeanLucPons, in ring.track the particle input is (6,N) for N particles, while in projectrefpts it is (6,N,R,1), to project N particles per reference point from R reference points. These shapes are incompatible.

I could simplify projectrefpts even further to track only one particle from every reference point, so that the input would be (6,R); is that ok? If I understood correctly, you want this code to be inside lattice_track, right?

I think that at this point projectrefpts is no longer needed for the momentum aperture, but it could be put in track.py, do you agree? This would provide multi-particle, multi-reference-point tracking to an end point of the lattice.
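
For illustration only, one way to bridge the two shape conventions would be a reshape of the projectrefpts-style coordinates into the (6,N) layout expected by ring.track; the ordering chosen below is an assumption:

import numpy as np

nparticles, nrefpts = 4, 10
zout = np.zeros((6, nparticles, nrefpts, 1))   # projectrefpts-style coordinates
rin = zout.reshape(6, nparticles * nrefpts)    # (6, N) layout expected by ring.track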

oscarxblanco commented 4 months ago

Dear @JeanLucPons, a second possibility would be to have (6,NxR) particles and do the modulo arithmetic when the 'startrefpts' argument is passed to lattice_track.

If you want to make CPU and GPU compatible, what is more convenient inside lattice_track: (6,R) and one particle per reference point, or (6,NxR) and doing modulos?

JeanLucPons commented 4 months ago

Yes, I think it would be a good idea to move projectrefpts to track.py. I don't think that modulos are needed (I do this at the moment, but your solution is simpler).

The idea is:

Tracking using one pre-projectrefpts call and one post-projectrefpts call:

startingElem = np.array([0,100,200,300,400,500,600,700],dtype=np.uint32)
rout, *_ = ring.track(rin, nturns=nturns, startrefpts=startingElem)

Limitations:

Standard tracking (all particles start at the first element):

rout, *_ = ring.track(rin, nturns=nturns)

Potential issue: when several observation points are given, there may exist "holes" that should be filled, but how?

swhite2401 commented 3 months ago

Hello @oscarxblanco , @JeanLucPons ,

Maybe we need to take a little step back. The PR seems to introduce 2 new features, please correct me if I misunderstood something:

- a dichotomic search of the chaotic boundary. This method is faster but cannot profit from parallelization, and it is known to overestimate the aperture, which is fine for me but has to be clearly documented.
- for the lifetime calculation, it proposes to pre-track particles to project them at a single observation point. This is interesting to allow for LT calculation with a single call to the C tracking engine, and it simplifies CPU/GPU calculations. This is nevertheless only useful for lifetime calculation, at least I cannot think of other applications...

This leads to several questions:

- Should we integrate the dichotomic boundary search in the existing aperture module as another grid method?
- Should we integrate something that is only used for LT calculations in the general tracking function?

For info, I had in mind to completely refactor the aperture sub-package to optimize it for GPU. This includes the multi-refpts mode, but also multi-grid for a single refpts (tracking the X-Y and X-DP DA in one shot, for example). This is a significant development and unfortunately I have not had time to start it for the moment...

My proposal would then be to keep this PR independent, in a separate module for the moment as initially proposed, so that it can be merged quickly, and then at a later stage fully integrate it into the existing modules when we refactor the aperture sub-package.

What do you think?

oscarxblanco commented 3 months ago

@swhite2401 ok, understood. I will move back to the state before modifying the tracking functions.

Apart from that, I have a comment:

- a dichotomic search of the chaotic boundary. This method is faster but cannot profit from parallelization, and it is known to overestimate the aperture, which is fine for me but has to be clearly documented.

It is parallelized in this PR:

1. All particles start with a test energy et, which is a guessed value equal to half the bucket height.
2. Then, every reference point is tracked to the end of the ring.
3. Particles lost up to this point are removed from the tracking.
4. Alive particles are filtered: they pass the filter if their coordinates differ by more than $\epsilon$ in the 6D space.
5. Filtered particles are grouped and tracked together in parallel.
6. Depending on the survival or loss of the particles at the test energy et, the stable energy es and the unstable energy eu at every reference point are redefined.
7. et is redefined as the mean between the stable and unstable energies.
8. We loop from 2). The loop stops when the difference is smaller than some tolerance (a numeric sketch of this loop is shown after the list).
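
A minimal numeric sketch of this bisection loop, with a toy stand-in for the tracking step; all names and values below are hypothetical, not the PR code:

import numpy as np

def survives(et, dp_limit):
    # toy replacement for steps 2-5: in the real code this is the result of
    # projecting the particles to the end of the ring and tracking them in parallel
    return et < dp_limit

dp_limit = np.array([0.030, 0.025, 0.028])  # hypothetical per-refpts momentum apertures
es = np.zeros(3)                            # stable energies found so far
eu = np.full(3, 0.06)                       # unstable bound, e.g. a bucket-height guess
dptol = 1e-4

while np.max(eu - es) > dptol:
    et = 0.5 * (es + eu)                    # step 7: test energy between the bounds
    alive = survives(et, dp_limit)          # steps 2-5
    es = np.where(alive, et, es)            # step 6: survivors raise the stable bound
    eu = np.where(alive, eu, et)            #         lost particles lower the unstable bound

print(es)                                   # approaches the per-refpts momentum aperture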

You are right about the overestimation of the algorithm.

swhite2401 commented 3 months ago

Sure, but since you need to iterate, it can't be fully parallelized, as opposed to the case where you just send a vector of particles that you know contains the boundary, which I think will in the end always be more efficient, provided you have a sufficient number of CPUs/GPUs.

oscarxblanco commented 3 months ago

Dear @swhite2401 and @JeanLucPons, this is ready for review; who should be doing it? On my side, I have to note that the matlab and python functions have different names: in AT matlab it is MomAperture_Project2Start, in pyat it is momaperture_project2start.

Let me know if you prefer to change the names.

Here is a piece of code that could be used to run the test.

"""
This script tests the bipartition algorithm developed by Z. Marti
2024may08 oblanco
"""

import at
import numpy
import time

print(f'at.__version__ = {at.__version__}')

refpts = numpy.array(range(100))
ring = at.load_mat('THERING2024may16.mat')

nturns = 1000

spos = ring.get_s_pos(refpts)

oo, orbit = ring.find_orbit(refpts)

start_time = time.time()
dnp = at.momap_alternative.momaperture_project2start(
    ring,
    refpts=refpts,
    nturns=nturns,
    dptol=1e-3,
    use_mp=True,
    verbose=True,
)
print("--- %s seconds ---" % (time.time() - start_time))
boundary = dnp
oscarxblanco commented 3 months ago

And here is the matlab test code. I add one to the positions to skip the Ring Parameters element, in order to calculate the momentum aperture at the same reference points in both matlab and python.

%%
positions = [1:100]+1;
Nturns = 1000;
tic
[ETN, ETP]=MomAperture_Project2Start(THERING,positions,Nturns,1e-3,[],[1e-5 1e-5],true);
toc
swhite2401 commented 3 months ago

Ok thanks @oscarxblanco! I can review it, is this urgent?

JeanLucPons commented 3 months ago

Hello @oscarxblanco , @swhite2401,

-Should we integrate something that is only used for LT calculations in the general tracking function?

What do you think?

For me the main objective is to have a compatible low-level tracking function for GPU/CPU and Matlab/Python. Having the projectrefpts-related stuff inside the acceptance module is not an issue for me, but I need to know, in order to remove the modulos in the GPU code.

oscarxblanco commented 3 months ago

Ok thanks @oscarxblanco! I can review it, is this urgent?

No, it could be low priority.

oscarxblanco commented 3 months ago

For completeness, I leave here a plot comparing the momentum aperture results on the positive energy side at 100 points of a ring.

The yellow and blue colors show the dead and alive particle zones after 1000 turns when tracking a swarm of particles with a small energy step, covering the whole energy range at each reference point. The light blue curve is obtained with the AT matlab MomAperture_allRing, modified to take the closed orbit into account. The red curve uses the new function, called for the moment MomAperture_Project2Start. As you can see, different choices of the energy step produce different results that could change the Touschek scattering results by some amount, typically a small one.

Here below is a figure showing a small difference between momaperture_project2start in pyat and MomAperture_Project2Start in AT matlab. The small differences between the two results seem to come from the low-level libraries used by python and matlab; see for example discussion #772. This does not pose a problem, because this algorithm is meant to give a faster but less precise result compared with other algorithms using a finer energy step.

oscarxblanco commented 3 months ago

Dear @swhite2401, could you check the matlab version? It has been modified to track the positive and negative energy thresholds in the same iteration, and the $\epsilon$ is now an option. Here is an example, with $\epsilon=10^{-9}$. NOTE: I have not yet used any parallelization function in matlab. I believe this depends on a toolbox installation and on having the license to use it. Should I add any test for it?

tic
[ETN, ETP] = MomAperture_Project2Start(THERING, ...
        'refpts' , positions, ...
        'nturns', Nturns, ...
        'detole', 1e-3, ...
        'euguess', [], ...
        'troffset', [1e-5 1e-5], ...
        'verbose', true, ...
        'epsilon6D', 1e-9);
toc
oscarxblanco commented 3 months ago

Dear @swhite2401, I have modified the python and matlab versions of the bipartition algorithm to find the momentum aperture, tracking both positive and negative test energies for multiple turns in a single call at every energy step.

The criterion to track only particles differing by more than some $\epsilon$ is now an option, and not the default behavior.

Wrt the parallelization in matlab, it is not implemented. It seems to require the Parallel Computing Toolbox, and at least a test with `tf = canUseParallelPool()`. Should I add it?

Wrt the parallelization in python, all messages about it in the doc have been removed. The calculation is done serially, but one can still set use_mp=True through the keyword arguments that are passed to at.lattice_track.

I leave here a call example in python and in matlab. The matlab arguments are no longer positional.

start_time = time.time()
dnp = at.momap_alternative.momaperture_project2start(
    ring,
    refpts=refpts,
    nturns=nturns,
    dptol=1e-3,
    use_mp=True,
    epsilon6d=1e-9,
    verbose=True,
)
print("--- %s seconds ---" % (time.time() - start_time))
tic
[ETN, ETP]=MomAperture_Project2Start(THERING, ...
        'refpts' , positions, ...
        'nturns', Nturns, ...
        'dptol', 1e-3, ...
        'euguess',[], ...
        'troffset', [1e-5 1e-5], ...
        'verbose', true,...
        'epsilon6D', 1e-9);
toc
lfarv commented 3 months ago

@oscarxblanco: Sorry to intervene again… Still 2 details: I get 2 warnings:

PEP 8: E302 expected 2 blank lines, found 1: line 18
Local variable 'zin' value is not used: line 245

oscarxblanco commented 3 months ago

@oscarxblanco: Sorry to intervene again… Still 2 details: I get 2 warnings:

PEP 8: E302 expected 2 blank lines, found 1: line 18
Local variable 'zin' value is not used: line 245

Dear @lfarv, all comments are welcome. I still don't know why flake8 missed them. I made the changes, even though I don't see those two warnings myself.

@@ -15,6 +15,7 @@ __all__ = ["momaperture_project2start", "projectrefpts"]
 #           Based on the MATLAB implementation by Z.Marti at ALBA
 #           See https://github.com/atcollab/at/pull/773

+
 def momaperture_project2start(ring: Lattice, **kwargs: Dict[str, any]) -> numpy.ndarray:
     """
     :py:func:`momap_project2start` calculates the local momemtum aperture.
@@ -242,7 +243,6 @@ def projectrefpts(
     if nparticles == 1:
         use_mp = False

-    zin = numpy.zeros((6, nparticles, nrps, 1))
     if groupparts:
oscarxblanco commented 3 months ago

Dear @lfarv, the docstring in python and matlab has been modified to say momentum, not energy. Also, the parameter euguess is now dpuguess.

See the last commit.

swhite2401 commented 3 months ago

All good for me