parallelize get_tunes_harmonic

swhite2401 commented 8 months ago

A parallelized version of get_tunes_harmonic based on python multiprocessing is proposed.

To be noted: it is important to load the processors to see some some speed-up. In case the workload per processor is too light you may adjust it with the pool_size argument (number of threads)

swhite2401 commented 8 months ago

@JeanLucPons , the output shape should be correct now, with pool_size=4 I get a speed-up of a factor 2 for the test file you gave me.

JeanLucPons commented 8 months ago

@swhite2401 I did a quick performance check and the results are a bit strange it seems that there is a ~ 1.5 sec overhead per get_tunes_harmonic() call which seems a bit high and when using pool_size=2 the gain is more than expected. Considering the overhead i would expect ~8sec for pool_size=2. I will check the results on a FMA. Thx

swhite2401 commented 8 months ago

@JeanLucPons , I confirm, I get exactly the same behavior... this is very strange. I will take a closer look at it.

swhite2401 commented 8 months ago

Update: it seems that the map used in multiprocess is much more efficient to iterate through the particles than the for loop used in the serial code. I will implement a map in the serial as well and see if we can get consistent results

swhite2401 commented 8 months ago

I have restructured the code and could not improve the situation. However, I saw that the serial calculation is 2x faster in nice than in grappa... this is strange but give consistent result with a factor 2 speed-up with 2 processes. The speed-up, for the example you gave me, saturates around 5-6 processes on grappa

oscarxblanco commented 8 months ago

I have made a test for a particular lattice, getting the horizontal tune from tracking with varying horizontal offsets. The values are the same in the serial or parallel version. All good on my side. test_parallel_get_tunes_harmonic.zip

I attach the test script and lattice in case they are of any use. Here is the output,

exec(open('test_parallel_get_tunes_harmonic.py').read())
loading and setting up ring
setup tracking
track
array_has_nan=False
test serial analysis
0 1 <class 'numpy.ndarray'> (1,)
1 1 <class 'numpy.ndarray'> (1,)
2 1 <class 'numpy.ndarray'> (1,)
3 1 <class 'numpy.ndarray'> (1,)
4 1 <class 'numpy.ndarray'> (1,)
5 1 <class 'numpy.ndarray'> (1,)
6 1 <class 'numpy.ndarray'> (1,)
7 1 <class 'numpy.ndarray'> (1,)
8 1 <class 'numpy.ndarray'> (1,)
9 1 <class 'numpy.ndarray'> (1,)
10 1 <class 'numpy.ndarray'> (1,)
11 1 <class 'numpy.ndarray'> (1,)
12 1 <class 'numpy.ndarray'> (1,)
13 1 <class 'numpy.ndarray'> (1,)
14 1 <class 'numpy.ndarray'> (1,)
15 1 <class 'numpy.ndarray'> (1,)
16 1 <class 'numpy.ndarray'> (1,)
17 1 <class 'numpy.ndarray'> (1,)
18 1 <class 'numpy.ndarray'> (1,)
19 1 <class 'numpy.ndarray'> (1,)
20 1 <class 'numpy.ndarray'> (1,)
test parallel analysis
Freq: Serial    vs      Parallel call, and difference
0.15550547777584314     0.15550547777584314     0.0
0.1554930130874591      0.1554930130874591      0.0
0.155481861623794       0.155481861623794       0.0
0.15547202316459072     0.15547202316459072     0.0
0.1554634975170763      0.1554634975170763      0.0
0.15545628451671034     0.15545628451671034     0.0
0.15545038402776618     0.15545038402776618     0.0
0.15544579594381305     0.15544579594381305     0.0
0.1554425201880633      0.1554425201880633      0.0
0.15544055671333398     0.15544055671333398     0.0
0.27603734161416893     0.27603734161416893     0.0
0.15544056657862776     0.15544056657862776     0.0
0.15544253997071045     0.15544253997071045     0.0
0.15544582576063856     0.15544582576063856     0.0
0.15545042405255285     0.15545042405255285     0.0
0.15545633498112718     0.15545633498112718     0.0
0.15546355871105055     0.15546355871105055     0.0
0.1554720954366531      0.1554720954366531      0.0
0.1554819453814049      0.1554819453814049      0.0
0.15549310879724512     0.15549310879724512     0.0
0.15550558596374864     0.15550558596374864     0.0

JeanLucPons commented 8 months ago

FMA results on EBS S28F:

Tracking on EBS done with AT 0.4.1

get_tune_harmonics() computation done with AT 0.3.1

get_tune_harmonics() computation done with AT 0.4.1

There is clearly an issue with the present 0.4.1 get_tune_harmonics() likely not link to parallelization.

Tracking done with GPU , get_tune_harmonics() with AT 0.3.1

JeanLucPons commented 8 months ago

htune0 = at.get_tunes_harmonic(cents=CU[0,:,:4096], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4, maxiter=1000)
htune1 = at.get_tunes_harmonic(cents=CU[0,:,-4096:], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4, maxiter=1000)
vtune0 = at.get_tunes_harmonic(cents=CU[1,:,:4096], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4, maxiter=1000)
vtune1 = at.get_tunes_harmonic(cents=CU[1,:,-4096:], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4, maxiter=1000)

Changing maxiter does not help :(

oscarxblanco commented 8 months ago

Dear @JeanLucPons , is the tracking also done in different version of pyat ? or is it just the analysis that differs ?

The main difference that I see is that at0.4 results show a blank section in a frequency region where particles are already very likely to be unstable. I wonder if data with nans or with non-well defined frequency is automatically dismissed, either due to tracking or the analysis itself. Have a look on the modification done here for example https://github.com/atcollab/at/pull/613

The plot done in at0.3.1 with GPU tracking data has finer grid, and looks much nicer, but to me they are equivalent.

JeanLucPons commented 8 months ago

Hi Oscar, Yes you're right, a large part of the region is rather unstable but not the one near x=0 at low amplitude. I don't think that there is a small offset for x=0 but the hole starting at 1.5mm is unexpected. The tracking on CPU has been done with AT 0.4.1. The tracking on GPU with new GPU code (not yet released). The problem is likely due to the PR you mentioned.

oscarxblanco commented 8 months ago

Dear @JeanLucPons I typically add 1 nm to the closed orbit in order to avoid passing numerical zeros to the frequency analysis library.

https://github.com/atcollab/at/blob/52298a90a5059e0858298a1cfec2a0da3f12c143/pyat/at/physics/frequency_maps.py#L178-L180

JeanLucPons commented 8 months ago

In fact the offset is there (1e-6). Here are the starting coordinates of the tracking i used.

x = np.linspace(1e-6, 0.013, num=nbParticlesX)
y = np.linspace(0.005,1e-6,  num=nbParticlesY)
xx, yy = np.meshgrid(x, y, sparse=False)
X = xx.flatten()   #initialize the horizontal and
Y = yy.flatten()   #the vertical initial position of each particle
Z01 = np.array([X,z,Y,z,z,z]);
Z1, *_ = ring.track(Z01, nturns=nturns, use_mp=True, pool_size=64)

oscarxblanco commented 8 months ago

@JeanLucPons Could you check if the tracking data is valid ? Are there any nan before the analysis ? For example,

[print(f'{numpy.sum(Z1[0,i,0,:]):.3f}') for i in range(nbParticlesX*nbParticlesY)]

JeanLucPons commented 8 months ago

Yes there are some nan is the result of the tracking (when outisde the DA). This was also something addressed in this PR.

If there are nan in the given signal, get_tunes_harmonic() should return nan (and no exception)
If no harmonics are found in the window [fmin,fmax], nan is also returned (also without exception)

This allows to get a compact code for doing FMA and IMHO it works well. nan are displayed in white in my above plots.

IMHO, the issue comes from the way harmonics are extracted, it is strange that maxtier do no help (may be 1000 is not enough), we have to perform more tests.

The DA looks like that:

oscarxblanco commented 8 months ago

@JeanLucPons, Is there coupling in the lattice you are testing ?

JeanLucPons commented 8 months ago

Yes Here are the starting tune at x=1e-6m (nominal Qx=0.21, Qy=0.34)

Using AT 0.3.1

Using AT 0.4.1

swhite2401 commented 8 months ago

Hello @JeanLucPons and @oscarxblanco, Thanks for looking into this, I have now added an optional input argument remove_mean that removes the constant part of the signal. It is set to True by default.

With this I get the following by trying to reproduce the results, this looks much better. There are still difference getting close to the integer of half integer resonances... but I am not sure we are using the same lattice.

JeanLucPons commented 8 months ago

Thanks @swhite2401 .

Now i get (defaut maxtier): We don't see the unstable zone, but it is up to you to decide if it is ok or not.

Using maxiter=1000, i get:

/operation/common/miniconda/envs/jlp/lib/python3.8/site-packages/at/physics/harmonic_analysis.py:35: RuntimeWarning: overflow encountered in absolute
  k = numpy.argmax(numpy.abs(complex_values))

I don't think i have seen these overflows before.

swhite2401 commented 8 months ago

Ok thanks, I'll look into it.

JeanLucPons commented 8 months ago

I add plots in tune space (around the nominal tunes) done with the last get_tunes_harmonic() with the most killing resonance lines.

CPU tracking:

GPU tracking:

oscarxblanco commented 8 months ago

Hello @JeanLucPons and @oscarxblanco, Thanks for looking into this, I have now added an optional input argument remove_mean that removes the constant part of the signal. It is set to True by default.

With this I get the following by trying to reproduce the results, this looks much better. There are still difference getting close to the integer of half integer resonances... but I am not sure we are using the same lattice.

Oh, I see... Now I understand why I didn't notice it. I always remove the mean before passing the data to get the tunes. https://github.com/atcollab/at/blob/52298a90a5059e0858298a1cfec2a0da3f12c143/pyat/at/physics/frequency_maps.py#L204-L229

swhite2401 commented 8 months ago

@JeanLucPons , I have found a bug with a normalization that caused the subtraction to be wrong. This is now corrected. However, there is still a small difference that depends on the number of harmonics that I search for in the input range.

For num_harmonics=1 (present situation for tune search):

For num_harmonics=20 (AT 0.3.0, works also if only 2 harmonics are considered):

It seems that in some cases the max. of the raw FFT is not coherent with the max. of the interpolated FFT so by considering only 1 harmonic one may miss the max.

I think I will restore it as an input parameter, with a default set to 2. Then the user may decide to change it if he sees some problems. What do you think?

swhite2401 commented 8 months ago

Oh, I see... Now I understand why I didn't notice it. I always remove the mean before passing the data to get the tunes.

Ah good! By the way we saw that in your FMA code some calculations are not used as they were implemented for the initial NAFF version, do you mind if I comment them out in this PR?

JeanLucPons commented 8 months ago

I did a test with your last release and still having these red dots.

Lattice: "simple_ebs.mat" (no RF cavity, no wiggler, no short bend)

(jlp) [grappa.pons] > conda list | grep 'sci\|acc\|numpy'
accelerator-toolbox       0.4.1.dev34+g52298a90          pypi_0    pypi
numpy                     1.24.2           py38h10c12cc_0    conda-forge
scipy                     1.10.1                   pypi_0    pypi

Tracking

import at
import numpy as np
import time
ring = at.load_lattice("simple_ebs.mat");
nturns=10000
nbParticlesX = 32
nbParticlesY = 32*3
nbParticles = nbParticlesX * nbParticlesY
z = np.zeros(nbParticles)
x = np.linspace(1e-6, 0.001, num=nbParticlesX)
y = np.linspace(0.0035,0.002,  num=nbParticlesY)
xx, yy = np.meshgrid(x, y, sparse=False)
X = xx.flatten()   #initialize the horizontal and
Y = yy.flatten()   #the vertical initial position of each particle
Z01 = np.array([X,z,Y,z,z,z]);
print("Starting pass")
t0 = time.perf_counter()
Z1, *_ = ring.track(Z01, nturns=nturns, use_mp=True, pool_size=64)
t1 = time.perf_counter()
print("Ending pass")
print(f"PASS {t1 - t0:0.4f} seconds")
np.save("coord_xy",Z1[[0,2],:,0,:])

FMA

import matplotlib as mplt
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
import matplotlib.cbook as cbook
import matplotlib.colors as colors
import at
import math
import time

CU = np.load("coord_xy.npy")
nb = 32
I = np.empty([3*nb,nb])
print("Starting tune")
t0 = time.perf_counter()
htune0 = at.get_tunes_harmonic(cents=CU[0,:,:4096], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4)
htune1 = at.get_tunes_harmonic(cents=CU[0,:,-4096:], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4)
vtune0 = at.get_tunes_harmonic(cents=CU[1,:,:4096], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4)
vtune1 = at.get_tunes_harmonic(cents=CU[1,:,-4096:], fmin=0.001, fmax=0.499, use_mp=True, pool_size=4)
t1 = time.perf_counter()
print("Ending tune")
print(f"PASS {t1 - t0:0.4f} seconds")

# fma
for y in range(3*nb):
    for x in range(nb):
        p = y*nb+x
        hdt = htune0[p] - htune1[p]
        vdt = vtune0[p] - vtune1[p]
        if np.isnan(hdt) or np.isnan(vdt):
            delta = np.nan
            I[y, x] = np.nan
        else:
            delta = math.sqrt(hdt * hdt + vdt * vdt)
            if delta < 1e-10:
                delta = 1e-10
            dt = math.log10(delta)
            I[y, x] = dt

mplt.colors.Normalize(vmin=-10.0, vmax=0.0)
cmap = mplt.colormaps['rainbow']
cmap.set_bad(color='white')
plt.xlabel('m')
plt.ylabel('m')
pos = plt.imshow(I,interpolation='none',extent=[0.0,0.001,0.002,0.0035], cmap=cmap, label="m")
plt.colorbar(pos,label='log\u2081\u2080( \u221A(dQx\u00B2+dQy\u00B2) )')
plt.show()

JeanLucPons commented 8 months ago

Sorry @swhite2401, i missed your first comment. With num_harmonics=2, the result is ok for me. I'm ok to have num_harmonics=2 as default.

swhite2401 commented 8 months ago

Ok great, I'll clean up everything and improve the documentation regarding num_harmonics, then I think it is ready to merge.

oscarxblanco commented 8 months ago

Oh, I see... Now I understand why I didn't notice it. I always remove the mean before passing the data to get the tunes.

Ah good! By the way we saw that in your FMA code some calculations are not used as they were implemented for the initial NAFF version, do you mind if I comment them out in this PR?

Please do.

swhite2401 commented 8 months ago

This is ready to be merged, any other comments?

atcollab / at

parallelize get_tunes_harmonic #681