atcollab / at

Accelerator Toolbox
Apache License 2.0

BugFix: memory error incollective effects circular buffers #771

Closed swhite2401 closed 4 months ago

swhite2401 commented 4 months ago

This PR fixes a bug in the collective effects implementation. Circular buffers are used to store the slice information and the history of some parameters. These buffers were shifted even when their size was equal to 1. In the new implementation, buffers of size 1 are simply overwritten instead of shifted.

lcarver commented 4 months ago

Just for future reference, this bug is a bit strange. While debugging some MPI scripts sent in by some new users, we found a specific configuration of parameters that produced segmentation faults.

For example (many lines of code omitted), when the following code:


comm = MPI.COMM_WORLD
size = comm.Get_size()
rank = comm.Get_rank()

Nbunches = 1332
nparts_per_bunch = 666
Npart = int(Nbunches * nparts_per_bunch / size)  # compute the per-process particle count for MPI

add_beamloading(fring, qfactor, rshunt, Nslice=50, blmode=BLMode.PHASOR, cavitymode=CavityMode.ACTIVE, VoltGain=0.0001, PhaseGain=0.0001)

is launched with mpirun -n 64 python file.py

we get a segmentation fault. Increasing nparts_per_bunch solves the problem; changing Nslice from 50 to 51 also solves the problem.

I have verified that the bug fix proposed here solves the problem... but I'm not sure I fully understand the details.

lcarver commented 4 months ago

In line 176 of atimplib.c there is the line

    weight[i] *= bunch_currents[ib]/np_bunch[ib];

which should be replaced by

    if (np_bunch[ib] == 0.0) {
        weight[i] = 0.0;
    }
    else {
        weight[i] *= bunch_currents[ib]/np_bunch[ib];
    }
swhite2401 commented 4 months ago

> I have verified that the bug fix proposed here solves the problem... but I'm not sure I fully understand the details.

You may take a look at this: https://en.wikipedia.org/wiki/Undefined_behavior

swhite2401 commented 4 months ago

I am adding @simoneliuzzo as a reviewer because neither @lcarver nor I can approve this PR.