@HighIander Thanks for providing such a detailed benchmark :+1:. I assume with "quite" you mean quiet/regular initialization. Is this correct? Do you observe a similar behavior when choosing random initialization? (A difference would surprise me, but it would be interesting to see how the start position influences the result in both simulations - and thus might point us in the right direction.)
I do not understand your last plot - I only see red and green lines, no purple orange or blue.
Just as a safety check, does this setup use complex number math? There is a problem with those in dev (#3379), but not in the last release.
> I do not understand your last plot - I only see red and green lines, no purple orange or blue.

The dark blue was a labeling mistake. The blue, orange and green lines show 3 Smilei simulations that are so identical that they cover each other (which is why the last colour, green, is the only one visible). For PIConGPU, on the other hand, you see 9 simulations that all take a different course (even though we used a quiet start), which points to a difference in reproducibility between the two codes.
To debug this behaviour I created a minimal example based on @wmles' example: https://github.com/psychocoderHPC/picongpu/tree/topic-debugFoil
```bash
pic-create $PICSRC/examples/FoilDebug/ foildDebug/
pic-build
do tbg -s -t -c etc/picongpu/duenn.cfg runs/esirkepov_tsc_64ppc/$(printf "%04d" $i); done
```
Compare Histogram of one timestep:
```python
import numpy as np
import matplotlib.pyplot as plt

fig, axlist = plt.subplots(1, 1, figsize=(14, 8))

zeile = 140  # row (output line) of the histogram file to compare
prefix = "runs/esirkepov_tsc_64ppc"

for nr in range(0, 2):
    arr = np.loadtxt(f"{prefix}/{nr:04d}/simOutput/H_energyHistogram_all.dat", skiprows=1)
    # convert counts to a particle density and smooth with a 5-bin box filter
    plt.plot(
        np.linspace(0, 250, 1024)[2:-2],
        np.convolve(arr[zeile, 2:-2] * 1024 / 250, [0.2] * 5, mode="valid"),
        label=nr,
    )

plt.yscale("log")
plt.xlim([5, 90])

ts = arr[zeile, 30]
time = int(round((ts - 11404) / 3394 * 25))

plt.legend()
plt.xlabel("E_H in MeV")
plt.ylabel("particle density")
```
Things to try, discussed with @psychocoderHPC
I manipulated PIConGPU to run `ComputeCurrent`, `copyGuardToExchange`, `shiftParticles` and `fillGaps` sequentially. I used 4 ppc with the already posted setup and used the same starting position within the cell (middle of the cell) for all particles. With these changes the results of multiple simulations are equal.
`copyGuardToExchange` is the kernel which is called to send particles over the periodic boundaries. This kernel packs particles so that we can send them. The packing is highly parallel and uses atomic functions, which results in a random order of particles. Later we unpack the particles we receive and add them to the existing frame list. This process randomizes the order of particles within the frame linked list. I assume the resulting processing order in the current deposition causes the differing particle dynamics between simulation runs.
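(A small illustration of the summation-order point, my own sketch in Python rather than PIConGPU code: floating-point addition is not associative, so accumulating the same per-particle contributions in a different order typically yields results that differ in the last bits.)

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical per-particle current contributions, single precision as in PIConGPU
contributions = rng.random(10_000, dtype=np.float32) * np.float32(1e-3)

def accumulate(values):
    """Sequentially accumulate in float32, mimicking atomic adds arriving in some order."""
    total = np.float32(0)
    for v in values:
        total += v
    return total

ordered = accumulate(contributions)
randomized = accumulate(rng.permutation(contributions))

# the two sums typically differ by a few ULPs because float addition is not associative
print(ordered, randomized, ordered - randomized)
```

A last-bit difference per cell like this is all it takes for otherwise identical runs to diverge over many time steps.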
I think it could also theoretically be that `copyGuardToExchange` is somehow buggy in the parallel mode, e.g. there is a data race or wrong atomic parameters?
> @HighIander Thanks for providing such a detailed benchmark 👍. I assume with "quite" you mean quiet/regular initialization. Is this correct?

Yes

> Do you observe a similar behavior when choosing random initialization? (A difference would surprise me, but it would be interesting to see how the start position influences the result in both simulations - and thus might point us in the right direction.)

Yes
Just a few thoughts since I was pinged offline. For PIConGPU:
I did some simulations with the setup from https://github.com/ComputationalRadiationPhysics/picongpu/issues/3377#issuecomment-708934127
Can you do the same scans in double-precision?
`esirkepov_tsc_16ppc_random_160k_res768_v100`
@steindev helped me to look deeper into this issue: the issue comes from the field solver in combination with the current deposition. There is a numerical issue which introduces a strange instability. We also tried the new Arbitrary Order pusher; order 4+ results in NaNs. @steindev will look into that issue again next week.
So here comes the detailed description of the issue we saw. What I describe in the following are the results of a reproducer I already set up after encountering this issue with @psychocoderHPC before.
The setup consists of a 64x64x64 volume with cubic cells and periodic boundaries in all directions.
A 17 cell thick slice in the x-y plane, located around the center of the volume, is homogeneously filled with electrons.
For initialization `startPosition::Quiet` is used with 2 particles per cell.
Initially, all electrons are assigned a constant drift velocity of `0.5 * CELL_WIDTH / DELTA_T` along the z direction.
We use `particles::shapes::CIC`, `currentSolver::EsirkepovNative`, and `particles::pusher::HigueraCary` (due to better accuracy than Boris).
We analyze PIConGPU's checkpoint output after one time step.
~~As expected, the homogeneous density and drift produce a homogeneous current density resulting in a homogeneous `Ez` field in the bulk of the foil, see the left picture in the figure below showing the absolute value of the `Ez` field normalized to its maximum value in order to enhance the visibility of deviations (note the logarithmic color scale). The magnetic `Bx` and `By` fields, however, are not homogeneous, in contrast to our expectation.~~
In contrast to our expectation, the `Ez` field is not constant within the bulk of the foil, although it should be since the particle density and velocity are homogeneous.
See the left picture in the figure below, showing the absolute value of the difference between `Ez` values within the foil and the `Ez` value in the center of the foil.
You can see that `Ez` values are constant within a super cell spacing along x, but that inaccuracies occur at super cell boundaries.
The origin of these differences may be inaccuracies in the current density calculation.
Consequently, inaccuracies are found in the magnetic field components, too, see figure, although these components are supposed to be zero in the bulk of the foil since the `Ez` field is supposed to be constant.
In my opinion, it is concerning that these inaccuracies vary at super cell boundaries.
The inaccuracies seem to be of the order of floating point precision. Running with double-precision decreases the inaccuracy, but does not remove it. See figure below. Note, I had to reduce the super cell size to 4x4x4 due to shared memory limitations.
The arbitrary order solver (using a time step according to the CFL criterion for Yee) seems to amplify these field irregularities, as it develops a pronounced wave-like field structure with a super cell wavelength whose amplitude exponentially increases before resulting in `nan`s.
EDIT: Thanks to @sbastrakov for pointing out a (stupid) error in my analysis.
I had to update my previous post. Thanks to @sbastrakov for pointing out a (stupid) error in my analysis.
So from the updated plots there are slight differences between `Ez` values in supercells, I guess due to the current deposition using different summation orders. And they also cause slight differences of the B components from the theoretical 0 and between each other. However, all these discrepancies seem low, and they are basically inevitable with the combination of floating-point arithmetic and parallel computing we use. The fact that they cause substantial differences in the test case results is worrying and raises the question whether the setup is an adequate test, since it is so sensitive to practically unavoidable noise (unless we are missing some other sort of errors, of course).
The question is why such small differences get exponentially amplified, e.g. with the high-order field solver. @steindev could it be that we need to decrease the time step because the higher-order solver has a lower CFL criterion?
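(For reference, my addition and not taken from the PIConGPU docs: the standard CFL limit of the Yee scheme on a Cartesian grid; higher-order FDTD stencils have a stricter, i.e. smaller, limit, so a time step chosen at the Yee limit can already be unstable for them.)

```latex
c\,\Delta t \le \left( \frac{1}{\Delta x^2} + \frac{1}{\Delta y^2} + \frac{1}{\Delta z^2} \right)^{-1/2}
\quad\xrightarrow{\ \Delta x = \Delta y = \Delta z\ }\quad
c\,\Delta t \le \frac{\Delta x}{\sqrt{3}}
```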
btw: Over the weekend I wrote a detection for one-bit differences and flushed one-bit differences in the field solver to zero. For the high-order pusher it did not solve the issue, but IMO that is due to the large time step. I will push the one-bit-difference detection later.
Yes @psychocoderHPC that is indeed a question: maybe something is wrong in the implementation, or we accept a too wide range of parameters for the high-order solver. My question was generally about the setup, since we observed substantial discrepancies in cutoff energies for the standard solver as well.
The discrepancy is there because the instability triggered by the precision error is too dominant. I have not tested the original test case I posted with the one-bit error detection/smoother. What I tested with @steindev last Friday is to make a hole of one cell (x-direction) into the target. This gives very good results with 10 test simulations (2nd order pusher). As @steindev said to me, it could be that the hole creates fields which are more dominant than the single-bit fluctuation in the current.
I thought the same and queued a simulation using a tenth of the time step hours ago. Unfortunately, @n01r blocks the resources.
Patch to avoid that two single-precision variable values are subtracted if there is only a one-bit difference. The patch is not generic (and hacky) and only affects the `curlE`, and there only the calculation of the x component.
```diff
diff --git a/include/picongpu/fields/differentiation/ForwardDerivative.hpp b/include/picongpu/fields/differentiation/ForwardDerivative.hpp
index b247ee095..e5f2a6375 100644
--- a/include/picongpu/fields/differentiation/ForwardDerivative.hpp
+++ b/include/picongpu/fields/differentiation/ForwardDerivative.hpp
@@ -58,6 +58,11 @@ namespace differentiation
                 int
             >::type;
 
+        HDINLINE int _float_as_int(float_X& x) const
+        {
+
+            return *((int*)(&x));
+        }
         /** Return derivative value at the given point
          *
          * @tparam T_DataBox data box type with field data
@@ -72,7 +77,18 @@ namespace differentiation
                 Index,
                 T_direction
             >();
-            return ( data( upperIndex ) - data( Index{} ) ) /
+
+            float_X tmp = (data( upperIndex )[1] - data( Index{} )[1]) ;
+            bool cc = ((0x007fffff & _float_as_int(tmp)) == 0x007fffff) || (((0x007fffff ^ _float_as_int(tmp)) & 0x007fffff) == 0x007fffff );
+
+            auto res = ( data( upperIndex ) - data( Index{} ) );
+            if(T_direction==0 && data( Index{} )[1] != data( upperIndex )[1])
+            {
+                res.y() = tmp * float_X(!cc);
+            }
+
+            return res /
                 cellSize[ T_direction ];
         }
     };
```
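(To illustrate what "one bit difference" means here, a Python sketch of my own, not the logic of the patch above: two single-precision values whose bit patterns are adjacent differ by one ULP, and it is this kind of tiny difference that gets flushed to zero in the derivative.)

```python
import struct

def float32_bits(x: float) -> int:
    """Bit pattern of x after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def differ_by_one_ulp(a: float, b: float) -> bool:
    """True if the float32 representations of a and b are adjacent.

    Note: this naive integer comparison only holds for values of the same sign;
    the patch above instead inspects the bit pattern of the difference.
    """
    return abs(float32_bits(a) - float32_bits(b)) == 1

a = 1.0
b = struct.unpack("<f", struct.pack("<I", float32_bits(a) + 1))[0]  # next float32 after 1.0
print(differ_by_one_ulp(a, b))  # True
print(b - a)                    # ~1.2e-07, the kind of difference that gets flushed to zero
```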
The behaviour of the AOFDTD solver really is a numerical stability issue. Compare the following pictures with the CFL determined timestep for Yee (upper) and a tenth of it (lower). Pictures are taken at equal real time (due to the factor ten in Δt)
So this explains the instability we have seen with the AOFDTD method. But not the difference in cut-off energies observed with standard Yee scheme...
@steindev and I discussed offline yesterday the possibility that, since the current deposition kernel runs on particle frames and writes the current atomically into the global data space, neighboring super cells may write their small contributions either before or after the super cell itself has deposited its current. The different order of summation might cause the loss of small current contributions, causing errors in the current.
To test this, I wrote a small Python script that models current deposition in 1D for equal-velocity particles using 8 cells with 2x4 cells as "super cells". For CIC-shaped, randomly distributed particles, there seems to be no difference between super-cell summation and random summation:
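(A minimal sketch of how such a 1D toy comparison could look; this is my own reconstruction under the assumptions stated above - 8 cells, two "super cells" of 4 cells, CIC shape, equal contributions per particle, here with 64 ppc - and not the actual script used for the plots.)

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 8
super_cell = 4            # two "super cells" of 4 cells each
ppc = 64                  # my choice; the plots scanned several ppc values
n_part = n_cells * ppc

# randomly distributed particles with equal velocity, i.e. equal current contribution
pos = rng.random(n_part).astype(np.float32) * np.float32(n_cells)
contrib = np.float32(1.0 / ppc)

def deposit(order):
    """CIC deposition of all particle contributions onto the grid, in the given order."""
    j = np.zeros(n_cells, dtype=np.float32)
    for p in order:
        x = pos[p]
        i = int(x)                                      # left neighbouring cell
        frac = np.float32(x - i)
        j[i % n_cells] += contrib * (np.float32(1) - frac)
        j[(i + 1) % n_cells] += contrib * frac          # periodic wrap-around
    return j

# (a) particles grouped by the super cell they sit in, (b) fully random order
j_super = deposit(np.argsort(pos // super_cell, kind="stable"))
j_random = deposit(rng.permutation(n_part))

# differences stay at the level of float32 rounding
print(np.max(np.abs(j_super - j_random)))
```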
I will check higher order shapes and quiet initialization.
[EDIT after discussion with @wmles]: ~What worries me is the high noise level for low ppc.~ What worries me is the high max and min values for low ppc.
There seems to be no difference between random-addition and super-cell-wise-addition for a quiet start either:
Even if the currents cancel in the statistical average, there seems to be no difference.
Surprisingly, the super-cell-wise addition performs slightly better.
Same is true for random positions:
@psychocoderHPC I would thus suspect that the issue does not arise from numeric errors caused by the parallel execution of `ComputeCurrent`. The uncertainty caused by the particle distribution (either random or uniform) is much greater than the error caused by a partially ordered addition (which honestly surprised me).
> @psychocoderHPC I would thus suspect that the issue does not arise from numeric errors caused by the parallel execution of `ComputeCurrent`. The uncertainty caused by the particle distribution (either random or uniform) is much greater than the error caused by a partially ordered addition (which honestly surprised me).

I still think it is coming from noise at supercell borders. I saw single-bit differences, and this difference is then creating a magnetic field. Secondly, the computational error in the current deposition is larger in the positive direction (right side of the supercell) because the algorithm has a sum over cells in the positive direction which is always evaluated from left to right (negative to positive) for each direction. With the checkerboard strategy, which can be used to deposit the current, we accumulate 4 supercells in 2D. So IMO the error is largest at the right/bottom in 2D.
> Patch to avoid that two single-precision variable values are subtracted if there is only a one-bit difference. The patch is not generic (and hacky) and only affects the `curlE`, and there only the calculation of the x component. [...]

I tested the simulation with the one-bit error correction; the fluctuation in the results still exists.
I also tested serial CPU (one core) + one-bit correction. The fluctuation in the results still exists. In the test simulation I changed the compute order of the particles in the current deposition for each run.
After talking with @HighIander, I decided to test if the fluctuation also happens if we use a slightly different charge per particle.
I ran a test on GPU where each supercell was calculated sequentially. Each ion got a slightly different charge by setting the bound electrons for hydrogen to a value in `[0, 0.1)` and cloning electrons from these ions. The electrons also got bound electrons, and the trait to calculate the charge based on bound electrons was updated to have the correct sign of the charge for electrons.
I used 16 ppc quiet (4x4) to speed up the simulation.
The single-bit filter in the field solver was activated.
```cpp
// updated charge trait for electrons: derive the charge from the boundElectrons
// attribute, with the sign flipped relative to the ions
return
    frame::getCharge< typename T_Particle::FrameType >() * -1.0_X *
    ( particle[boundElectrons_] - protonNumber ) *
    weighting;
```
After some further investigation attempts and a discussion today involving quite a few people, we found an issue. It is not sure yet whether this is the issue causing this effect, but it definitely seems to be a problem.
The initial temperature in this, and probably some other, setups seems to be chosen such that the corresponding Debye length is not resolved by the grid. (In particular, if no temperature is set at all, no grid could resolve it.) This violation does not necessarily lead to wrong results. But it could, and a couple of quick tests by @psychocoderHPC adding a minimum temperature so that we resolve the Debye length seem to decrease the differences between runs by a lot. This result is preliminary so far; the investigation is to be continued.
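(For reference, a small sketch of the check being discussed, my addition; the formula is the standard electron Debye length and the numbers below are placeholders, not the parameters of this setup.)

```python
import math
from scipy import constants as c

def debye_length(T_e_eV: float, n_e: float) -> float:
    """Electron Debye length in meters for temperature T_e (eV) and density n_e (1/m^3)."""
    return math.sqrt(c.epsilon_0 * T_e_eV * c.elementary_charge / (n_e * c.elementary_charge**2))

# placeholder values, not the parameters of the setup in this issue
cell_size = 20e-9   # m
T_e = 70e3          # eV
n_e = 1e27          # 1/m^3

lam_D = debye_length(T_e, n_e)
print(f"Debye length {lam_D:.2e} m, cell size {cell_size:.2e} m, ratio {lam_D / cell_size:.2f}")
if lam_D < cell_size:
    print("Warning: the Debye length is not resolved by the grid.")
```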
I think, to confirm it, it would also help to take some classical setup for the cold-beam instability from the Hockney book (I guess?).
Not sure, but maybe this paper and some of its references are relevant for this issue.
That's crazy. I never observed large problems with T=0, but obviously it might be dangerous after all. I'll check the references and try to understand why, thx so far!
> That's crazy. I never observed large problems with T=0, but obviously it might be dangerous after all. I'll check the references and try to understand why, thx so far!

T->0 is one of the well known cases where PIC fails conceptually and produces wrong results. If you have never seen this before, you were either extremely lucky or other effects were hiding/dissipating the problem.
What was the initial temperature for the Smilei sims?
@HighIander: Maybe of interest too: an original paper by Birdsall et al. (https://doi.org/10.1016/0021-9991(80)90171-0), which @PrometheusPi found after we encountered this issue for the first time, when I remembered the effect being described in the Hockney book as the cold streaming plasma instability. In Hockney it is described in ch. 7-3-3, The Warm-Plasma Approximation. Back then we saw it in a bunch acceleration where the beam's divergence increased during free propagation. Just as now, the issue was resolved by a higher resolution (i.e. reduced Δx) and/or a higher initial bunch divergence/temperature.
Here are my results for the toy setup with 64 ppc at timestep 12k (shortly before the first particles left the global volume on the laser side) and timestep 23k (before particles left the volume on the right side of the simulation volume).
The plot shows 100 simulations with 70 keV initial temperature and 10 simulations with 0 keV temperature.
PIConGPU was running on GPU (V100) with full parallelism and the current deposition strategy `StridedCachedSupercells`.
There is still a fluctuation but the spread is much lower.
I now pass the ball back to the physicists to correct their setup and to update the PIConGPU documentation to make all users more aware of these effects.
@sbastrakov is working on a validator to print a warning after the simulation is initialized if needed.
I can also recommend section 4.1 of the paper (https://doi.org/10.3847/1538-4357/aa6d13) which describes the problem in short.
> @sbastrakov is working on a validator to print a warning after the simulation is initialized if needed.

It would actually be cool if we could factor in all initial conditions for the validation somehow. Like, how the chosen resolution, number of particles, particle shape, activated ionization and collisions all affect this issue.
To complete the tests:
We checked whether ionization (ADK) helps to reduce the fluctuation.
@steindev asked me to test whether reducing the pulse length of the laser (plane wave) helps too, because reducing the pulse length results in a faster increase of the laser energy.
This simulation halves the pulse length.
Dear all, I think we need a dedicated approach to testing that brings everybody up to speed.
@HighIander I think we can close this issue, can't we?
I opened an issue so that we do not forget to add a temperature to the foil LCT example: #3438
Synopsis:
We performed a set of "simple" ion acceleration test simulations and compared them against Smilei. The results are sometimes in full agreement, sometimes off by more than 30%.
Those are the details:
We simulated a transversely periodic box with a transversely plane laser. The target was pure hydrogen (H + electrons). Other parameters: [coming soon!]
We varied a0 (5, 10, 20) and n (50, 100, 200). Our FOM is the evolution of the maximum H energy over time (cut-off edge, i.e. not the single most energetic ion).
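(A sketch of how the cut-off edge could be extracted from an energy histogram; this is my own illustration, not the analysis actually used, and the count threshold as well as the column layout of the `.dat` file are assumptions carried over from the plotting script earlier in this thread.)

```python
import numpy as np

def cutoff_energy(energies, counts, threshold=1.0):
    """Highest bin energy (MeV) whose particle count still exceeds `threshold`."""
    populated = np.nonzero(np.asarray(counts) > threshold)[0]
    return energies[populated[-1]] if populated.size else 0.0

# usage, reusing the slicing from the plotting script above
arr = np.loadtxt("runs/esirkepov_tsc_64ppc/0000/simOutput/H_energyHistogram_all.dat", skiprows=1)
energies = np.linspace(0, 250, 1024)[2:-2]
counts = arr[140, 2:-2][: energies.size]  # guard in case the row has extra columns
print(cutoff_energy(energies, counts), "MeV")
```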
Results:
1. Scanning a0:
2. Scanning n0:
3. Comparison of the spectrum of a0=20,n=200: