@HighIander Thanks for providing such a detailed benchmark :+1:. I assume with "quite" you mean quiet/regular initialization. Is this correct? Do you observe a similar behavior when choosing random initialization? (A difference would surprise me, but it would be interesting to see how the start position influences the result in both simulations - and thus might point us in the right direction.)
I do not understand your last plot - I only see red and green lines, no purple orange or blue.
Just as a safety check, does this setup use complex number math? There is a problem with those in dev (#3379), but not in the last release.
> I do not understand your last plot - I only see red and green lines, no purple orange or blue.

The dark blue was a labeling mistake. The blue, orange and green lines show 3 Smilei simulations that are so identical that they cover each other (which is why the last colour, green, is the only one visible). For PIConGPU, on the other hand, you see 9 simulations that all take a different course (even though we used a quiet start), which points to a difference in reproducibility between the two codes.
To debug this behaviour I created a minimal example based on @wmles' example: https://github.com/psychocoderHPC/picongpu/tree/topic-debugFoil
```bash
pic-create $PICSRC/examples/FoilDebug/ foildDebug/
pic-build
do tbg -s -t -c etc/picongpu/duenn.cfg runs/esirkepov_tsc_64ppc/$(printf "%04d" $i); done
```
Compare Histogram of one timestep:
```python
import numpy as np
import matplotlib.pyplot as plt

fig, axlist = plt.subplots(1, 1, figsize=(14, 8))

zeile = 140  # row (output line) of the histogram file to compare
prefix = "runs/esirkepov_tsc_64ppc"

for nr in range(0, 2):
    arr = np.loadtxt(f"{prefix}/{nr:04d}/simOutput/H_energyHistogram_all.dat", skiprows=1)
    # convert counts to a particle density and smooth with a 5-bin box filter
    plt.plot(
        np.linspace(0, 250, 1024)[2:-2],
        np.convolve(arr[zeile, 2:-2] * 1024 / 250, [0.2] * 5, mode="valid"),
        label=nr,
    )

plt.yscale("log")
plt.xlim([5, 90])

ts = arr[zeile, 30]
time = int(round((ts - 11404) / 3394 * 25))

plt.legend()
plt.xlabel("E_H in MeV")
plt.ylabel("particle density")
```
Things to try, discussed with @psychocoderHPC
I manipulated PIConGPU to run `ComputeCurrent`, `copyGuardToExchange`, `shiftParticles` and `fillGaps` sequentially. I used 4 ppc with the already posted setup and used the same starting position within the cell (middle of the cell) for all particles. With these changes the results of multiple simulations are equal.
`copyGuardToExchange` is the kernel which is called to send particles over the periodic boundaries. This kernel packs particles so that we can send them. The packing is highly parallel and uses atomic functions, which results in a random order of particles. Later we unpack the particles we receive and add them to the existing frame list. This process randomizes the order of particles within the frame linked list. I assume the resulting processing order in the current deposition causes the differing particle dynamics between simulation runs.
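(A small illustration of the summation-order point, my own sketch in Python rather than PIConGPU code: floating-point addition is not associative, so accumulating the same per-particle contributions in a different order typically yields results that differ in the last bits.)

```python
import numpy as np

rng = np.random.default_rng(42)

# hypothetical per-particle current contributions, single precision as in PIConGPU
contributions = rng.random(10_000, dtype=np.float32) * np.float32(1e-3)

def accumulate(values):
    """Sequentially accumulate in float32, mimicking atomic adds arriving in some order."""
    total = np.float32(0)
    for v in values:
        total += v
    return total

ordered = accumulate(contributions)
randomized = accumulate(rng.permutation(contributions))

# the two sums typically differ by a few ULPs because float addition is not associative
print(ordered, randomized, ordered - randomized)
```

A last-bit difference per cell like this is all it takes for otherwise identical runs to diverge over many time steps.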
I think it could also theoretically be that `copyGuardToExchange` is somehow buggy in the parallel mode, e.g. there is a data race or wrong atomic parameters?
> @HighIander Thanks for providing such a detailed benchmark 👍. I assume with "quite" you mean quiet/regular initialization. Is this correct?

Yes

> Do you observe a similar behavior when choosing random initialization? (A difference would surprise me, but it would be interesting to see how the start position influences the result in both simulations - and thus might point us in the right direction.)

Yes
Just a few thoughts since I was pinged offline. For PIConGPU:
I did some simulations with the setup from https://github.com/ComputationalRadiationPhysics/picongpu/issues/3377#issuecomment-708934127
Can you do the same scans in double-precision?
`esirkepov_tsc_16ppc_random_160k_res768_v100`
@steindev helped me to look deeper into this issue: the issue comes from the field solver in combination with the current deposition. There is a numerical issue which introduces a strange instability. We also tried the new Arbitrary Order pusher; order 4+ results in NaNs. @steindev will look into that issue again next week.
So here comes the detailed description of the issue we saw. What I describe in the following are the results of a reproducer I already set up after encountering this issue with @psychocoderHPC before.
The setup consists of a 64x64x64 volume with cubic cells and periodic boundaries in all directions.
A 17 cell thick slice in the x-y plane, located around the center of the volume, is homogeneously filled with electrons.
For initialization `startPosition::Quiet` is used with 2 particles per cell.
Initially, all electrons are assigned a constant drift velocity of `0.5 * CELL_WIDTH / DELTA_T` along the z direction.
We use `particles::shapes::CIC`, `currentSolver::EsirkepovNative`, and `particles::pusher::HigueraCary` (due to better accuracy than Boris).
We analyze PIConGPU's checkpoint output after one time step.
~~As expected, the homogeneous density and drift produce a homogeneous current density resulting in a homogeneous `Ez` field in the bulk of the foil, see the left picture in the figure below showing the absolute value of the `Ez` field normalized to its maximum value in order to enhance the visibility of deviations (note the logarithmic color scale). The magnetic `Bx` and `By` fields, however, are not homogeneous, in contrast to our expectation.~~
In contrast to our expectation, the `Ez` field is not constant within the bulk of the foil, although it should be since the particle density and velocity are homogeneous.
See the left picture in the figure below, showing the absolute value of the difference between `Ez` values within the foil and the `Ez` value in the center of the foil.
You can see that `Ez` values are constant within a super cell spacing along x, but that inaccuracies occur at super cell boundaries.
The origin of these differences may be inaccuracies in the current density calculation.
Consequently, inaccuracies are found in the magnetic field components, too, see figure, although these components are supposed to be zero in the bulk of the foil since the `Ez` field is supposed to be constant.
In my opinion, it is concerning that these inaccuracies vary at super cell boundaries.
The inaccuracies seem to be of the order of floating point precision. Running with double-precision decreases the inaccuracy, but does not remove it. See figure below. Note, I had to reduce the super cell size to 4x4x4 due to shared memory limitations.
The arbitrary order solver (using a time step according to the CFL criterion for Yee) seems to amplify these field irregularities, as it develops a pronounced wave-like field structure with a super cell wavelength whose amplitude exponentially increases before resulting in `nan`s.
EDIT: Thanks to @sbastrakov for pointing out a (stupid) error in my analysis.
I had to update my previous post. Thanks to @sbastrakov for pointing out a (stupid) error in my analysis.
So from the updated plots there are slight differences between `Ez` values in supercells, I guess due to the current deposition using different summation orders. And they also cause slight differences of the B components from the theoretical 0 and between each other. However, all these discrepancies seem low, and they are basically inevitable with the combination of floating-point arithmetic and parallel computing we use. The fact that they cause substantial differences in the test case results is worrying and raises the question whether the setup is an adequate test, since it is so sensitive to practically unavoidable noise (unless we are missing some other sort of errors, of course).
The question is why such small differences get exponentially amplified, e.g. with the high-order field solver. @steindev could it be that we need to decrease the time step because the higher-order solver has a lower CFL criterion?
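(For reference, my addition and not taken from the PIConGPU docs: the standard CFL limit of the Yee scheme on a Cartesian grid; higher-order FDTD stencils have a stricter, i.e. smaller, limit, so a time step chosen at the Yee limit can already be unstable for them.)

```latex
c\,\Delta t \le \left( \frac{1}{\Delta x^2} + \frac{1}{\Delta y^2} + \frac{1}{\Delta z^2} \right)^{-1/2}
\quad\xrightarrow{\ \Delta x = \Delta y = \Delta z\ }\quad
c\,\Delta t \le \frac{\Delta x}{\sqrt{3}}
```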
btw: Over the weekend I wrote a detection for one-bit differences and flushed one-bit differences in the field solver to zero. For the high-order pusher it did not solve the issue, but IMO that is due to the large time step. I will push the one-bit-difference detection later.
Yes @psychocoderHPC that is indeed a question: maybe something is wrong in the implementation, or we accept a too wide range of parameters for the high-order solver. My question was generally about the setup, since we observed substantial discrepancies in cutoff energies for the standard solver as well.
The discrepancy is there because the instability triggered by the precision error is too dominant. I have not tested the original test case I posted with the one-bit error detection/smoother. What I tested with @steindev last Friday is to make a hole of one cell (x-direction) into the target. This gives very good results with 10 test simulations (2nd order pusher). As @steindev said to me, it could be that the hole creates fields which are more dominant than the single-bit fluctuation in the current.
I thought the same and queued a simulation using a tenth of the time step hours ago. Unfortunately, @n01r blocks the resources.
Patch to avoid that two single-precision variable values are subtracted if there is only a one-bit difference. The patch is not generic (and hacky) and only affects the `curlE`, and there only the calculation of the x component.
```diff
diff --git a/include/picongpu/fields/differentiation/ForwardDerivative.hpp b/include/picongpu/fields/differentiation/ForwardDerivative.hpp
index b247ee095..e5f2a6375 100644
--- a/include/picongpu/fields/differentiation/ForwardDerivative.hpp
+++ b/include/picongpu/fields/differentiation/ForwardDerivative.hpp
@@ -58,6 +58,11 @@ namespace differentiation
                 int
             >::type;
 
+        HDINLINE int _float_as_int(float_X& x) const
+        {
+
+            return *((int*)(&x));
+        }
         /** Return derivative value at the given point
          *
          * @tparam T_DataBox data box type with field data
@@ -72,7 +77,18 @@ namespace differentiation
                 Index,
                 T_direction
             >();
-            return ( data( upperIndex ) - data( Index{} ) ) /
+
+            float_X tmp = (data( upperIndex )[1] - data( Index{} )[1]) ;
+            bool cc = ((0x007fffff & _float_as_int(tmp)) == 0x007fffff) || (((0x007fffff ^ _float_as_int(tmp)) & 0x007fffff) == 0x007fffff );
+
+            auto res = ( data( upperIndex ) - data( Index{} ) );
+            if(T_direction==0 && data( Index{} )[1] != data( upperIndex )[1])
+            {
+                res.y() = tmp * float_X(!cc);
+            }
+
+            return res /
                 cellSize[ T_direction ];
         }
     };
```
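(To illustrate what "one bit difference" means here, a Python sketch of my own, not the logic of the patch above: two single-precision values whose bit patterns are adjacent differ by one ULP, and it is this kind of tiny difference that gets flushed to zero in the derivative.)

```python
import struct

def float32_bits(x: float) -> int:
    """Bit pattern of x after rounding to single precision."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def differ_by_one_ulp(a: float, b: float) -> bool:
    """True if the float32 representations of a and b are adjacent.

    Note: this naive integer comparison only holds for values of the same sign;
    the patch above instead inspects the bit pattern of the difference.
    """
    return abs(float32_bits(a) - float32_bits(b)) == 1

a = 1.0
b = struct.unpack("<f", struct.pack("<I", float32_bits(a) + 1))[0]  # next float32 after 1.0
print(differ_by_one_ulp(a, b))  # True
print(b - a)                    # ~1.2e-07, the kind of difference that gets flushed to zero
```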
The behaviour of the AOFDTD solver really is a numerical stability issue. Compare the following pictures with the CFL determined timestep for Yee (upper) and a tenth of it (lower). Pictures are taken at equal real time (due to the factor ten in Δt)
So this explains the instability we have seen with the AOFDTD method. But not the difference in cut-off energies observed with standard Yee scheme...
@steindev and I discussed offline yesterday the possibility that, since the current deposition kernel runs on particle frames and writes the current atomically into the global data space, neighboring super cells may write their small contributions either before or after the super cell itself has deposited its current. The different order of summation might cause the loss of small current contributions, causing errors in the current.
To test this, I wrote a small Python script that models current deposition in 1D for equal-velocity particles using 8 cells with 2x4 cells as "super cells". For CIC-shaped, randomly distributed particles, there seems to be no difference between super-cell summation and random summation:
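(A minimal sketch of how such a 1D toy comparison could look; this is my own reconstruction under the assumptions stated above - 8 cells, two "super cells" of 4 cells, CIC shape, equal contributions per particle, here with 64 ppc - and not the actual script used for the plots.)

```python
import numpy as np

rng = np.random.default_rng(0)

n_cells = 8
super_cell = 4            # two "super cells" of 4 cells each
ppc = 64                  # my choice; the plots scanned several ppc values
n_part = n_cells * ppc

# randomly distributed particles with equal velocity, i.e. equal current contribution
pos = rng.random(n_part).astype(np.float32) * np.float32(n_cells)
contrib = np.float32(1.0 / ppc)

def deposit(order):
    """CIC deposition of all particle contributions onto the grid, in the given order."""
    j = np.zeros(n_cells, dtype=np.float32)
    for p in order:
        x = pos[p]
        i = int(x)                                      # left neighbouring cell
        frac = np.float32(x - i)
        j[i % n_cells] += contrib * (np.float32(1) - frac)
        j[(i + 1) % n_cells] += contrib * frac          # periodic wrap-around
    return j

# (a) particles grouped by the super cell they sit in, (b) fully random order
j_super = deposit(np.argsort(pos // super_cell, kind="stable"))
j_random = deposit(rng.permutation(n_part))

# differences stay at the level of float32 rounding
print(np.max(np.abs(j_super - j_random)))
```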
I will check higher order shapes and quiet initialization.
[EDIT after discussion with @wmles]: ~What worries me is the high noise level for low ppc.~ What worries me is the high max and min values for low ppc.
There seems to be no difference between random-addition and super-cell-wise-addition for a quiet start either:
Even if the currents cancel in the statistical average, there seems to be no difference.
Surprisingly, the super-cell-wise addition performs slightly better.
Same is true for random positions:
@psychocoderHPC I would thus suspect that the issue does not arise from numeric errors caused by the parallel execution of `ComputeCurrent`. The uncertainty caused by the particle distribution (either random or uniform) is much greater than the error caused by a partially ordered addition (which honestly surprised me).
> @psychocoderHPC I would thus suspect that the issue does not arise from numeric errors caused by the parallel execution of `ComputeCurrent`. The uncertainty caused by the particle distribution (either random or uniform) is much greater than the error caused by a partially ordered addition (which honestly surprised me).

I still think it is coming from noise at supercell borders. I saw single-bit differences, and this difference is then creating a magnetic field. Secondly, the computational error in the current deposition is larger in the positive direction (right side of the supercell) because the algorithm has a sum over cells in the positive direction which is always evaluated from left to right (negative to positive) for each direction. With the checkerboard strategy, which can be used to deposit the current, we accumulate 4 supercells in 2D. So IMO the error is largest at the right/bottom in 2D.
> Patch to avoid that two single-precision variable values are subtracted if there is only a one-bit difference. The patch is not generic (and hacky) and only affects the `curlE`, and there only the calculation of the x component. [...]

I tested the simulation with the one-bit error correction; the fluctuation in the results still exists.
I also tested serial CPU (one core) + one-bit correction. The fluctuation in the results still exists. In the test simulation I changed the compute order of the particles in the current deposition for each run.
After talking with @HighIander, I decided to test if the fluctuation also happens if we use a slightly different charge per particle.
I ran a test on GPU where each supercell was calculated sequentially. Each ion got a slightly different charge by setting the bound electrons for hydrogen to a value in `[0, 0.1)` and cloning electrons from these ions. The electrons also got bound electrons, and the trait to calculate the charge based on bound electrons was updated to have the correct sign of the charge for electrons.
I used 16 ppc quiet (4x4) to speed up the simulation.
The single-bit filter in the field solver was activated.
```cpp
// updated charge trait for electrons: derive the charge from the boundElectrons
// attribute, with the sign flipped relative to the ions
return
    frame::getCharge< typename T_Particle::FrameType >() * -1.0_X *
    ( particle[boundElectrons_] - protonNumber ) *
    weighting;
```
After some further investigation attempts and a discussion today involving quite a few people, we found an issue. It is not sure yet whether this is the issue causing this effect, but it definitely seems to be a problem.
The initial temperature in this, and probably some other, setups seems to be chosen such that the corresponding Debye length is not resolved by the grid. (In particular, if no temperature is set at all, no grid could resolve it.) This violation does not necessarily lead to wrong results. But it could, and a couple of quick tests by @psychocoderHPC adding a minimum temperature so that we resolve the Debye length seem to decrease the differences between runs by a lot. This result is preliminary so far; the investigation is to be continued.
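(For reference, a small sketch of the check being discussed, my addition; the formula is the standard electron Debye length and the numbers below are placeholders, not the parameters of this setup.)

```python
import math
from scipy import constants as c

def debye_length(T_e_eV: float, n_e: float) -> float:
    """Electron Debye length in meters for temperature T_e (eV) and density n_e (1/m^3)."""
    return math.sqrt(c.epsilon_0 * T_e_eV * c.elementary_charge / (n_e * c.elementary_charge**2))

# placeholder values, not the parameters of the setup in this issue
cell_size = 20e-9   # m
T_e = 70e3          # eV
n_e = 1e27          # 1/m^3

lam_D = debye_length(T_e, n_e)
print(f"Debye length {lam_D:.2e} m, cell size {cell_size:.2e} m, ratio {lam_D / cell_size:.2f}")
if lam_D < cell_size:
    print("Warning: the Debye length is not resolved by the grid.")
```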
I think, to confirm it, it would also help to take some classical setup for the cold-beam instability from the Hockney book (I guess?).
Not sure, but maybe this paper and some of its references are relevant for this issue.
That's crazy. I never observed large problems with T=0, but obviously it might be dangerous after all. I'll check the references and try to understand why, thx so far!
> That's crazy. I never observed large problems with T=0, but obviously it might be dangerous after all. I'll check the references and try to understand why, thx so far!

T->0 is one of the well known cases where PIC fails conceptually and produces wrong results. If you have never seen this before, you were either extremely lucky or other effects were hiding/dissipating the problem.
What was the initial temperature for the Smilei sims?
@HighIander: Maybe of interest too: an original paper by Birdsall et al. (https://doi.org/10.1016/0021-9991(80)90171-0), which @PrometheusPi found after we encountered this issue for the first time, when I remembered the effect being described in the Hockney book as the cold streaming plasma instability. In Hockney it is described in ch. 7-3-3, The Warm-Plasma Approximation. Back then we saw it in a bunch acceleration where the beam's divergence increased during free propagation. Just as now, the issue was resolved by a higher resolution (i.e. reduced Δx) and/or a higher initial bunch divergence/temperature.
Here are my results for the toy setup with 64 ppc at timestep 12k (shortly before the first particles left the global volume on the laser side) and timestep 23k (before particles left the volume on the right side of the simulation volume).
The plot shows 100 simulations with 70 keV initial temperature and 10 simulations with 0 keV temperature.
PIConGPU was running on GPU (V100) with full parallelism and the current deposition strategy `StridedCachedSupercells`.
There is still a fluctuation but the spread is much lower.
I now pass the ball back to the physicists to correct their setup and to update the PIConGPU documentation to make all users more aware of these effects.
@sbastrakov is working on a validator to print a warning after the simulation is initialized if needed.
I can also recommend section 4.1 of the paper (https://doi.org/10.3847/1538-4357/aa6d13) which describes the problem in short.
> @sbastrakov is working on a validator to print a warning after the simulation is initialized if needed.

It would actually be cool if we could factor in all initial conditions for the validation somehow. Like, how the chosen resolution, number of particles, particle shape, activated ionization and collisions all affect this issue.
To complete the tests:
We checked whether ionization (ADK) helps to reduce the fluctuation.
@steindev asked me to test whether reducing the pulse length of the laser (plane wave) helps too, because reducing the pulse length results in a faster increase of the laser energy.
This simulation halves the pulse length.
Dear all, I think we need a dedicated approach to testing that brings everybody up to speed.
@HighIander I think we can close this issue, can't we?
I opened an issue so that we do not forget to add a temperature to the foil LCT example: #3438
Synopsis:
We performed a set of "simple" ion acceleration test simulations and compared them against Smilei. The results are sometimes in full agreement, sometimes off by more than 30%.
Those are the details:
We simulated a transversely periodic box with a transversely plane laser. The target was pure hydrogen (H + electrons). Other parameters: [coming soon!]
We varied a0 (5, 10, 20) and n (50, 100, 200). Our FOM is the evolution of the maximum H energy over time (cut-off edge, i.e. not the single most energetic ion).
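(A sketch of how the cut-off edge could be extracted from an energy histogram; this is my own illustration, not the analysis actually used, and the count threshold as well as the column layout of the `.dat` file are assumptions carried over from the plotting script earlier in this thread.)

```python
import numpy as np

def cutoff_energy(energies, counts, threshold=1.0):
    """Highest bin energy (MeV) whose particle count still exceeds `threshold`."""
    populated = np.nonzero(np.asarray(counts) > threshold)[0]
    return energies[populated[-1]] if populated.size else 0.0

# usage, reusing the slicing from the plotting script above
arr = np.loadtxt("runs/esirkepov_tsc_64ppc/0000/simOutput/H_energyHistogram_all.dat", skiprows=1)
energies = np.linspace(0, 250, 1024)[2:-2]
counts = arr[140, 2:-2][: energies.size]  # guard in case the row has extra columns
print(cutoff_energy(energies, counts), "MeV")
```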
Results:
1. Scanning a0:
2. Scanning n0:
3. Comparison of the spectrum of a0=20,n=200: