ykempf commented 9 years ago

Since 16f843bc there is significant differences after one step due to threading.

How to reproduce:

From 16f843bc onwards;
Turn off field solver and translation;
Propagate for 1 step with more than one thread per process;
vlsvdiff_DP on rho gives O(10⁻⁷).

Extra notes:

The initial state has a diff of exactly 0;
Not sensitive to DPF or SPF;
If translation is on, the diff is ~50 times bigger (I guess it amplifies the error introduced);
Pure MPI does not show the problem, hence it is related to threading.

Test cfg below (by no means minimal, can be stripped down further), runs on a single node in a few seconds, then use vlsvdiff_DP <...> rho 0.

project = Magnetosphere dynamic_timestep = 0 propagate_vlasov_translation = 0 propagate_vlasov_acceleration = 1 propagate_fields = 0

[io] diagnostic_write_interval = 10 write_initial_state = 0 restart_walltime_interval = -1 # 85000 number_of_restarts = 5

system_write_t_interval = 0.001 system_write_file_name = bulk system_write_distribution_stride = 0 system_write_distribution_xline_stride = 25 system_write_distribution_yline_stride = 1 system_write_distribution_zline_stride = 25

[bailout] write_restart = 0

[gridbuilder] x_length = 75 y_length = 1 z_length = 75 x_min = -110e6 x_max = 220e6 y_min = -2.2e6 y_max = 2.2e6 z_min = -165e6 z_max = 165e6 vx_min = -4.02e6 vx_max = +4.02e6 vy_min = -4.02e6 vy_max = +4.02e6 vz_min = -4.02e6 vz_max = +4.02e6 vx_length = 20 vy_length = 20 vz_length = 20

timestep_max = 100

t_max = 0.024 dt = 0.024

[fieldsolver] ohmHallTerm = 2 minCFL = 0.4 maxCFL = 0.45

extraResistivity = 1.0

set maximum velocity to

maxWaveVelocity = 14989622.9 #5% of speed of light...

maxWaveVelocity = 7494811.45 #2.5% of speed of light...

[vlasovsolver] minCFL = 0.8 maxCFL = 0.99 maxSlAccelerationRotation = 22 lorentzHallMinimumRho = 3.0e5

[loadBalance] rebalanceInterval = 50

[variables] output = Rho output = RhoV output = B output = PerturbedB output = E output = HallE output = PTensor output = BoundaryType output = BoundaryLayer output = derivs output = MaxVdt output = MaxRdt output = MaxFieldsdt output = Blocks output = MPIrank output = fSaved diagnostic = Blocks diagnostic = Rho

dr_backstream_radius = 5.0e5 dr_backstream_vx = -7.5e5 dr_backstream_vy = 0.0 dr_backstream_vz = 0.0

[boundaries] periodic_x = no periodic_y = yes periodic_z = no boundary = Outflow boundary = Maxwellian boundary = Ionosphere

[ionosphere] centerX = 0.0 centerY = 0.0 centerZ = 0.0 rho = 1.0e6 VX0 = 0.0 VY0 = 0.0 VZ0 = 0.0 geometry = 1 radius = 30.0e6 taperRadius = 100.0e6 precedence = 2

[outflow] face = x- face = z- face = z+ precedence = 3

[maxwellian] dynamic = 0 face = x+ file_x+ = sw1.dat precedence = 4

[sparse] minValue = 1.0e-14

[Magnetosphere] T = 0.5e6 rho = 1.0e6 rhoTransitionCenter = 13.0e12 rhoTransitionWidth = 2.0e6 dipoleType = 2 dipoleMirrorLocationX = 439.56e6

VX0 = -7.5e5 VY0 = 0.0 VZ0 = 0.0

constBgBX = 0.0 constBgBY = 0.0 constBgBZ = 0.0

noDipoleInSW = 1.0

nSpaceSamples = 1 nVelocitySamples = 3

galfthan commented 9 years ago

Thank's for forking through the versions!

In vlasovmover.cpp on line 280 we use a non-threadsafe rng. That is why we do not get reproducible results, even if we try to get those via the seeding. Can be fixed by replacing with threadsafe rng (used in projects as far as I remember).

ykempf commented 9 years ago

Shall I move the rng in projects to some common files (and make a version for integers) or shall I make a similar one for this location?

galfthan commented 9 years ago

First test if it fixes the issue by adding it random_r & friends directly into vlasovmover.cpp.

It could make sense to make a common class for that, largest benefit would be that then one could easily change the rng throughout the program.

ykempf commented 9 years ago

and avoid code duplication ;) OK I'll test.

sandroos commented 9 years ago

In this particular case it is possible to randomize the mapping order before going to the loop on line 279. Generally speaking the only way to make parallel & multithreaded simulations almost reproducible is to have the master process generate random number seeds for all processes based on a given config file seed value.

Almost reproducible, because Zoltan still picks a non-reproducible seed value, so the mesh is not partitioned exactly the same way between runs.

ykempf commented 9 years ago

or use our reentrant generator with a seed including the cellID, here's my suggestion which fixes the bug, if people are happy I'll commit it:

diff --git a/vlasovsolver/vlasovmover.cpp b/vlasovsolver/vlasovmover.cpp index b8fdc17..ee81695 100644 --- a/vlasovsolver/vlasovmover.cpp +++ b/vlasovsolver/vlasovmover.cpp @@ -246,7 +246,6 @@ void calculateSpatialTranslation(

void calculateAcceleration( dccrg::Dccrg<SpatialCell,dccrg::Cartesian_Geometry>& mpiGrid, Real dt @@ -278,9 +277,23 @@ void calculateAcceleration(

pragma omp parallel for schedule(dynamic,1)

   for (size_t c=0; c<propagatedCells.size(); ++c) {
      const CellID cellID = propagatedCells[c];

//generate pseudo-random order which is always the same irrespectiive of parallelization, restarts, etc
srand(P::tstep + cellID);
uint map_order=rand()%3;
//generate pseudo-random order which is always the same irrespectiive of parallelization, restarts, etc
char rngStateBuffer[256];
random_data rngDataBuffer;
// set seed, initialise generator and get value
memset(&(rngDataBuffer), 0, sizeof(rngDataBuffer));
ifdef _AIX
initstate_r(P::tstep + cellID, &(rngStateBuffer[0]), 256, NULL, &(rngDataBuffer));
int64_t rndInt;
random_r(&rndInt, &rngDataBuffer);
else
initstate_r(P::tstep + cellID, &(rngStateBuffer[0]), 256, &(rngDataBuffer));
int32_t rndInt;
random_r(&rngDataBuffer, &rndInt);
endif
uint map_order=rndInt%3; phiprof::start("cell-semilag-acc"); cpu_accelerate_cell(mpiGrid[cellID],map_order,dt); phiprof::stop("cell-semilag-acc");

galfthan commented 9 years ago

Looks ok. AIX case is not that relevant anymore though but you never know...

Arto: It's the seeding that keeps it reproducible. It is not the nicest way to get hhigh quality random numbers, but for this purpose it is ok. Doing it before the loop would be different, then all cells would be mapped in the same order on one timestep. Now there is more randomness.

ykempf commented 9 years ago

Well, before one goes through the same pain as Ilja had to on that ECMWF machine or whichever it was, let's do it here in this way too...

ykempf commented 9 years ago

I like this automatic closing. ;)

iljah commented 9 years ago

/Almost/ reproducible, because Zoltan still picks a non-reproducible seed value, so the mesh is not partitioned exactly the same way between runs.

Might be reproducible if using graph partitioning with parmetis which does take a seed value: http://www.cs.sandia.gov/zoltan/ug_html/ug_alg_parmetis.html Dunno if it has to be identical on all processes or cannot be or if it only matters on a subset of procs...

fmihpc / vlasiator

Threading causes significant diffs in acceleration #62

timestep_max = 100

extraResistivity = 1.0

set maximum velocity to

maxWaveVelocity = 14989622.9 #5% of speed of light...

pragma omp parallel for schedule(dynamic,1)

ifdef _AIX

else

endif