NaNs in E and B after init

iljah commented 7 years ago

Using this code:

    for infilename in sys.argv[1:]:
        infile = pytools.vlsvfile.VlsvReader(infilename)

        mesh_size = infile.get_spatial_mesh_size()
        cells = [c for c in range(1, int(mesh_size[0] * mesh_size[1] * mesh_size[2] + 1))]

        plot_dims = []
        if mesh_size[0] > 1:
            plot_dims.append(mesh_size[0])
        if mesh_size[1] > 1:
            plot_dims.append(mesh_size[1])
        if mesh_size[2] > 1:
            plot_dims.append(mesh_size[2])

        var = infile.read_variable('B', cellids = cells)
        var = numpy.reshape(var, plot_dims + [3])
        if numpy.sum(numpy.isnan(var)) > 0:
            exit('NaNs encountered when plotting B')

I get NaNs from the bulk.0000000.vlsv file. I'm using this config and pretty sure I haven't touched anything related to field solver, how should I debug this?

project = Magnetosphere

[io]
diagnostic_write_interval = 1
write_initial_state = 0
restart_walltime_interval = 21000
number_of_restarts = 1000

system_write_t_interval = 1
system_write_file_name = bulk
system_write_distribution_stride = 0
system_write_distribution_xline_stride = 0
system_write_distribution_yline_stride = 0
system_write_distribution_zline_stride = 0

system_write_t_interval = 25
system_write_file_name = distributions
system_write_distribution_stride = 0
system_write_distribution_xline_stride = 15
system_write_distribution_yline_stride = 15
system_write_distribution_zline_stride = 15

system_write_t_interval = 500
system_write_file_name = fullf
system_write_distribution_stride = 1
system_write_distribution_xline_stride = 0
system_write_distribution_yline_stride = 0
system_write_distribution_zline_stride = 0

[gridbuilder]
x_length = 50
y_length = 1
z_length = 50
x_min = -2.0e8
x_max = 2.0e8
y_min = -4e6
y_max = 4e6
z_min = -2.0e8
z_max = 2.0e8
vx_min = -1.0e6
vx_max = +1.0e6
vy_min = -1.0e6
vy_max = +1.0e6
vz_min = -1.0e6
vz_max = +1.0e6
vx_length = 10
vy_length = 10
vz_length = 10

timestep_max = 10000000
t_max = 1

[loadBalance]
rebalanceInterval = 10

[variables]
output = Rho
output = B
output = VolB
output = E
output = VolE
output = Pressure
output = RhoV
output = BoundaryType
output = MPIrank
output = derivs
output = BVOLderivs
output = BoundaryLayer
output = BackgroundB
output = PerturbedB
output = LBweight
output = MaxVdt
output = MaxRdt
output = MaxFieldsdt
output = Blocks
output = PTensor
output = fSaved
diagnostic = Blocks
diagnostic = RhoLossAdjust
diagnostic = RhoLossVelBoundary
diagnostic = MaxDistributionFunction
diagnostic = MinDistributionFunction

[boundaries]
periodic_x = no
periodic_y = yes
periodic_z = no
boundary = Outflow
boundary = Maxwellian
boundary = Ionosphere

[ionosphere]
centerX = 0.0
centerY = 0.0
centerZ = 0.0
rho = 1.0e6
VX0 = 0.0
VY0 = 0.0
VZ0 = 0.0
radius = 38.0e6
taperRadius = 1e3
precedence = 2
geometry = 3

[outflow]
face = x-
face = z-
face = z+
precedence = 3

[maxwellian]
dynamic = 0
face = x+
file_x+ = sw1.dat
precedence = 4

[sparse]
minValue = 1.0e-13

[Magnetosphere]
T = 100000.0
rho  = 1.0e4

VX0 = -1.0e5
VY0 = 0.0
VZ0 = 0.0

constBgBX = 0.0
constBgBY = 0.0
constBgBZ = 0.0

noDipoleInSW = 0.0

nSpaceSamples = 3
nVelocitySamples = 1

dipoleType = 1

When plotting later times I see an expanding white diamond from earth which is the nans expanding to fill everything. The sum for isnan(B) is 522.

ykempf commented 7 years ago

We haven't been keeping cfg files really properly up to date in the directories...

By eye, I'd say that the phase space threshold is too high and the magnetosphere density too low.

ykempf commented 7 years ago

This one I updated yesterday for master:

project = Magnetosphere

[io]
diagnostic_write_interval = 10
write_initial_state = 0
restart_walltime_interval = 21000
number_of_restarts = 1000

system_write_t_interval = 1
system_write_file_name = bulk
system_write_distribution_stride = 0
system_write_distribution_xline_stride = 0
system_write_distribution_yline_stride = 0
system_write_distribution_zline_stride = 0

[gridbuilder]
x_length = 200
y_length = 1
z_length = 200
x_min = -2.0e8
x_max = 2.0e8
y_min = -1e6
y_max = 1e6
z_min = -2.0e8
z_max = 2.0e8

vx_min = -4.0e6
vx_max = +4.0e6
vy_min = -4.0e6
vy_max = +4.0e6
vz_min = -4.0e6
vz_max = +4.0e6
vx_length = 40
vy_length = 40
vz_length = 40

timestep_max = 10000000
t_max = 100.0

[sparse]
minValue = 1.0e-13

[fieldsolver]
minCFL = 0.4
maxCFL = 0.5
maxSubcycles = 50
ohmGradPeTerm = 0
ohmHallTerm = 2

[Magnetosphere]
constBgBX = 0.0
constBgBY = 0.0
constBgBZ = -5.0e-9

noDipoleInSW = 0.0

[ionosphere]
centerX = 0.0
centerY = 0.0
centerZ = 0.0
radius = 38.0e6
taperRadius = 80.0e6

rho = 1.0e6
VX0 = 0.0
VY0 = 0.0
VZ0 = 0.0

precedence = 2

[Magnetosphere]
T = 100000.0
rho  = 1.0e6

VX0 = -7.5e5
VY0 = 0.0
VZ0 = 0.0

nSpaceSamples = 3
nVelocitySamples = 1

[loadBalance]
rebalanceInterval = 10

[variables]
output = B
output = VolB
output = E
output = VolE
output = Pressure
output = RhoV
output = BoundaryType
output = MPIrank
output = derivs
output = BVOLderivs
output = BoundaryLayer
output = BackgroundB
output = PerturbedB
output = LBweight
output = MaxVdt
output = MaxRdt
output = MaxFieldsdt
output = Blocks
output = PTensor
output = fSaved
output = Rho

diagnostic = Blocks
diagnostic = RhoLossAdjust
diagnostic = RhoLossVelBoundary
diagnostic = MaxDistributionFunction
diagnostic = MinDistributionFunction

[boundaries]
periodic_x = no
periodic_y = yes
periodic_z = no
boundary = Outflow
boundary = Maxwellian
boundary = Ionosphere

[outflow]
precedence = 3
face = x-
face = z-
face = z+

[maxwellian]
dynamic = 0
face = x+
precedence = 4
file_x+ = sw1.dat

ykempf commented 7 years ago

And corresponding density in the solar wind.

iljah commented 7 years ago

Yours has the same minValue = 1.0e-13...

iljah commented 7 years ago

For 2d polar run shouldn't isphere geometry = 3?

ykempf commented 7 years ago

Yes but 100 times higher density if I remember correctly.

On 2 June 2017 10:25:56 EEST, Ilja notifications@github.com wrote:

Yours has the same minValue = 1.0e-13...

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/fmihpc/vlasiator/issues/334#issuecomment-305711005

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

ykempf commented 7 years ago

That is possibly wrong too... Urs, Sebastian, we should think of a cfg archive of some sort, we never keep the ones in the code repo up to date.

On 2 June 2017 10:27:42 EEST, Ilja notifications@github.com wrote:

For 2d polar run shouldn't isphere geometry = 3?

-- You are receiving this because you commented. Reply to this email directly or view it on GitHub: https://github.com/fmihpc/vlasiator/issues/334#issuecomment-305711350

-- Sent from my Android device with K-9 Mail. Please excuse my brevity.

iljah commented 7 years ago

I got closer to finding out what's happening: fieldsolver/derivatives.cpp:104 Zero density in spatial cell 0

I'll have to dig deeper but at least in that function rghtNbrID is never assigned a value so debug print probably has wrong id.

iljah commented 7 years ago

A bit more progress: fieldsolver/derivatives.cpp:95 Zero density, in -x neighbor of cell 2448, at 1.68e+08, -4e+06, 1.84e+08

My sim is 50x50 cells so 2447 that has 0 density is 4th layer from outer boundary. Does that mean that setVelocityBlock in project.cpp fails to initialize that cell with any blocks?

iljah commented 7 years ago

Moar progress, now I get: fieldsolver/ldz_volume.cpp:133 Illegal value: -nan

iljah commented 7 years ago

Currently this is where the nan first appears (fieldsolver/ldz_electric_field.cpp), I'll continue tracking:

   #ifndef FS_1ST_ORDER_SPACE
      // 2nd order terms
      Ey_SW += +HALF*((Bz_S - HALF*dBzdx_S)*(-derivs_SW[fs::dVxdx] - derivs_SW[fs::dVxdz]) - dBzdx_S*Vx0 + SIXTH*dBzdy_S*derivs_SW[fs::dVxdy]);

ykempf commented 7 years ago

Most likely some zero-density gets into a denominator. Zero density at the beginning usually means the vspace didn't "catch" your population.

iljah commented 7 years ago

Nah I got inf for density, current debug checks only work for non-negative densities...

iljah commented 7 years ago

And using CHECK_FLOAT(rght[cp::RHO]) around line 100 in derivatives.cpp gives a segfault.

iljah commented 7 years ago

I've gotten to point where I see inf density after:

rght = cellCache.cells[fs_cache::calculateNbrID(1  ,1  ,1+1)]->parameters;

in derivatives.cpp at "Calculate z-derivatives". Do you think cell cache could be responsible? Or maybe this data hasn't been initialized yet? Where would you check next?

ykempf commented 7 years ago

Inf density is a rare thing, I must say, in our experience... How early are you getting this? At initialisation still? What type of a system is this? Are the system boundaries properly set in the config file (cf. periodicity vs. which face gets what type of boundary and such)?

iljah commented 7 years ago

I'll have to check how early exactly but almost certainly before first dt is computed. Config file is same as I used above but adjusted based on your version. I'll send more details tomorrow.

iljah commented 7 years ago

This is last thing printed to logfile and there are no result files yet: (INIT): Set initial state.

ykempf commented 7 years ago

Can you try to check your config file with the tools/check_vlasiator_cfg.sh script? That might give hints at wrong/missing parameters. I really don't see how to get inf density, unless deliberately dividing by 0. Check that spatial sampling rates are 2 or more?

ykempf commented 7 years ago

nSpaceSamples and nVelocitySamples, wherever these parameters are useful. Setting them to 1 or 0 is dangerous.

iljah commented 7 years ago

Here's what I get, looks like velocitysamples was 1 and changing to 2 the problem seems to go away or at least doesn't occur as early:

------------------------------------------------------------------------------------------------------------
Available unused options
------------------------------------------------------------------------------------------------------------
AMR.coarsen_limit (=0.5)
AMR.max_velocity_level (=0)
AMR.refine_limit (=1)
AMR.vel_refinement_criterion
bailout.max_memory (=1073741824)
bailout.min_dt (=1e-06)
bailout.write_restart (=1)
dynamic_timestep (=1)
fieldsolver.diffusiveEterms (=1)
fieldsolver.electronTemperature (=0)
fieldsolver.maxWaveVelocity (=1e+20)
fieldsolver.resistivity (=0)
global_config
gridbuilder.dt (=0)
gridbuilder.geometry (=XYZ6D)
hallMinimumRho (=1)
help
ionosphere.reapplyUponRestart (=0)
io.restart_write_path (=./)
io.system_write_path
io.write_as_float (=0)
io.write_restart_stripe_factor (=-1)
loadBalance.algorithm (=RCB)
loadBalance.tolerance (=1.05)
Magnetosphere.dipoleMirrorLocationX (=-1)
Magnetosphere.dipoleScalingFactor (=1)
maxwellian.file_x- (=1)
maxwellian.file_y- (=1)
maxwellian.file_y+ (=1)
maxwellian.file_z- (=1)
maxwellian.file_z+ (=1)
maxwellian.nSpaceSamples (=2)
maxwellian.nVelocitySamples (=5)
maxwellian.reapplyUponRestart (=0)
outflow.faceNoFields
outflow.quench (=1)
outflow.reapplyUponRestart (=0)
outflow.vlasovScheme_face_x- (=Copy)
outflow.vlasovScheme_face_x+ (=Copy)
outflow.vlasovScheme_face_y- (=Copy)
outflow.vlasovScheme_face_y+ (=Copy)
outflow.vlasovScheme_face_z- (=Copy)
outflow.vlasovScheme_face_z+ (=Copy)
Project_common.seed (=42)
propagate_field (=1)
propagate_potential (=0)
propagate_vlasov_acceleration (=1)
propagate_vlasov_translation (=1)
restart.filename
run_config
sparse.blockAddWidthV (=1)
sparse.conserve_mass (=0)
sparse.dynamicAlgorithm (=0)
sparse.dynamicBulkValue1 (=0)
sparse.dynamicBulkValue2 (=0)
sparse.dynamicMinValue1 (=1)
sparse.dynamicMinValue2 (=1)
user_config
variables.dr_backstream_radius (=468621)
variables.dr_backstream_vx (=-500000)
variables.dr_backstream_vy (=0)
variables.dr_backstream_vz (=0)
version
vlasovsolver.maxCFL (=0.99)
vlasovsolver.maxSlAccelerationRotation (=25)
vlasovsolver.maxSlAccelerationSubcycles (=1)
vlasovsolver.minCFL (=0.8)
------------------------------------------------------------------------------------------------------------
No invalid options

ykempf commented 7 years ago

In some places that sample number is taken (n - 1) in a denominator to determine sampling spacing. So there you'd get a division by zero. The parameters left as default in your post above seem ok to me.

iljah commented 7 years ago

Nope, using 2 samples instead of 1 just makes initialization take much longer but I still get inf after rght assignment above.

iljah commented 7 years ago

Problem seems to occur only if x_length, z_length >= 50.

ykempf commented 7 years ago

Can you provide the location or cfg file?

iljah commented 7 years ago

As above:

rght = cellCache.cells[fs_cache::calculateNbrID(1  ,1  ,1+1)]->parameters;

in derivatives.cpp at "Calculate z-derivatives". cfg file is

project = Magnetosphere

[io]
diagnostic_write_interval = 10
write_initial_state = 0
restart_walltime_interval = 21000
number_of_restarts = 1000

system_write_t_interval = 1
system_write_file_name = bulk
system_write_distribution_stride = 0
system_write_distribution_xline_stride = 0
system_write_distribution_yline_stride = 0
system_write_distribution_zline_stride = 0

system_write_t_interval = 25
system_write_file_name = distributions
system_write_distribution_stride = 0
system_write_distribution_xline_stride = 15
system_write_distribution_yline_stride = 15
system_write_distribution_zline_stride = 15

system_write_t_interval = 500
system_write_file_name = fullf
system_write_distribution_stride = 1
system_write_distribution_xline_stride = 0
system_write_distribution_yline_stride = 0
system_write_distribution_zline_stride = 0

[gridbuilder]
x_length = 40
y_length = 1
z_length = 50
x_min = -2.0e8
x_max = 2.0e8
y_min = -4e6
y_max = 4e6
z_min = -2.0e8
z_max = 2.0e8
vx_min = -4.0e6
vx_max = +4.0e6
vy_min = -4.0e6
vy_max = +4.0e6
vz_min = -4.0e6
vz_max = +4.0e6
vx_length = 10
vy_length = 10
vz_length = 10

timestep_max = 10000000
t_max = 100

[loadBalance]
rebalanceInterval = 10

[variables]
output = Rho
output = B
output = VolB
output = E
output = VolE
output = Pressure
output = RhoV
output = BoundaryType
output = MPIrank
output = derivs
output = BVOLderivs
output = BoundaryLayer
output = BackgroundB
output = PerturbedB
output = LBweight
output = MaxVdt
output = MaxRdt
output = MaxFieldsdt
output = Blocks
output = PTensor
output = fSaved
diagnostic = Blocks
diagnostic = RhoLossAdjust
diagnostic = RhoLossVelBoundary
diagnostic = MaxDistributionFunction
diagnostic = MinDistributionFunction

[boundaries]
periodic_x = no
periodic_y = yes
periodic_z = no
boundary = Outflow
boundary = Maxwellian
boundary = Ionosphere

[ionosphere]
centerX = 0.0
centerY = 0.0
centerZ = 0.0
rho = 1e6
VX0 = 0.0
VY0 = 0.0
VZ0 = 0.0
radius = 38e6
taperRadius = 1e3
precedence = 2
geometry = 3

[outflow]
face = x-
face = z-
face = z+
precedence = 3

[maxwellian]
dynamic = 0
face = x+
file_x+ = sw1.dat
precedence = 4

[sparse]
minValue = 1e-15

[fieldsolver]
minCFL = 0.4
maxCFL = 0.5
maxSubcycles = 50
ohmGradPeTerm = 0
ohmHallTerm = 2

[Magnetosphere]
T = 1e5
rho  = 1e6

VX0 = -7e5
VY0 = 0.0
VZ0 = 0.0

constBgBX = 0.0
constBgBY = 0.0
constBgBZ = 0.0

noDipoleInSW = 0.0

nSpaceSamples = 2
nVelocitySamples = 2

dipoleType = 1

although above I have 40x50 cell simulation and that triggers a nan density in a different place, still investigating.

iljah commented 7 years ago

With above config I get nan (std::isnan(cent[cp::RHO_DT2])) after line 73 in derivatives.cpp: creal* cent = cellCache.cells[fs_cache::calculateNbrID(1,1,1)]->parameters;

iljah commented 7 years ago

Note that nan compares false to everything I think so it isn't caught by the check a bit afterwards: cent[cp::RHO] <= 0

ykempf commented 7 years ago

Well I meant run location if accessible to me. Thx for the cfg.

iljah commented 7 years ago

Ah run location is my laptop, projects/Magnetosphere/Magnetosphere.cfg. I sent the source by email.

ykempf commented 7 years ago

Cannot reproduce, your cfg runs smoothly on a single cray node. With some random sw1.dat I copied from some of my own. Check that your sw1.dat is sane?

iljah commented 7 years ago

Check that your sw1.dat is sane?

I used the one in the magnetosphere project directory in package I sent: cat sw1.dat 0.0 1.0e6 1.0e5 -7e5 0 0 0 0 -5e-9

ykempf commented 7 years ago

OK. I didn't actually open your archive... This looks good to me, I have the same values.

I used revision 4c5e4b5f.

iljah commented 7 years ago

Hmm ok, which compiler? I use g++ (GCC) 6.3.1 20161221 (Red Hat 6.3.1-1)

ykempf commented 7 years ago

aprun vlasiator --version

----------- Compilation --------- 
date:       Wed Jun  7 16:54:33 EEST 2017
folder:     /homeappl/home/kempf/vlasiator 
CMP:        CC 
CXXFLAGS:   -DMPICH_IGNORE_CXX_SEEK -O3 -fopenmp -funroll-loops -std=c++11 -W -Wall -Wno-unused -fabi-version=0 -mavx2  -DPAPI_MEM -DUSE_JEMALLOC -DJEMALLOC_NO_DEMANGLE -DPROFILE -DNDEBUG -DACC_SEMILAG_PQM -DTRANS_SEMILAG_PPM  -I/proj/vlasiato/libraries//mpich2/7.2.6/gcc/5.1.0/phiprof/2.0/include  -I/proj/vlasiato/libraries//mpich2/7.2.6/gcc/5.1.0/jemalloc/4.0.4/include  -DDP  -DSPF -DVEC8F_AGNER 
FLAGS:       
INC_MPI:     
INC_DCCRG:  -I/proj/vlasiato/libraries//dccrg/ 
INC_ZOLTAN: -I/opt/cray/trilinos/12.8.1.0/GNU/5.1/x86_64/include 
INC_BOOST:  -I/opt/cray/trilinos/12.8.1.0/GNU/5.1/x86_64/include/boost 

----------- git branch --------- 
  fix_to_inflow_bug_AEA
* master

----------- git log (last 10 commits) --------- 
4c5e4b5f29d40325e1b79985aea92543a1926536 Merge pull request #324 from galfthan/prefetch_for_translation
079b190c66ac6b3e6ef3e987d1e987f755628eb8 Merge pull request #328 from ykempf/fix_to_inflow_bug_AEA
97fad11004f1a23538d066ef7d7a2847b14e8d16 Set second-order derivatives to 0 on layer 1 cells. Fix suggestion to the initial AEA crash.
216871e848498b5c481192a71047f605dae31031 Merge pull request #327 from ursg/flowthrough_b
c8713f9ae6c33b2a314329265bd0fc842be6c54c In flowthrough test, assign B field only to BgB, not twice.
06f5d4df2d2e8380390707d808b9accb64e75456 Merge pull request #326 from ursg/memory_bailout_trigger
dfbe7951c4f32dab70b546e2e0cd1a2ad4eb3e14 GiB instead of GB in memory bailout, and it's default value is 1 EiB
711aeb29172d1f957cae25780da6e276ff5d0248 Use max memory instead of average, default max_memory now 1 Exabyte/Node
77e115dd2829d0432ed8d7059e7c53e44509b9e3 Allow a maximum memory per node bailout trigger to be set.
59e50365878559b93e3b6aec83b7bb76f6dde6f2 Added prefetching for load and store in translation

----------- module list --------- 
Currently Loaded Modulefiles:
  1) modules/3.2.10.5
  2) eswrap/1.3.3-1.020200.1278.0
  3) switch/1.0-1.0502.60522.1.61.ari
  4) slurm
  5) gcc/6.2.0
  6) craype-haswell
  7) craype-network-aries
  8) craype/2.5.9
  9) cray-mpich/7.5.1
 10) totalview-support/1.2.0.13
 11) totalview/2017.0.12
 12) cray-libsci/16.11.1
 13) udreg/2.3.2-1.0502.10518.2.17.ari
 14) ugni/6.0-1.0502.10863.8.29.ari
 15) pmi/5.0.11
 16) dmapp/7.0.1-1.0502.11080.8.76.ari
 17) gni-headers/4.0-1.0502.10859.7.8.ari
 18) xpmem/0.1-2.0502.64982.5.3.ari
 19) job/1.5.5-0.1_2.0502.61936.2.32.ari
 20) dvs/2.5_0.9.0-1.0502.2188.1.116.ari
 21) alps/5.2.4-2.0502.9822.32.1.ari
 22) rca/1.0.0-2.0502.60530.1.62.ari
 23) atp/2.0.5
 24) PrgEnv-gnu/5.2.82
 25) papi/5.5.0.2
 26) cray-trilinos/12.8.1.0

----------- git status --------- 
# On branch master
# Your branch is ahead of 'origin/master' by 4 commits.
#
# Untracked files:
#   (use "git add <file>..." to include in what will be committed)
#
#       vlsvextract_DP
nothing added to commit but untracked files present (use "git add" to track)

----------- git diff ----------

iljah commented 7 years ago

Hmm library install locations imply gcc 5 but loaded module is for gcc 6. I guess that's close enough to my setup. I might try with a different compiler version.

ursg commented 6 years ago

Whoopsie, this was not closed yet, but accidentally auto-closed due to a typo in the text of pull request #345

ykempf commented 6 years ago

In any case, is this still an issue or have you solved it @iljah?

iljah commented 6 years ago

Haven't solved it but it's not an issue anymore, thanks.

Lähettäjä: Yann Pfau-Kempf notifications@github.com Lähetetty: 7. marraskuuta 2017 17:35:35 Vastaanottaja: fmihpc/vlasiator Kopio: Honkonen Ilja (FMI); Mention Aihe: Re: [fmihpc/vlasiator] NaNs in E and B after init (#334)

In any case, is this still an issue or have you solved it @iljahhttps://github.com/iljah?

- You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/fmihpc/vlasiator/issues/334#issuecomment-342520603, or mute the threadhttps://github.com/notifications/unsubscribe-auth/ACAEtB-iQD6GCpSf4agt3ZbdZWNuslN1ks5s0HjHgaJpZM4NrZ7X.

fmihpc / vlasiator

NaNs in E and B after init #334