Closed · ohannuks closed this issue 9 years ago
Do you mean disabling DNDEBUG flag?
yup, tested with MultiPeak only so far, now running Magnetosphere.cfg
Maybe that loads the FP exception tool, which is known to crash in zoltan LB for example?
The same happens with Magnetosphere.cfg, so it's not project dependent. The error message is different though:
hannukse@voima-login1:/lustre/tmp/hannukse/population_runs> aprun -n 20 vlasiator --run_config Magnetosphere.cfg
Parameter data file (sw1.dat) has 1 values
[NID 00008] 2015-05-26 13:03:05 Apid 12139779: initiated application termination
Application 12139779 exit codes: 1
Application 12139779 resources: utime ~11s, stime ~1s, Rss ~12100, inblocks ~16000, outblocks ~43640
Both work with the DNDEBUG flag.
Check what yann said.
Then compile it with -g, set ulimit -c unlimited, run it, then gdb vlasiator core, bt, and report the backtrace here.
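Spelled out, the suggested workflow might look like this (the make flags and the aprun/gdb lines are a sketch of a typical Cray setup, not the exact Voima commands):

```shell
# Rebuild with debug symbols first (exact make flags are an assumption):
#   make clean && make FLAGS="-g -O0"

# Allow core dumps in this shell before reproducing the crash:
ulimit -c unlimited
ulimit -c                      # prints the current core-file size limit

# Reproduce the crash, then inspect the core file:
#   aprun -n 1 vlasiator --run_config MultiPeak.cfg
#   gdb ./vlasiator core
#   (gdb) bt                   # paste the backtrace in this issue
```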
You can also check it with totalview
If it is not the FPE tool (which I nevertheless suspect strongly) it might well be that some real check somewhere exits although it would be benign (maybe we have a check before initialising a value and never paid attention in further development).
It seems to be in the adjust_velocity_blocks.
I'll also debug with totalview
[New LWP 9161]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `vlasiator --run_config MultiPeak.cfg'.
Program terminated with signal 11, Segmentation fault.
#0 je_free (ptr=0xe47910) at include/jemalloc/internal/arena.h:532
532 return (arena_mapbitsp_read(arena_mapbitsp_get(chunk, pageind)));
(gdb) bt
#0 je_free (ptr=0xe47910) at include/jemalloc/internal/arena.h:532
#1 0x0000000000403ea1 in operator delete (p=0xe47910) at memoryallocation.cpp:35
#2 0x0000000000820f85 in (anonymous namespace)::run (p=<optimized out>)
at ../../../../cray-gcc-4.8.2/libstdc++-v3/libsupc++/atexit_thread.cc:66
#3 0x00000000008867ed in __run_exit_handlers (status=1, listp=0xdf8e00 <__exit_funcs>,
run_list_atexit=true) at exit.c:78
#4 0x0000000000886843 in exit (status=14973200) at exit.c:100
#5 0x00000000004eb471 in spatial_cell::SpatialCell::adjust_velocity_blocks (this=0x2aaaab05a010,
spatial_neighbors=..., doDeleteEmptyBlocks=true) at spatial_cell.cpp:104
#6 0x000000000048146d in adjustVelocityBlocks(dccrg::Dccrg<spatial_cell::SpatialCell, dccrg::Cartesian_Geometry>&, std::vector<unsigned long, std::allocator<unsigned long> > const&, bool) [clone ._omp_fn.4]
() at grid.cpp:418
#7 0x000000000047eb40 in adjustVelocityBlocks (mpiGrid=..., cellsToAdjust=...,
doPrepareToReceiveBlocks=true) at grid.cpp:396
#8 0x000000000047c1c9 in initializeGrid (argn=3, argc=0x7fffffff9e98, mpiGrid=..., sysBoundaries=...,
project=...) at grid.cpp:165
#9 0x00000000004c3af3 in main (argn=3, args=0x7fffffff9e98) at vlasiator.cpp:282
According to totalview the crash is here (adjust_velocity_blocks):
#ifdef DEBUG_SPATIAL_CELL
if (blockGID == invalid_global_id())
cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);
#endif
The crash is caused by jemalloc after this "unplanned" exit. That is a bit nasty, but more worrying is why the if statement would be entered at all. At that point there should not be any invalid global IDs. This suggests the velocity mesh has a bug. Is this in master or in your own branch where you touch the meshes?
Not the same crash location in gdb and totalview. Are you running serial?
It's a branch, but I just recently branched off from master; I hadn't made any changes to the code except including a header file. I did see that it crashed on master as well. I can call gdb there as well. I'm running MultiPeak in serial.
All right, now I had time to cast a glance at the code: the DNDEBUG and CATCH_FPE flags are independent of each other.
ok great! One less thing to be concerned about
OK I can reproduce too, the question is whether this is benign or not...
It still happens when setting the initial state.
Well I do not like it. I cannot debug it now since I prepare some stuff for training. The key question is why it finds an invalid global id in the first place.
@sandroos, @ohannuks, problem identified.
#ifdef DEBUG_SPATIAL_CELL
if (blockGID == invalid_global_id())
cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);
#endif
Left as an exercise to the OP. Hints: One instruction per line and always put curly brackets for if statements.
Made a pull request https://github.com/fmihpc/vlasiator/pull/156
#ifdef DEBUG_SPATIAL_CELL
if (blockGID == invalid_global_id())
cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);
#endif
Shouldn't this be checked always instead of only when debugging?
if (blockGID == invalid_global_id())
cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);
Left as an exercise to the OP. Hints: One instruction per line and always put curly brackets for if statements.
Lol, I did wonder why a debugger was needed to identify where the program was exiting, this is a good explanation :)
hannukse@voima-login1:/lustre/tmp/hannukse/population_runs> aprun -n 1 vlasiator --run_config MultiPeak.cfg
_pmiu_daemon(SIGCHLD): [NID 00008] [c0-0c0s2n0] [Tue May 26 12:50:53 2015] PE RANK 0 exit signal Segmentation fault
Application 12139762 exit codes: 139
Application 12139762 resources: utime ~1s, stime ~0s, Rss ~3968, inblocks ~16011, outblocks ~43639
hannukse@voima-login1:/lustre/tmp/hannukse/population_runs>