fmihpc / vlasiator

Vlasiator - ten letters you can count on
https://www.helsinki.fi/en/researchgroups/vlasiator

disabling NDEBUG flag leads to crash on voima #155

Closed ohannuks closed 9 years ago

ohannuks commented 9 years ago

hannukse@voima-login1:/lustre/tmp/hannukse/population_runs> aprun -n 1 vlasiator --run_config MultiPeak.cfg
_pmiu_daemon(SIGCHLD): [NID 00008] [c0-0c0s2n0] [Tue May 26 12:50:53 2015] PE RANK 0 exit signal Segmentation fault
Application 12139762 exit codes: 139
Application 12139762 resources: utime ~1s, stime ~0s, Rss ~3968, inblocks ~16011, outblocks ~43639
hannukse@voima-login1:/lustre/tmp/hannukse/population_runs>

galfthan commented 9 years ago

Do you mean disabling DNDEBUG flag?

ohannuks commented 9 years ago

yup, tested with MultiPeak only so far, now running Magnetosphere.cfg

ykempf commented 9 years ago

Maybe that loads the FP exception tool, which is known to crash in zoltan LB for example?

ohannuks commented 9 years ago

The same happens with Magnetosphere.cfg, so it's not project dependent. The error message is different, though:

hannukse@voima-login1:/lustre/tmp/hannukse/population_runs> aprun -n 20 vlasiator --run_config Magnetosphere.cfg 
Parameter data file (sw1.dat) has 1 values
[NID 00008] 2015-05-26 13:03:05 Apid 12139779: initiated application termination 
Application 12139779 exit codes: 1
Application 12139779 resources: utime ~11s, stime ~1s, Rss ~12100, inblocks ~16000, outblocks ~43640

Both work with the DNDEBUG flag.
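One way a build flag like this can change behaviour is by deciding whether debug-only code is compiled at all. The following is a minimal sketch of that mechanism, not Vlasiator's actual setup: the macro `EXAMPLE_DEBUG_CHECKS` and both functions are hypothetical names, and gating the macro on `NDEBUG` is an assumption made for illustration.

```cpp
#include <cassert>

// Assumption for illustration: debug checks are switched on whenever NDEBUG
// is absent, i.e. when the code is built without -DNDEBUG.
// EXAMPLE_DEBUG_CHECKS is a hypothetical macro name.
#ifndef NDEBUG
   #define EXAMPLE_DEBUG_CHECKS
#endif

// Reports whether the debug-only branch was compiled into this build.
constexpr bool debugChecksCompiledIn() {
#ifdef EXAMPLE_DEBUG_CHECKS
   return true;
#else
   return false;
#endif
}

// Hypothetical validation routine: returns 1 if the debug check rejects the
// block ID, 0 otherwise. With -DNDEBUG the check does not exist in the binary.
int validateBlock(unsigned int blockGID, unsigned int invalidGID) {
#ifdef EXAMPLE_DEBUG_CHECKS
   if (blockGID == invalidGID) {
      return 1;
   }
#else
   (void)blockGID;   // silence unused-parameter warnings in release builds
   (void)invalidGID;
#endif
   return 0;
}
```

Built with `-DNDEBUG`, `validateBlock` always returns 0 because the check is preprocessed away; built without it, an invalid ID is rejected. Any bug inside such a guarded block therefore only shows up in builds without `-DNDEBUG`.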

galfthan commented 9 years ago

Check what yann said.

Then:
compile it with -g
ulimit -c unlimited
run
gdb vlasiator core
bt
report here

You can also check it with totalview

ykempf commented 9 years ago

If it is not the FPE tool (which I nevertheless suspect strongly) it might well be that some real check somewhere exits although it would be benign (maybe we have a check before initialising a value and never paid attention in further development).

ohannuks commented 9 years ago

It seems to be in the adjust_velocity_blocks.

I'll also debug with totalview

[New LWP 9161]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `vlasiator --run_config MultiPeak.cfg'.
Program terminated with signal 11, Segmentation fault.
#0  je_free (ptr=0xe47910) at include/jemalloc/internal/arena.h:532
532         return (arena_mapbitsp_read(arena_mapbitsp_get(chunk, pageind)));
(gdb) bt
#0  je_free (ptr=0xe47910) at include/jemalloc/internal/arena.h:532
#1  0x0000000000403ea1 in operator delete (p=0xe47910) at memoryallocation.cpp:35
#2  0x0000000000820f85 in (anonymous namespace)::run (p=<optimized out>)
    at ../../../../cray-gcc-4.8.2/libstdc++-v3/libsupc++/atexit_thread.cc:66
#3  0x00000000008867ed in __run_exit_handlers (status=1, listp=0xdf8e00 <__exit_funcs>,
    run_list_atexit=true) at exit.c:78
#4  0x0000000000886843 in exit (status=14973200) at exit.c:100
#5  0x00000000004eb471 in spatial_cell::SpatialCell::adjust_velocity_blocks (this=0x2aaaab05a010,
    spatial_neighbors=..., doDeleteEmptyBlocks=true) at spatial_cell.cpp:104
#6  0x000000000048146d in adjustVelocityBlocks(dccrg::Dccrg<spatial_cell::SpatialCell, dccrg::Cartesian_Geometry>&, std::vector<unsigned long, std::allocator<unsigned long> > const&, bool) [clone ._omp_fn.4]   
    () at grid.cpp:418
#7  0x000000000047eb40 in adjustVelocityBlocks (mpiGrid=..., cellsToAdjust=...,
    doPrepareToReceiveBlocks=true) at grid.cpp:396
#8  0x000000000047c1c9 in initializeGrid (argn=3, argc=0x7fffffff9e98, mpiGrid=..., sysBoundaries=...,   
    project=...) at grid.cpp:165
#9  0x00000000004c3af3 in main (argn=3, args=0x7fffffff9e98) at vlasiator.cpp:282

ohannuks commented 9 years ago

According to totalview the crash is here (adjust velocity blocks) :

         #ifdef DEBUG_SPATIAL_CELL
            if (blockGID == invalid_global_id())
               cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);               
         #endif

galfthan commented 9 years ago

The crash is caused by jemalloc after this "unplanned" exit. That is a bit nasty, but more worrying is why the if statement would be entered at all. At that point there should not be any invalid global IDs. This suggests the velocity mesh has a bug. Is this in master or in your own branch where you touch the meshes?

ykempf commented 9 years ago

Not the same crash location in gdb and totalview. Are you running serial?

ohannuks commented 9 years ago

It's a branch, but I only recently branched off from master and hadn't made any changes to the code except including a header file. I did see that it crashed on master as well. I can run gdb there as well. I'm running MultiPeak in serial.

ykempf commented 9 years ago

All right, now that I have had time to glance at the code: the DNDEBUG and CATCH_FPE flags are independent of each other.

ohannuks commented 9 years ago

ok great! One less thing to be concerned about

ykempf commented 9 years ago

OK I can reproduce too, the question is whether this is benign or not...

ykempf commented 9 years ago

Happens when setting the initial state still.

galfthan commented 9 years ago

Well, I do not like it. I cannot debug it now since I am preparing some stuff for training. The key question is why it finds an invalid global ID in the first place.

ykempf commented 9 years ago

@sandroos, @ohannuks, problem identified.

#ifdef DEBUG_SPATIAL_CELL
   if (blockGID == invalid_global_id())
      cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);
#endif

Left as an exercise to the OP. Hints: One instruction per line and always put curly brackets for if statements.
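For readers who want the hint spelled out: without braces, only the first statement after the `if` is conditional, so the second statement on that line runs every time. The sketch below reproduces that structure in isolation. Everything in it is a hypothetical stand-in (`GlobalID`, the value returned by `invalid_global_id()`, `CheckResult`, and the two functions are not Vlasiator's definitions), and a `wouldExit` flag replaces the real `exit(1)` so the behaviour can be inspected.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical stand-ins for illustration only; not Vlasiator's real definitions.
using GlobalID = unsigned int;
constexpr GlobalID invalid_global_id() { return 0xFFFFFFFFu; }

struct CheckResult {
   bool wouldExit;      // true where the original code would reach exit(1)
   std::string message; // text the original code would send to cerr
};

// Same shape as the check in the thread: no braces, two statements on one line.
// Only the log statement is guarded; the assignment runs unconditionally.
CheckResult checkUnbraced(GlobalID blockGID) {
   std::ostringstream log;
   bool wouldExit = false;
   if (blockGID == invalid_global_id())
      log << "Got invalid block"; wouldExit = true;
   return {wouldExit, log.str()};
}

// Braced version, one statement per line: both statements are guarded.
CheckResult checkBraced(GlobalID blockGID) {
   std::ostringstream log;
   bool wouldExit = false;
   if (blockGID == invalid_global_id()) {
      log << "Got invalid block";
      wouldExit = true;
   }
   return {wouldExit, log.str()};
}

// Small self-check showing the difference for a valid block ID.
void demonstrate() {
   assert(checkUnbraced(1).wouldExit);        // "exits" although the ID is valid
   assert(checkUnbraced(1).message.empty());  // and prints nothing first
   assert(!checkBraced(1).wouldExit);         // braced version carries on
}
```

In the unbraced form a valid block ID still ends in the exit path and no message is printed, which matches the symptom in the thread: the program terminated with status 1 and gave no hint where. Recent GCC and Clang versions can flag this pattern with `-Wmisleading-indentation`.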

ohannuks commented 9 years ago

Made a pull request https://github.com/fmihpc/vlasiator/pull/156

iljah commented 9 years ago

| #ifdef DEBUG_SPATIAL_CELL
|    if (blockGID == invalid_global_id())
|       cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);
| #endif

Shouldn't this be checked always instead of only when debugging?

iljah commented 9 years ago

| if (blockGID == invalid_global_id()) cerr << "Got invalid block at " << __FILE__ << ' ' << __LINE__ << endl; exit(1);
| Left as an exercise to the OP. Hints: One instruction per line and /always/ put curly brackets for if statements.

Lol, I did wonder why a debugger was needed to identify where the program was exiting, this is a good explanation :)