ICRAR / VELOCIraptor-STF

Galaxy/(sub)Halo finder for N-body simulations
MIT License

Negative densities / other crashes with latest master #53

Open JBorrow opened 3 years ago

JBorrow commented 3 years ago

Describe the bug

I've been trying to run the latest(ish) master of VR on some SWIFT outputs (on COSMA7), and I've been getting a couple of odd crashes.

To Reproduce

Version cb4336de8421b7bada3e158c44755c41e9fab78b.

Ran on snapshots under /snap7/scratch/dp004/dc-borr1/new_randomness_runs/runs/Run_*

Log files

STDOUT:

...
[ 528.257] [debug] search.cxx:2716 Substructure at sublevel 1 with 955 particles
[ 528.257] [debug] unbind.cxx:284 Unbinding 1 groups ...
[ 528.257] [debug] unbind.cxx:379 Finished unbinding in 1 [ms]. Number of groups remaining: 2

STDERR:

terminate called after throwing an instance of 'std::runtime_error'
  what():  Particle density not positive, cannot continue

Environment (please complete the following information):

cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native" -DVR_MPI=OFF -DVR_HDF5=ON -DVR_ALLOWPARALLELHDF5=ON -DVR_USE_HYDRO=ON

Currently Loaded Modulefiles:
 1) python/3.6.5               5) parallel_hdf5/1.8.20     
 2) ffmpeg/4.0.2               6) gsl/2.4(default)         
 3) intel_comp/2018(default)   7) fftw/3.3.7(default)      
 4) intel_mpi/2018             8) parmetis/4.0.3(default)  
rtobar commented 3 years ago

@james-trayford, since the introduction of the "no positive density" check we have found at least three different points in the code that were causing this issue in different ways.
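
For readers unfamiliar with that check, a minimal hypothetical sketch of the kind of guard it adds is shown below; this is not the actual VELOCIraptor source, and only the error message (quoted from the STDERR log above) is taken from the report:

#include <stdexcept>

// Hypothetical illustration of a "no positive density" guard: once local
// densities have been computed, any non-positive value aborts the run with
// the message quoted in the STDERR log above.
inline void check_density_positive(double density)
{
    if (density <= 0.0)
        throw std::runtime_error("Particle density not positive, cannot continue");
}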

I'd assume your issue is the third, and that by using MPI you reduced the number of OpenMP threads on each rank, thus avoiding the issue. An alternative workaround (I think, not fully certain) would be to set OMP_run_fof to 0 in the configuration file and try again. This disables OpenMP during the FOF search, so things would run noticeably more slowly, but execution should finish.
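
As a concrete illustration of that workaround, it amounts to a single line in the VELOCIraptor parameter file; the value 0 is an assumption here, taken to be the setting that switches the OpenMP FOF search off:

# Hypothetical excerpt from the parameter file, assuming OMP_run_fof=0
# disables the OpenMP FOF search.
OMP_run_fof=0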

pelahi commented 3 years ago

Hi @james-trayford, I will have a look using your snapshots. Any chance you could put them somewhere I could access them? @MatthieuSchaller should I email Adrian to see if I can access cosma now that I have changed institutions?

MatthieuSchaller commented 3 years ago

Do you still have access to gadi? I can copy things there, the setup is fairly small.

pelahi commented 3 years ago

Sadly no, but I can request access again.


MatthieuSchaller commented 3 years ago

Ok. Might be easier to revive your cosma account then. Feel free to email Adrian and cc me in.

MatthieuSchaller commented 3 years ago

Is there anything I can do here to help with this issue? Tests? Narrowing down a use case? Anything else?

pelahi commented 3 years ago

Hi @james-trayford, could you provide the config options you ran with? I am not encountering the error but it could be something specific. How many MPI ranks and OMP threads did you run with?

stuartmcalpine commented 2 years ago

I have also recently encountered this.

DMO zoom simulation, no MPI.

Has anyone made any progress with this? Can I help?

rtobar commented 2 years ago

@stuartmcalpine unfortunately no new progress has been made here. I think the best summary of the situation is https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-866507052, where a workaround (not a great one, but should work) is suggested. Older comments go into all the details on how this story has unfolded...

To fix this problem, the best approach would be to find a small, quickly reproducible example. As mentioned in that comment, this seems to be a problem with domain decomposition at high CPU counts, but that is only a guess. In any case, having the full details of your failure would definitely help.

stuartmcalpine commented 2 years ago

I am doing some zoom runs, no MPI, on-the-fly with SWIFT. For these tests I am on COSMA8 with 128 threads. The DMO version segfaults on the 4th invocation, and the hydro version later, around the 10th. But the DMO run consistently fails at the same place.

module load intel_comp/2018 intel_mpi/2018 fftw/3.3.7
module load gsl/2.5 hdf5/1.10.3

cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS="-fPIC" -DVR_ZOOM_SIM=ON -DVR_MPI=OFF -DVR_MPI_REDUCE=OFF -DVR_USE_SWIFT_INTERFACE=ON ..

Config file:

vrconfig_3dfof_subhalos_SO_dmo.txt

Last bit of log:

[1726.081] [debug] search.cxx:3982 Getting Hierarchy 23
[1726.081] [debug] search.cxx:4015 Done
[1726.726] [ info] substructureproperties.cxx:5047 Sort particles and compute properties of 23 objects
[1726.726] [debug] substructureproperties.cxx:5059 Calculate properties using minimum potential particle as reference
[1726.726] [debug] substructureproperties.cxx:5062 Sort particles by binding energy
[1795.700] [debug] substructureproperties.cxx:5087 Memory report at substructureproperties.cxx:5087@long long *SortAccordingtoBindingEnergy(Options &, long long, NBody::Particle *, long long, long long &, long long *, PropData *, long long): Average: 70.075 [GiB] Data: 72.269 [GiB] Dirty: 0 [B] Library: 0 [B] Peak: 81.744 [GiB] Resident: 68.478 [GiB] Shared: 8.734 [MiB] Size: 72.370 [GiB] Text: 4.180 [MiB]
[1795.701] [debug] substructureproperties.cxx:42 Getting CM
[1795.702] [debug] substructureproperties.cxx:320 Done getting CM in 1 [ms]
[1795.702] [debug] substructureproperties.cxx:4621 Getting energy
[1795.703] [debug] substructureproperties.cxx:4733 Have calculated potentials in 744 [us]
[1795.704] [debug] substructureproperties.cxx:5034 Done getting energy in 1 [ms]
[1795.704] [debug] substructureproperties.cxx:338 Getting bulk properties
[1795.706] [debug] substructureproperties.cxx:2194 Done getting properties in 1 [ms]
[1795.706] [debug] substructureproperties.cxx:3219 Done FOF masses in 4 [us]
[1795.706] [debug] substructureproperties.cxx:3236 Get inclusive masses
[1795.706] [debug] substructureproperties.cxx:3237 with masses based on full SO search (slower) for halos only

Line where it segfaults:

[two screenshots attached in the original issue showing the source line where the segfault occurs]

rtobar commented 2 years ago

@stuartmcalpine your crash actually looks like a different problem from the one discussed in this GitHub issue, but of course that doesn't mean it isn't important to fix. Could you please open a new issue with the details so it doesn't get lost? A descriptive title and a link to your comment above should suffice; there's no need to duplicate the whole text/attachments.

For reference, this looks strikingly similar to an issue reported in https://github.com/ICRAR/VELOCIraptor-STF/issues/78 and fixed in https://github.com/ICRAR/VELOCIraptor-STF/commit/53c02896b1d371405f723fa4ccfa86c81f35fef5. It seems the same situation is happening here: zero values are being written into a zero-size vector because an extra condition is not checked, but the write probably isn't needed in the first place. Once the new issue is created on GitHub I'll take a closer look.
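
For illustration, here is a minimal hypothetical C++ sketch of the pattern described above; the function and variable names are invented and this is not the actual VELOCIraptor source:

#include <cstddef>
#include <vector>

// Hypothetical sketch of the failure mode described above: the loop bound
// comes from a halo count, but the result vector can legitimately have size
// zero, so the unguarded writes run off the end of an empty vector.
void write_so_masses_buggy(std::size_t num_halos, std::vector<double> &so_masses)
{
    for (std::size_t i = 0; i < num_halos; ++i)
        so_masses[i] = 0.0;   // out-of-bounds write when so_masses.size() < num_halos
}

// Guarded version in the spirit of the referenced fix: check the extra
// condition and skip the write entirely, since filling with zeroes is not
// needed when there is nothing to process.
void write_so_masses_fixed(std::size_t num_halos, std::vector<double> &so_masses)
{
    if (num_halos == 0 || so_masses.size() < num_halos)
        return;
    for (std::size_t i = 0; i < num_halos; ++i)
        so_masses[i] = 0.0;
}

int main()
{
    std::vector<double> empty;          // zero-size result vector
    write_so_masses_fixed(3, empty);    // safely does nothing
    // write_so_masses_buggy(3, empty); // would write out of bounds
    return 0;
}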