JBorrow opened this issue 3 years ago
@JBorrow do you have the git SHA of an earlier revision where this worked?
Nothing helpful, the version I usually use is quite old.
This is most probably the same issue underlying #37. It should in principle be the same problem, except that now it is caught earlier and with a different error message (added since #37 was reported, so that the issue is picked up earlier at runtime and with a more meaningful explanation). Please check whether the workaround described in the last messages in that ticket (disabling OpenMP-based FOF) removes the problem. Note also that this seems to be a problem only when compiling without MPI support (-DVR_MPI=OFF).
@JBorrow you say the version you usually use is quite old, and that in itself could be helpful. If you could report here exactly which version that was, it would help in finding out the real underlying issue.
I am not entirely sure it is a duplicate of #37, but it might be. The reason I say this is that the code runs fine with fee3a9f7c472fdd2a3a42aa213ee1f9a8b167d8b but breaks on cb4336de8421b7bada3e158c44755c41e9fab78b, and both of these are fairly recent versions.
The error message is
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue
If you need full info, the first one was compiled here:
/cosma7/data/dp004/jlvc76/VELOCIraptor/VELOCIraptor-STF/build
and the second one here:
/snap7/scratch/dp004/jlvc76/VELOCIraptor/VELOCIraptor-STF/build
Both cases were compiled with Intel 2020. Both have VR_OPENMP switched ON and VR_MPI switched OFF. Both are run using 28 OMP threads on a single node.
Input is
/snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/eagle_0036.hdf5
/snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/vrconfig_3dfof_subhalos_SO_hydro.cfg
Happy to copy anything over to other places or give you more info if needed.
@MatthieuSchaller interesting... if a previous build worked and this one doesn't then I introduced a regression. My guess is that the diagnosis is correct; i.e., there are particles with non-positive densities, leading to -inf potentials, but in your case these do not make it to DetermineDenVRatioDistribution, where the crash reported in #37 happened. I still need to corroborate this, so for the time being it's just a guess.
I'm happy to revert the change that adds the check for non-positive densities. That would remove this regression, but OTOH it could mean we are silently doing wrong calculations.
Yes, agreed, it's not ideal. Just thought this might help track things down somehow.
@JBorrow does the latest master without OMP but with MPI give you an alternative that is fast enough?
@JBorrow @MatthieuSchaller please see https://github.com/ICRAR/VELOCIraptor-STF/issues/37#issuecomment-749419902 for a very likely potential fix for this problem. If that fix makes this issue disappear then we can close this ticket.
I'm trying this out in cosma6 at the moment. Sadly the fix mentioned in https://github.com/ICRAR/VELOCIraptor-STF/issues/37#issuecomment-749419902 does not seem to address the problem reported in this issue, although it's probably very similar in nature. At the moment I'm trying to gather more information and seeing if there's anything obvious around https://github.com/ICRAR/VELOCIraptor-STF/commit/8b0cf42a686808d14e9ede750dec57a994c14a87, 9a21987 and https://github.com/ICRAR/VELOCIraptor-STF/commit/08595f84a0a5858962320e50dfc5016b9aea4666 that could help here.
I'm commenting here to leave a trail: in #60 yet another instance of this problem has been reported. Details can be found throughout the comments there, including build configuration details.
Another instance of this issue with hydro runs was reported in https://github.com/ICRAR/VELOCIraptor-STF/issues/37#issuecomment-781771823 for a zoom, non-MPI execution.
@MatthieuSchaller with the recent surge of errors due to negative potentials (from the check I added as part of #37), I'm starting to think I should turn the check off, or at least turn it into a warning instead of an unrecoverable error, or even make this behaviour configurable. In https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-739760984 I argued that invalid results might end up being generated because of this, but otherwise this check seems to be affecting too many areas. Thoughts?
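To make the trade-off concrete, a configurable version of the check could look roughly like this (a sketch only; the mode enum and function below are illustrative, not existing VELOCIraptor options or code):

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical severity modes for the density check (illustrative only).
enum class DensityCheckMode { Off, Warn, Error };

// Sketch: react to a non-positive density according to the configured mode.
void CheckDensityPositive(double density, DensityCheckMode mode)
{
    if (density > 0 || mode == DensityCheckMode::Off)
        return;
    if (mode == DensityCheckMode::Warn) {
        std::cerr << "Warning: particle density not positive\n";
        return;
    }
    throw std::runtime_error("Particle density not positive, cannot continue");
}
```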
Edit: I just double-checked with Pascal, and he confirmed this check is correct; i.e., all particles at this point of the code should have properly-defined densities.
That would be a worry then. If we can get to that point and find particles that have negative densities it means something went wrong.
I suppose if we know it only happens when OpenMP is on, then this issue can be closed here and taken over by the wider OMP issue.
@MatthieuSchaller do you know of any (hopefully small) dataset + configuration file that can be used to consistently reproduce this error? I went through all the reported occurrences and either the original input and configuration files are gone, or the error occurred as part of a SWIFT + VR run, so I can't find an easy way to actually trigger it.
I can try to reason a bit more about the code and try to advance like that, but having a small, reproducible example would be great.
That's unfortunate. Let me dig out some older test case that was backed up.
Sorry @rtobar; my config at /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs/Run_0 crashes. This just happened now, using the latest master.
Could you expand on the modules and cmake flags used?
Currently Loaded Modulefiles:
1) python/3.6.5 5) parallel_hdf5/1.8.20 9) cmake/3.18.1
2) ffmpeg/4.0.2 6) gsl/2.4(default)
3) intel_comp/2018(default) 7) fftw/3.3.7(default)
4) intel_mpi/2018 8) parmetis/4.0.3(default)
cmake -DVR_USE_HYDRO=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_BUILD_TYPE=Release ..; make -j
Can you try disabling OpenMP?
Yeah, it works fine if you run with MPI only. That's an okay fix for now - it's actually faster for my 25s - but the behaviour should not be different between this and OMP.
Ok, that's interesting. Some of the cases above were breaking both with and without OMP.
Okay - I was wrong - I get the negative densities in pure MPI only mode (/cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/Run_0, snapshot 0000).
Do we have a way of knowing what the particle is, or what object it belongs to, to try to understand what the offending setup is?
Yes, in hindsight I should have added more information to the error message. I'll do that and push a commit straight to the master branch.
BTW, I tried replicating the issue with the data, settings and compilation flags used by @JBorrow, but without success so far. I tried on a local system with 20 cores, using a few combinations of OpenMP threads and MPI ranks (1x20, 3x6, 20x1), and none of them produced the error. Maybe I missed something obvious, so I'll re-try tomorrow, but otherwise this will be a bit harder to track down than anticipated. I also briefly tried to reproduce it in cosma, but the queues were full and I was trying to go for an interactive job -- next time I'll just submit a job instead.
Thanks. This is small enough to run on the login nodes if that helps (though I will deny having ever said that when questioned by the admin police...)
Could it be the fact I'm using Intel 2018?
Worth checking. I'd think the MPI implementation should make no difference but the compiler may come packed with a different version of OMP.
OK, some more data points after some experiments in cosma6 with the input and configuration files pointed out by @JBorrow. I tried different variations of MPI/OpenMP ranks/threads in an MPI+OpenMP enabled build, which yielded different results. This is congruent with the results of my local experiments for the same inputs, where different rank/thread combinations yielded different outcomes, so the problem clearly depends on the partitioning of the problem.
For the failing 28x1 (MPI ranks x OpenMP threads) combination, built from the latest master branch, here's the main extract from the logfile (the rank/thread numbers are 0-based):
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 1/1267: id=1, pid=11349692, type=1, pos=(-0.00781215, 0.000332759, -0.0124483), vel=(11.6557, -4.83983, 74.4938), mass=0.000969505, density=0
Information for execution context: MPI enabled=yes, rank=16/28, OpenMP enabled=yes, thread=0/1
[0000] [1445.514] [ info] io.cxx:1292 Saving SO particle lists to halos-2.catalog_SOlist.0
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
/var/slurm/slurmd/job2998665/slurm_script: line 4: 16915 Aborted builds/53/stf -i /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/Run_0/eagle_0000 -C /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/vrconfig_3dfof_subhalos_SO_hydro.cfg -I 2 -o halos-2
This looks conspicuously close to the work added in #57, so it might be a new problem. It happens within the WriteSOCatalog function, so it's hopefully easy to fix. I've opened #71 to keep track of it.
EDIT: added 1x28 results, further constrained new error, added link to new issue.
To avoid the issue in #71 I tried running the experiments with the latest master that didn't have the changes introduced in #57, but otherwise had all the other fixes we've introduced since we started working on this problem.
The summary is that all but the 28x1 combination ran to completion. I don't think this sheds more light other than confirming that the issue doesn't depend only on the inputs, but also on how the domain decomposition is handled. All logs are available below, but here's a summary of the successes:
[rtobar@bolano ~]---> grep "VELOCIraptor finished in" mpi-*
mpi-1-openmp-20.log:[0000] [1310.405] [ info] main.cxx:572 VELOCIraptor finished in 21.839 [min]
mpi-1-openmp-28.log:[0000] [1341.484] [ info] main.cxx:572 VELOCIraptor finished in 22.357 [min]
mpi-4-openmp-7.log:[0000] [ 709.688] [ info] main.cxx:572 VELOCIraptor finished in 11.828 [min]
mpi-4-openmp-7.log:[0001] [ 709.720] [ info] main.cxx:572 VELOCIraptor finished in 11.828 [min]
mpi-4-openmp-7.log:[0002] [ 709.760] [ info] main.cxx:572 VELOCIraptor finished in 11.829 [min]
mpi-4-openmp-7.log:[0003] [ 709.778] [ info] main.cxx:572 VELOCIraptor finished in 11.829 [min]
mpi-7-openmp-4.log:[0000] [ 642.523] [ info] main.cxx:572 VELOCIraptor finished in 10.708 [min]
mpi-7-openmp-4.log:[0003] [ 642.540] [ info] main.cxx:572 VELOCIraptor finished in 10.708 [min]
mpi-7-openmp-4.log:[0005] [ 642.546] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0001] [ 642.573] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0002] [ 642.583] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0004] [ 642.581] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0006] [ 642.667] [ info] main.cxx:572 VELOCIraptor finished in 10.711 [min]
The failure in 28x1 is exactly the same as in the previous test, so at least it is fully reproducible:
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 1/1267: id=1, pid=11349692, type=1, pos=(-0.00781215, 0.000332759, -0.0124483), vel=(11.6557, -4.83983, 74.4938), mass=0.000969505, density=0
Information for execution context: MPI enabled=yes, rank=16/28, OpenMP enabled=yes, thread=0/1
Logs: mpi-7-openmp-4.log, mpi-4-openmp-7.log, mpi-28-openmp-1.log, mpi-1-openmp-28.log, mpi-1-openmp-20.log
And just another update: this can also be reproduced in a few seconds with much smaller files and fewer ranks (13 was the minimum for this input/configuration in cosma):
$> OMP_NUM_THREADS=1 salloc -p cosma6 -A dp004 -N 1 --exclusive -t 1:00:00 mpirun -np 13 builds/53/stf -C /cosma7/data/dp004/jlvc76/BAHAMAS/Roi_run/vrconfig_3dfof_subhalos_SO_hydro.cfg -i /cosma7/data/dp004/jlvc76/BAHAMAS/Roi_run/baham_0036 -I 2 -o halos
....
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 2473/6517: id=2473, pid=656947, type=1, pos=(0.733834, 0.93133, -0.427506), vel=(-274.327, -45.4808, 313.547), mass=0.552186, density=0
Information for execution context: MPI enabled=yes, rank=0/13, OpenMP enabled=yes, thread=0/1
...
I can reproduce this locally too, so hopefully this will open the door to proper debugging and further advances.
How is it possible that the particles have negative positions?
We box-wrap when writing the snapshots so that should be clean. The numbers also look too large for a rounding error.
Yes, I've checked in the snapshots and there are no particles outside of the box.
Could it be an error in the size of the box (as this only seems to happen to me, anecdotally, for z!=0) and then an incorrect re-wrap?
The last example above is at z=0.
Ah. Sorry, you're right. The original example is at z=0, but the recent ones have all been at z=5. VR then happily runs on the z=3 snapshots from the same simulations, though.
I remember Pascal telling me once that the position field in a Particle object is not a fixed value; during the lifetime of a run positions are continuously updated, for example to be relative to the centre of gravity of the group, or to other reference points.
So that may not be the smoking gun we think it is. I suppose if the particles are put in the frame of the centre of mass then half the particles or so will have negative coordinates.
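To illustrate the point (a toy sketch only, assuming positions get re-expressed relative to some reference point and wrapped back into the periodic box; this is not the actual VR code):

```cpp
// Toy example: once a coordinate is made relative to a reference point
// (e.g. a group's centre of mass) and wrapped into the periodic box,
// roughly half of the resulting values are negative by construction.
double RecentreAndWrap(double x, double centre, double boxsize)
{
    double dx = x - centre;
    if (dx >= 0.5 * boxsize) dx -= boxsize;  // wrap into
    if (dx < -0.5 * boxsize) dx += boxsize;  // [-boxsize/2, boxsize/2)
    return dx;
}
```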
A small update: with the smaller test case mentioned in https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-791198771, I found that the particle in question has density=0 because it didn't have its density calculated, not because the result of the calculation was 0. This is a small but important piece of information, and it should help in finding out what's wrong.
I was also able to reproduce the error earlier in the code: instead of it failing in GetDenVRatio, I can see the same error for the same particle (same pid and type, but different id/pos/vel values because of the dynamic nature of these fields) on the same rank at the end of GetVelocityDensityApproximative, which is where densities are calculated. This brings the cause and effect much closer in time, hopefully making it easier to analyse the problem.
Interesting. Could this particle be far from the centre of its object and hence be part of the spherical over-density but not of any actual sub-structure?
The case /cosma/home/dc-borr1/c6dataspace/XL_wave_1/runs/Run_8 for snapshot 6 (z=0) is a new crashing case. This breaks for both MPI-only and OpenMP-only runs. VR version, config, log, etc. are all linked in the directory under the velociraptor_submit_0006.slurm file.
I could finally dedicate some more time today to this problem.
As mentioned previously in https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-800028795, some particles end up with density=0 because their density is not computed, not because the result of the computation is 0. Particles are not iterated over directly; instead they are first grouped into leafnode structures (I'm assuming these are the leaf nodes of the KD-tree containing the particles). The code then iterates over the leafnode structures, and for each leafnode the contained particles are iterated over.
Using the smaller reproducible example mentioned in https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-791198771 I've now observed how this plays out. When MPI is enabled, density calculation is a two-step process:
1. Each rank iterates over its local leafnode structures and over their particles. Before calculating densities it first checks whether a leafnode overlaps with particles from a different rank, and if so the leafnode is skipped. For the leafnodes that are not skipped, the contained particles have their densities calculated.
2. Particle information is exchanged between ranks in order to process the leafnodes that were skipped in the first step.
Note that the extra overlap checks and particle communication happen only if Local_velocity_density_approximate_calculation is 1; if a greater value is given (e.g. 2) then only the first step is taken and no local leafnode structures are skipped. So a workaround could be to set this value to 2, although presumably some results will change.
The problem I've reproduced happens when, during the second step, particle information is exchanged but a rank doesn't receive any particle information. In such cases the code explicitly skips any further calculations, and any leafnodes that were skipped in the first step are not processed, leading to particles whose densities are never calculated. This sounds like a bug either in the calculation of the overlaps (there are two flavours depending on whether MPI mesh decomposition is used; I'm using it), or in the code exchanging particle information. In other words, if an overlap is found then particle information is expected to arrive. Alternatively, it might be correct for no particle information to arrive (only to be sent) in certain scenarios, in which case the code should treat such leafnode structures as purely local and process them.
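For clarity, here is a rough sketch of the two-step flow as I understand it and of where it goes wrong (all structure and function names are illustrative, not the actual VELOCIraptor identifiers):

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for the real VR structures; all names here are illustrative.
struct LeafNode {
    bool overlapsOtherRank;            // overlaps another MPI domain?
    std::vector<long long> particles;  // indices of the contained particles
};

// Placeholder for the actual density estimate.
double EstimateDensity(long long /*i*/) { return 1.0; }

void TwoStepDensityCalculation(const std::vector<LeafNode> &nodes,
                               std::vector<double> &density,
                               std::size_t nimported)
{
    std::vector<const LeafNode *> deferred;

    // Step 1: purely local leaf nodes are processed immediately; nodes that
    // overlap another rank's domain are deferred to step 2.
    for (const auto &node : nodes) {
        if (node.overlapsOtherRank) {
            deferred.push_back(&node);
            continue;
        }
        for (long long i : node.particles) density[i] = EstimateDensity(i);
    }

    // Step 2: process the deferred nodes using imported particle data.
    // The behaviour I'm seeing is equivalent to this early exit: when no
    // particle data is received, the deferred nodes are never processed
    // and their particles keep density = 0.
    if (nimported == 0) return;  // <-- the problematic assumption
    for (const auto *node : deferred)
        for (long long i : node->particles) density[i] = EstimateDensity(i);
}
```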
I'm not really sure what's the best to do here. @pelahi, given the situation described above would you be able to point us in the right direction?
This analysis is consistent with at least the MPI-enabled crash @JBorrow reported in the comment above. There one can see:
[0026] [ 622.342] [debug] localfield.cxx:942 Searching particles in other domains 0
The giveaway here is the 0 (as in "no overlapping particles were imported into my rank"). This is what then leads to the fatal crash in rank 26:
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 1377/4476: id=1377, pid=9634564, type=1, pos=(0.0820553, -0.0173851, -0.0163657), vel=(13.4933, -32.6683, -5.64264), mass=0.000969505, density=0
Information for execution context: MPI enabled=yes, rank=26/28, OpenMP enabled=yes, thread=0/1
The crash in the non-MPI, OpenMP case might be yet a different problem.
Hi @rtobar , I'll look into this later today. I think it is likely a simple logic fix.
Thanks to @pelahi there is now a new commit on the issue-53 branch (371b685) that addresses the problem highlighted in my last comment. The underlying problem was that the code incorrectly assumed that if one of the leafnode structures was skipped during the first processing step, then at least some particle data had to be received from the other ranks. This assumption was incorrect: when no data was received but there were leafnode structures that had been skipped during the first step, those skipped structures still had to be processed locally during the second step. This fix applies to the MPI-enabled code path.
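For reference, the corrected behaviour amounts to something like the following sketch (again with illustrative names, not the actual code from 371b685): the deferred leaf nodes are processed regardless of whether any remote particle data arrived.

```cpp
#include <vector>

// Stand-ins mirroring the sketch in my earlier comment; illustrative only.
struct LeafNode {
    bool overlapsOtherRank;
    std::vector<long long> particles;
};

double EstimateDensity(long long /*i*/) { return 1.0; }  // placeholder

// Fixed second step: deferred leaf nodes are always processed, whether or
// not any particle data arrived from other ranks, so no particle is left
// with an uncalculated (zero) density.
void ProcessDeferredNodes(const std::vector<const LeafNode *> &deferred,
                          std::vector<double> &density)
{
    for (const auto *node : deferred)
        for (long long i : node->particles)
            density[i] = EstimateDensity(i);
}
```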
With this patch I can now run the small defective test case until the end, meaning that all particles have their densities calculated.
I also tried reproducing the last crash (MPI case) reported by @JBorrow, and with the patch I can get past the original problem. Later on during the execution the code crashes again, but that seems like an unrelated issue (more like #71 or #73, but I haven't dug into it).
With this latest fix I feel fairly confident the underlying problem is finally gone for the MPI-enabled cases, but there still remains the MPI-disabled problem @JBorrow is having, which I think has also appeared in a few other places. So while I'll put the latest fix on the master branch, I don't think we can call it a day here yet.
I like the sound of this!
Thanks for your hard work!
Just a quick update: I reproduced in cosma6 the MPI-disabled crash that @JBorrow experienced, with the same dataset. Here's a small backtrace:
(gdb) bt
#0 0x00007f9f04af2387 in raise () from /lib64/libc.so.6
#1 0x00007f9f04af3a78 in abort () from /lib64/libc.so.6
#2 0x00007f9f057261e5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007f9f05723fd6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007f9f05722f99 in __cxa_call_terminate (ue_header=ue_header@entry=0x7f99d4de90a0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#5 0x00007f9f05723908 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=2, exception_class=5138137972254386944, ue_header=0x7f99d4de90a0, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:676
#6 0x00007f9f050b5eb3 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x7f99d4de90a0, context=context@entry=0x7f9a9cbe09a0) at ../../../libgcc/unwind.inc:62
#7 0x00007f9f050b66de in _Unwind_Resume (exc=0x7f99d4de90a0) at ../../../libgcc/unwind.inc:230
#8 0x0000000000572846 in GetDenVRatio(Options&, long long, NBody::Particle*, long long, GridCell*, Math::Coordinate*, Math::Matrix*) ()
#9 0x0000000000571e4c in GetDenVRatio(Options&, long long, NBody::Particle*, long long, GridCell*, Math::Coordinate*, Math::Matrix*) ()
#10 0x00000000005b0390 in PreCalcSearchSubSet(Options&, long long, NBody::Particle*&, long long) ()
#11 0x000000000058eb50 in SearchSubSub(Options&, long long, std::vector<NBody::Particle, std::allocator<NBody::Particle> >&, long long*&, long long&, long long&, PropData*) ()
No surprises here, this is where the code has always been crashing. However, with the latest changes integrated into the master branch there is now a double-check for density != 0 on all particles at the end of GetVelocityDensityApproximative (the function where densities are calculated), so that we crash earlier rather than later. This means that the problem we are facing here is different: the particles that have density = 0 have it not because they were skipped by GetVelocityDensityApproximative, but because they never made it into the call. With more time I'll try to continue digging, and hopefully I'll also find a smaller, reproducible example that I can use locally to better pin down what's going on.
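For clarity, that double-check amounts to something like this at the end of the density routine (a sketch only, not the literal VELOCIraptor code):

```cpp
#include <stdexcept>

struct Particle { double density; };  // stand-in for NBody::Particle

// Verify every particle that went through the density calculation ended up
// with a positive density, so problems surface as early as possible.
void VerifyDensities(const Particle *parts, long long nparts)
{
    for (long long i = 0; i < nparts; i++)
        if (parts[i].density <= 0)
            throw std::runtime_error(
                "Particle density not positive, cannot continue.");
}
```

Particles that never entered the routine are of course not covered by such a loop, which is consistent with the crash still appearing later in GetDenVRatio.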
Another clue here, from running on COSMA-8 (which has 128 cores/node): in the MPI-only configuration the code was far more likely to crash with this problem when running on all 128 cores than on the 32 that I resubmitted with.
Hi, new to the party, but I'm finding this behaviour for commit 8f380fc. A difference here is that this is a zoom run, with VR configured with VR_ZOOM_SIM=ON (MPI off, OMP on) on cosma7. I've just found this issue and still need to go through the whole thread to see if any of the suggestions fix things, but thought it worth reporting here for now:
terminate called after throwing an instance of 'vr::non_positive_density'
what(): Particle density not positive, cannot continue.
Particle information: id=0, pid=2604476, type=1, pos=(-0.00540459, 0.00496019, -0.00370049), vel=(-4.63392, 13.0163, -20.7466), mass=0.000121167, density=0
Execution context: MPI enabled=no, OpenMP enabled=yes, thread=0/1
[ 145.889] [debug] search.cxx:2719 Substructure at sublevel 1 with 634 particles
Welcome to the party...
The current "workaround" to just get somewhere with production runs is to toggle MPI and OMP on/off. One combination might be lucky...
Apart from that, your simulation is likely quite small so could you give the directory on cosma, as well as config file? Might be a useful small test to see what may be going on. The more problematic examples we have the more likely we are to identify the issue.
> Welcome to the party...
> The current "workaround" to just get somewhere with production runs is to toggle MPI and OMP on/off. One combination might be lucky...
Thanks, configuring with MPI on seems to be working fine for me now.
> Apart from that, your simulation is likely quite small so could you give the directory on cosma, as well as config file? Might be a useful small test to see what may be going on. The more problematic examples we have the more likely we are to identify the issue.
Yes, the zooms seem to have a high hit rate with this bug (3/3 of the last ones I've tried); the one I was using last is here for now:
/cosma7/data/dp004/wmfw23/colibre_dust/runs/zooms/Auriga/h1/data/snapshot_0013.hdf5
Describe the bug
I've been trying to run the latest(ish) master of VR on some SWIFT outputs (on COSMA7), and I've been getting a couple of odd crashes.
To Reproduce
Version cb4336de8421b7bada3e158c44755c41e9fab78b.
Ran on snapshots under
/snap7/scratch/dp004/dc-borr1/new_randomness_runs/runs/Run_*
Environment (please complete the following information):
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native" -DVR_MPI=OFF -DVR_HDF5=ON -DVR_ALLOWPARALLELHDF5=ON -DVR_USE_HYDRO=ON