JBorrow opened this issue 3 years ago
@JBorrow do you have the git SHA of an earlier revision where this worked?
Nothing helpful, the version I usually use is quite old.
This is most probably the same issue underlying #37. It should in principle be the same problem, except that now it is caught earlier and with a different error message (added since #37 was reported, so that the issue is picked up earlier at runtime and with a more meaningful explanation). Please check whether the workaround described in the last messages in that ticket (disabling OpenMP-based FOF) removes the problem. Note also that this seems to be a problem only when compiling without MPI support (-DVR_MPI=OFF).
@JBorrow you say the version you usually use is quite old, and that in itself could be helpful. If you could report here exactly which version that was, it would help in finding out the real underlying issue.
I am not entirely sure it is a duplicate of #37, but it might be. The reason I say this is that the code runs fine with fee3a9f7c472fdd2a3a42aa213ee1f9a8b167d8b but breaks on cb4336de8421b7bada3e158c44755c41e9fab78b, and both of these are fairly recent versions.
The error message is
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue
If you need full info, the first one was compiled here:
/cosma7/data/dp004/jlvc76/VELOCIraptor/VELOCIraptor-STF/build
and the second one here:
/snap7/scratch/dp004/jlvc76/VELOCIraptor/VELOCIraptor-STF/build
Both cases were compiled with Intel 2020. Both have VR_OPENMP switched ON and VR_MPI switched OFF. Both are run using 28 OMP threads on a single node.
Input is
/snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/eagle_0036.hdf5
/snap7/scratch/dp004/jlvc76/SWIFT/EoS_tests/swiftsim/examples/EAGLE_ICs/EAGLE_25/vrconfig_3dfof_subhalos_SO_hydro.cfg
Happy to copy anything over to other places or give you more info if needed.
@MatthieuSchaller interesting... if a previous build worked and this one doesn't then I introduced a regression. My guess is that the diagnosis is correct; i.e., there are particles with non-positive densities, leading to -inf potentials, but in your case these do not make it to DetermineDenVRatioDistribution, where the crash reported in #37 happened. I still need to corroborate this, so for the time being it's just a guess.
I'm happy to revert the change that adds the check for non-positive densities. That would remove this regression, but OTOH it could mean we are silently doing wrong calculations.
Yes, agreed, it's not ideal. Just thought this might help track things down somehow.
@JBorrow does the latest master without OMP but with MPI give you an alternative that is fast enough?
@JBorrow @MatthieuSchaller please see https://github.com/ICRAR/VELOCIraptor-STF/issues/37#issuecomment-749419902 for a very likely potential fix for this problem. If that fix makes this issue disappear then we can close this ticket.
I'm trying this out in cosma6 at the moment. Sadly the fix mentioned in https://github.com/ICRAR/VELOCIraptor-STF/issues/37#issuecomment-749419902 does not seem to address the problem reported in this issue, although it's probably very similar in nature. At the moment I'm trying to gather more information and seeing if there's anything obvious around https://github.com/ICRAR/VELOCIraptor-STF/commit/8b0cf42a686808d14e9ede750dec57a994c14a87, 9a21987 and https://github.com/ICRAR/VELOCIraptor-STF/commit/08595f84a0a5858962320e50dfc5016b9aea4666 that could help here.
I'm commenting here to leave a trail: in #60 yet another instance of this problem has been reported. Details can be found throughout the comments there, including build configuration details.
Another instance of this issue with hydro runs was reported in https://github.com/ICRAR/VELOCIraptor-STF/issues/37#issuecomment-781771823 for a zoom, non-MPI execution.
@MatthieuSchaller with the recent surge of errors due to negative potentials (from the check I added as part of #37), I'm starting to think I should turn the check off, or at least turn it into a warning instead of an unrecoverable error, or even make this behaviour configurable. In https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-739760984 I argued that invalid results might end up being generated because of this, but otherwise this check seems to be affecting too many areas. Thoughts?
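To make the trade-off concrete, a configurable version of the check could look roughly like this (a sketch only; the mode enum and function below are illustrative, not existing VELOCIraptor options or code):

```cpp
#include <iostream>
#include <stdexcept>

// Hypothetical severity modes for the density check (illustrative only).
enum class DensityCheckMode { Off, Warn, Error };

// Sketch: react to a non-positive density according to the configured mode.
void CheckDensityPositive(double density, DensityCheckMode mode)
{
    if (density > 0 || mode == DensityCheckMode::Off)
        return;
    if (mode == DensityCheckMode::Warn) {
        std::cerr << "Warning: particle density not positive\n";
        return;
    }
    throw std::runtime_error("Particle density not positive, cannot continue");
}
```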
Edit: I just double-checked with Pascal, and he confirmed this check is correct; i.e., all particles at this point of the code should have properly-defined densities.
That would be a worry then. If we can get to that point and find particles that have negative densities it means something went wrong.
I suppose if we know it only happens when OpenMP is on, then this issue can be closed here and taken over by the wider OMP issue.
@MatthieuSchaller do you know of any (hopefully small) dataset + configuration file that can be used to consistently reproduce this error? I went through all the reported occurrences and either the original input and configuration files are gone, or the error occurred as part of a SWIFT + VR run, so I can't find an easy way to actually trigger it.
I can try to reason a bit more about the code and try to advance like that, but having a small, reproducible example would be great.
That's unfortunate. Let me dig out some older test case that was backed up.
Sorry @rtobar; my config at /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs/Run_0 crashes. This just happened now, using the latest master.
Could you expand on the modules and cmake flags used?
Currently Loaded Modulefiles:
1) python/3.6.5 5) parallel_hdf5/1.8.20 9) cmake/3.18.1
2) ffmpeg/4.0.2 6) gsl/2.4(default)
3) intel_comp/2018(default) 7) fftw/3.3.7(default)
4) intel_mpi/2018 8) parmetis/4.0.3(default)
cmake -DVR_USE_HYDRO=ON -DCMAKE_CXX_FLAGS="-fPIC" -DCMAKE_BUILD_TYPE=Release ..; make -j
Can you try disabling OpenMP?
Yeah, it works fine if you run with MPI only. That's an okay fix for now - it's actually faster for my 25s - but the behaviour should not be different between this and OMP.
Ok, that's interesting. Some of the cases above were breaking both with and without OMP.
Okay - I was wrong - I get the negative densities in pure MPI only mode (/cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/Run_0, snapshot 0000).
Do we have a way of knowing what the particle is, or what object it belongs to, to try to understand what the offending setup is?
Yes, in hindsight I should have added more information to the error message. I'll do that and push a commit straight to the master branch.
BTW, I tried replicating the issue with the data, settings and compilation flags used by @JBorrow, but without success so far. I tried on a local system with 20 cores, using a few combinations of OpenMP threads and MPI ranks (1x20, 3x6, 20x1), and none of them produced the error. Maybe I missed something obvious, so I'll re-try tomorrow, but otherwise this will be a bit harder to track down than anticipated. I also briefly tried to reproduce it in cosma, but the queues were full and I was trying to go for an interactive job -- next time I'll just submit a job instead.
Thanks. This is small enough to run on the login nodes if that helps (though I will deny having ever said that when questioned by the admin police...)
Could it be the fact I'm using Intel 2018?
Worth checking. I'd think the MPI implementation should make no difference but the compiler may come packed with a different version of OMP.
OK, some more data points after some experiments in cosma6 with the input and configuration files pointed out by @JBorrow. I tried different variations of MPI/OpenMP ranks/threads in an MPI+OpenMP enabled build, which yielded different results. This is congruent with the results of my local experiments for the same inputs, where different rank/thread combinations yielded different outcomes, so the problem clearly depends on the partitioning of the problem.
For the failing 28x1 (MPI ranks x OpenMP threads) combination, built from the latest master branch, here's the main extract from the logfile (the rank/thread numbers are 0-based):
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 1/1267: id=1, pid=11349692, type=1, pos=(-0.00781215, 0.000332759, -0.0124483), vel=(11.6557, -4.83983, 74.4938), mass=0.000969505, density=0
Information for execution context: MPI enabled=yes, rank=16/28, OpenMP enabled=yes, thread=0/1
[0000] [1445.514] [ info] io.cxx:1292 Saving SO particle lists to halos-2.catalog_SOlist.0
terminate called after throwing an instance of 'std::bad_alloc'
what(): std::bad_alloc
/var/slurm/slurmd/job2998665/slurm_script: line 4: 16915 Aborted builds/53/stf -i /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/Run_0/eagle_0000 -C /cosma/home/dc-borr1/c7dataspace/XL_wave_1/runs_correct_dt/vrconfig_3dfof_subhalos_SO_hydro.cfg -I 2 -o halos-2
This looks conspicuously close to the work added in #57, so it might be a new problem. It happens within the WriteSOCatalog function, so it's hopefully easy to fix. I've opened #71 to keep track of it.
EDIT: added 1x28 results, further constrained new error, added link to new issue.
To avoid the issue in #71 I tried running the experiments with the latest master that didn't have the changes introduced in #57, but otherwise had all the other fixes we've introduced since we started working on this problem.
The summary is that all but the 28x1 combination ran to completion. I don't think this sheds more light other than confirming that the issue doesn't depend only on the inputs, but also on how the domain decomposition is handled. All logs are available below, but here's a summary of the successes:
[rtobar@bolano ~]---> grep "VELOCIraptor finished in" mpi-*
mpi-1-openmp-20.log:[0000] [1310.405] [ info] main.cxx:572 VELOCIraptor finished in 21.839 [min]
mpi-1-openmp-28.log:[0000] [1341.484] [ info] main.cxx:572 VELOCIraptor finished in 22.357 [min]
mpi-4-openmp-7.log:[0000] [ 709.688] [ info] main.cxx:572 VELOCIraptor finished in 11.828 [min]
mpi-4-openmp-7.log:[0001] [ 709.720] [ info] main.cxx:572 VELOCIraptor finished in 11.828 [min]
mpi-4-openmp-7.log:[0002] [ 709.760] [ info] main.cxx:572 VELOCIraptor finished in 11.829 [min]
mpi-4-openmp-7.log:[0003] [ 709.778] [ info] main.cxx:572 VELOCIraptor finished in 11.829 [min]
mpi-7-openmp-4.log:[0000] [ 642.523] [ info] main.cxx:572 VELOCIraptor finished in 10.708 [min]
mpi-7-openmp-4.log:[0003] [ 642.540] [ info] main.cxx:572 VELOCIraptor finished in 10.708 [min]
mpi-7-openmp-4.log:[0005] [ 642.546] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0001] [ 642.573] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0002] [ 642.583] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0004] [ 642.581] [ info] main.cxx:572 VELOCIraptor finished in 10.709 [min]
mpi-7-openmp-4.log:[0006] [ 642.667] [ info] main.cxx:572 VELOCIraptor finished in 10.711 [min]
The failure in 28x1 is exactly the same as in the previous test, so at least it is fully reproducible:
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 1/1267: id=1, pid=11349692, type=1, pos=(-0.00781215, 0.000332759, -0.0124483), vel=(11.6557, -4.83983, 74.4938), mass=0.000969505, density=0
Information for execution context: MPI enabled=yes, rank=16/28, OpenMP enabled=yes, thread=0/1
Logs: mpi-7-openmp-4.log, mpi-4-openmp-7.log, mpi-28-openmp-1.log, mpi-1-openmp-28.log, mpi-1-openmp-20.log
And just another update: this can also be reproduced in a few seconds with much smaller files and fewer ranks (13 was the minimum for this input/configuration in cosma):
$> OMP_NUM_THREADS=1 salloc -p cosma6 -A dp004 -N 1 --exclusive -t 1:00:00 mpirun -np 13 builds/53/stf -C /cosma7/data/dp004/jlvc76/BAHAMAS/Roi_run/vrconfig_3dfof_subhalos_SO_hydro.cfg -i /cosma7/data/dp004/jlvc76/BAHAMAS/Roi_run/baham_0036 -I 2 -o halos
....
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 2473/6517: id=2473, pid=656947, type=1, pos=(0.733834, 0.93133, -0.427506), vel=(-274.327, -45.4808, 313.547), mass=0.552186, density=0
Information for execution context: MPI enabled=yes, rank=0/13, OpenMP enabled=yes, thread=0/1
...
I can reproduce this locally too, so hopefully this will open the door to proper debugging and further advances.
How is it possible that the particles have negative positions?
We box-wrap when writing the snapshots so that should be clean. The numbers also look too large for a rounding error.
Yes, I've checked in the snapshots and there are no particles outside of the box.
Could it be an error in the size of the box (as this only seems to happen to me, anecdotally, for z!=0) and then an incorrect re-wrap?
The last example above is at z=0.
Ah. Sorry, you're right. The original example is at z=0, but the recent ones have all been at z=5. VR then happily runs on the z=3 snapshots from the same simulations, though.
I remember Pascal telling me once that the position field in a Particle object is not a fixed value; during the lifetime of a run positions are continuously updated, for example to be relative to the centre of gravity of the group, or to other reference points.
So that may not be the smoking gun we think it is. I suppose if the particles are put in the frame of the centre of mass then half the particles or so will have negative coordinates.
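To illustrate the point (a toy sketch only, assuming positions get re-expressed relative to some reference point and wrapped back into the periodic box; this is not the actual VR code):

```cpp
// Toy example: once a coordinate is made relative to a reference point
// (e.g. a group's centre of mass) and wrapped into the periodic box,
// roughly half of the resulting values are negative by construction.
double RecentreAndWrap(double x, double centre, double boxsize)
{
    double dx = x - centre;
    if (dx >= 0.5 * boxsize) dx -= boxsize;  // wrap into
    if (dx < -0.5 * boxsize) dx += boxsize;  // [-boxsize/2, boxsize/2)
    return dx;
}
```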
A small update: with the smaller test case mentioned in https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-791198771, I found that the particle in question has density=0 because it didn't have its density calculated, not because the result of the calculation was 0. This is a small but important piece of information, and it should help in finding out what's wrong.
I was also able to reproduce the error earlier in the code: instead of it failing in GetDenVRatio, I can see the same error for the same particle (same pid and type, but different id/pos/vel values because of the dynamic nature of these fields) on the same rank at the end of GetVelocityDensityApproximative, which is where densities are calculated. This brings the cause and effect much closer in time, hopefully making it easier to analyse the problem.
Interesting. Could this particle be far from the centre of its object and hence be part of the spherical over-density but not of any actual sub-structure?
The case /cosma/home/dc-borr1/c6dataspace/XL_wave_1/runs/Run_8 for snapshot 6 (z=0) is a new crashing case. This breaks for both MPI-only and OpenMP-only runs. VR version, config, log, etc. are all linked in the directory under the velociraptor_submit_0006.slurm file.
I could finally dedicate some more time today to this problem.
As mentioned previously in https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-800028795, some particles end up with density=0 because their density is not computed, not because the result of the computation is 0. Particles are not iterated over directly; instead they are first grouped into leafnode structures (I'm assuming these are the leaf nodes of the KD-tree containing the particles). The code then iterates over the leafnode structures, and for each leafnode the contained particles are iterated over.
Using the smaller reproducible example mentioned in https://github.com/ICRAR/VELOCIraptor-STF/issues/53#issuecomment-791198771 I've now observed how this plays out. When MPI is enabled, density calculation is a two-step process:
1. Each rank iterates over its local leafnode structures and over their particles. Before calculating densities it first checks whether a leafnode overlaps with particles from a different rank, and if so the leafnode is skipped. For the leafnodes that are not skipped, the contained particles have their densities calculated.
2. Particle information is exchanged between ranks in order to process the leafnodes that were skipped in the first step.
Note that the extra overlap checks and particle communication happen only if Local_velocity_density_approximate_calculation is 1; if a greater value is given (e.g. 2) then only the first step is taken and no local leafnode structures are skipped. So a workaround could be to set this value to 2, although presumably some results will change.
The problem I've reproduced happens when, during the second step, particle information is exchanged but a rank doesn't receive any particle information. In such cases the code explicitly skips any further calculations, and any leafnodes that were skipped in the first step are not processed, leading to particles whose densities are never calculated. This sounds like a bug either in the calculation of the overlaps (there are two flavours depending on whether MPI mesh decomposition is used; I'm using it), or in the code exchanging particle information. In other words, if an overlap is found then particle information is expected to arrive. Alternatively, it might be correct for no particle information to arrive (only to be sent) in certain scenarios, in which case the code should treat such leafnode structures as purely local and process them.
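For clarity, here is a rough sketch of the two-step flow as I understand it and of where it goes wrong (all structure and function names are illustrative, not the actual VELOCIraptor identifiers):

```cpp
#include <cstddef>
#include <vector>

// Stand-ins for the real VR structures; all names here are illustrative.
struct LeafNode {
    bool overlapsOtherRank;            // overlaps another MPI domain?
    std::vector<long long> particles;  // indices of the contained particles
};

// Placeholder for the actual density estimate.
double EstimateDensity(long long /*i*/) { return 1.0; }

void TwoStepDensityCalculation(const std::vector<LeafNode> &nodes,
                               std::vector<double> &density,
                               std::size_t nimported)
{
    std::vector<const LeafNode *> deferred;

    // Step 1: purely local leaf nodes are processed immediately; nodes that
    // overlap another rank's domain are deferred to step 2.
    for (const auto &node : nodes) {
        if (node.overlapsOtherRank) {
            deferred.push_back(&node);
            continue;
        }
        for (long long i : node.particles) density[i] = EstimateDensity(i);
    }

    // Step 2: process the deferred nodes using imported particle data.
    // The behaviour I'm seeing is equivalent to this early exit: when no
    // particle data is received, the deferred nodes are never processed
    // and their particles keep density = 0.
    if (nimported == 0) return;  // <-- the problematic assumption
    for (const auto *node : deferred)
        for (long long i : node->particles) density[i] = EstimateDensity(i);
}
```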
I'm not really sure what's the best to do here. @pelahi, given the situation described above would you be able to point us in the right direction?
This analysis is consistent with at least the MPI-enabled crash @JBorrow reported in the comment above. There one can see:
[0026] [ 622.342] [debug] localfield.cxx:942 Searching particles in other domains 0
The giveaway here is the 0 (as in "no overlapping particles were imported into my rank"). This is what then leads to the fatal crash in rank 26:
terminate called after throwing an instance of 'std::runtime_error'
what(): Particle density not positive, cannot continue.
Information for particle 1377/4476: id=1377, pid=9634564, type=1, pos=(0.0820553, -0.0173851, -0.0163657), vel=(13.4933, -32.6683, -5.64264), mass=0.000969505, density=0
Information for execution context: MPI enabled=yes, rank=26/28, OpenMP enabled=yes, thread=0/1
The crash in the non-MPI, OpenMP case might be yet a different problem.
Hi @rtobar , I'll look into this later today. I think it is likely a simple logic fix.
Thanks to @pelahi there is now a new commit on the issue-53 branch (371b685) that addresses the problem highlighted in my last comment. The underlying problem was that the code incorrectly assumed that if one of the leafnode structures was skipped during the first processing step, then at least some particle data had to be received from the other ranks. This assumption was incorrect: when no data was received but there were leafnode structures that had been skipped during the first step, those skipped structures still had to be processed locally during the second step. This fix applies to the MPI-enabled code path.
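For reference, the corrected behaviour amounts to something like the following sketch (again with illustrative names, not the actual code from 371b685): the deferred leaf nodes are processed regardless of whether any remote particle data arrived.

```cpp
#include <vector>

// Stand-ins mirroring the sketch in my earlier comment; illustrative only.
struct LeafNode {
    bool overlapsOtherRank;
    std::vector<long long> particles;
};

double EstimateDensity(long long /*i*/) { return 1.0; }  // placeholder

// Fixed second step: deferred leaf nodes are always processed, whether or
// not any particle data arrived from other ranks, so no particle is left
// with an uncalculated (zero) density.
void ProcessDeferredNodes(const std::vector<const LeafNode *> &deferred,
                          std::vector<double> &density)
{
    for (const auto *node : deferred)
        for (long long i : node->particles)
            density[i] = EstimateDensity(i);
}
```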
With this patch I can now run the small defective test case until the end, meaning that all particles have their densities calculated.
I also tried reproducing the last crash (MPI case) reported by @JBorrow, and with the patch I can get past the original problem. Later on during the execution the code crashes again, but that seems like an unrelated issue (more like #71 or #73, but I haven't dug into it).
With this latest fix I feel fairly confident the underlying problem is finally gone for the MPI-enabled cases, but there still remains the MPI-disabled problem @JBorrow is having, which I think has also appeared in a few other places. So while I'll put the latest fix on the master branch, I don't think we can call it a day here yet.
I like the sound of this!
Thanks for your hard work!
Just a quick update: I reproduced in cosma6 the MPI-disabled crash that @JBorrow experienced, with the same dataset. Here's a small backtrace:
(gdb) bt
#0 0x00007f9f04af2387 in raise () from /lib64/libc.so.6
#1 0x00007f9f04af3a78 in abort () from /lib64/libc.so.6
#2 0x00007f9f057261e5 in __gnu_cxx::__verbose_terminate_handler () at ../../../../libstdc++-v3/libsupc++/vterminate.cc:95
#3 0x00007f9f05723fd6 in __cxxabiv1::__terminate (handler=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_terminate.cc:47
#4 0x00007f9f05722f99 in __cxa_call_terminate (ue_header=ue_header@entry=0x7f99d4de90a0) at ../../../../libstdc++-v3/libsupc++/eh_call.cc:54
#5 0x00007f9f05723908 in __cxxabiv1::__gxx_personality_v0 (version=<optimized out>, actions=2, exception_class=5138137972254386944, ue_header=0x7f99d4de90a0, context=<optimized out>) at ../../../../libstdc++-v3/libsupc++/eh_personality.cc:676
#6 0x00007f9f050b5eb3 in _Unwind_RaiseException_Phase2 (exc=exc@entry=0x7f99d4de90a0, context=context@entry=0x7f9a9cbe09a0) at ../../../libgcc/unwind.inc:62
#7 0x00007f9f050b66de in _Unwind_Resume (exc=0x7f99d4de90a0) at ../../../libgcc/unwind.inc:230
#8 0x0000000000572846 in GetDenVRatio(Options&, long long, NBody::Particle*, long long, GridCell*, Math::Coordinate*, Math::Matrix*) ()
#9 0x0000000000571e4c in GetDenVRatio(Options&, long long, NBody::Particle*, long long, GridCell*, Math::Coordinate*, Math::Matrix*) ()
#10 0x00000000005b0390 in PreCalcSearchSubSet(Options&, long long, NBody::Particle*&, long long) ()
#11 0x000000000058eb50 in SearchSubSub(Options&, long long, std::vector<NBody::Particle, std::allocator<NBody::Particle> >&, long long*&, long long&, long long&, PropData*) ()
No surprises here, this is where the code has always been crashing. However, with the latest changes integrated into the master branch there is now a double-check for density != 0 on all particles at the end of GetVelocityDensityApproximative (the function where densities are calculated), so that we crash earlier rather than later. This means that the problem we are facing here is different: the particles that have density = 0 have it not because they were skipped by GetVelocityDensityApproximative, but because they never made it into the call. With more time I'll try to continue digging, and hopefully I'll also find a smaller, reproducible example that I can use locally to better pin down what's going on.
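For clarity, that double-check amounts to something like this at the end of the density routine (a sketch only, not the literal VELOCIraptor code):

```cpp
#include <stdexcept>

struct Particle { double density; };  // stand-in for NBody::Particle

// Verify every particle that went through the density calculation ended up
// with a positive density, so problems surface as early as possible.
void VerifyDensities(const Particle *parts, long long nparts)
{
    for (long long i = 0; i < nparts; i++)
        if (parts[i].density <= 0)
            throw std::runtime_error(
                "Particle density not positive, cannot continue.");
}
```

Particles that never entered the routine are of course not covered by such a loop, which is consistent with the crash still appearing later in GetDenVRatio.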
Another clue here, from running on COSMA-8 (which has 128 cores/node): in the MPI-only configuration the code was far more likely to crash with this problem when running on all 128 cores than on the 32 that I resubmitted with.
Hi, new to the party, but I'm finding this behaviour for commit 8f380fc. A difference here is that this is a zoom run, with VR configured with VR_ZOOM_SIM=ON (MPI off, OMP on) on cosma7. I've just found this issue and still need to go through the whole thread to see if any of the suggestions fix things, but thought it worth reporting here for now:
terminate called after throwing an instance of 'vr::non_positive_density'
what(): Particle density not positive, cannot continue.
Particle information: id=0, pid=2604476, type=1, pos=(-0.00540459, 0.00496019, -0.00370049), vel=(-4.63392, 13.0163, -20.7466), mass=0.000121167, density=0
Execution context: MPI enabled=no, OpenMP enabled=yes, thread=0/1
[ 145.889] [debug] search.cxx:2719 Substructure at sublevel 1 with 634 particles
Welcome to the party...
The current "workaround" to just get somewhere with production runs is to toggle MPI and OMP on/off. One combination might be lucky...
Apart from that, your simulation is likely quite small so could you give the directory on cosma, as well as config file? Might be a useful small test to see what may be going on. The more problematic examples we have the more likely we are to identify the issue.
> Welcome to the party...
> The current "workaround" to just get somewhere with production runs is to toggle MPI and OMP on/off. One combination might be lucky...
Thanks, configuring with MPI on seems to be working fine for me now.
> Apart from that, your simulation is likely quite small so could you give the directory on cosma, as well as config file? Might be a useful small test to see what may be going on. The more problematic examples we have the more likely we are to identify the issue.
Yes, the zooms seem to have a high hit rate with this bug (3/3 of the last ones I've tried); the one I was using last is here for now:
/cosma7/data/dp004/wmfw23/colibre_dust/runs/zooms/Auriga/h1/data/snapshot_0013.hdf5
Describe the bug
I've been trying to run the latest(ish) master of VR on some SWIFT outputs (on COSMA7), and I've been getting a couple of odd crashes.
To Reproduce
Version cb4336de8421b7bada3e158c44755c41e9fab78b.
Ran on snapshots under
/snap7/scratch/dp004/dc-borr1/new_randomness_runs/runs/Run_*
Environment (please complete the following information):
cmake .. -DCMAKE_CXX_FLAGS="-O3 -march=native" -DVR_MPI=OFF -DVR_HDF5=ON -DVR_ALLOWPARALLELHDF5=ON -DVR_USE_HYDRO=ON