gandalfcode / gandalf

GANDALF (Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids)
GNU General Public License v2.0

nose2 unit tests with async problem #181

Closed dhubber closed 6 years ago

dhubber commented 6 years ago

For some branches/pull requests (e.g. improve_openmp2 and improve_ics3), some unit tests are failing completely, in particular the freefall and rad_ws collapse tests. These tests use the 'CreateTimeData' function to collect data from the snapshots and then compute the error via error norms. However, for some reason these branches, despite containing (as far as I'm aware) only changes that are completely independent of the python side, are not collecting the data correctly from the snapshots.

I reproduced this problem on my laptop and managed to track it down to the 'p = run_async() ... p.wait()' sequence. If I simply replace this with 'run()', it all works fine, but obviously that does not help with the unit tests, which I believe need the 'run_async()' function. It is quite possible this is the only reason these branches are failing the Travis tests, but I can't be sure until this problem is fixed! Any ideas??
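
For reference, a minimal sketch of the pattern these tests follow (the facade module path, newsim, the parameter file and the "rho" quantity are assumptions for illustration; run(), run_async(), p.wait() and CreateTimeData are the calls actually in question):

```python
# Sketch only: newsim(), the module path, the parameter file and the "rho"
# quantity are assumptions, not the exact test code; run_async()/p.wait()/
# CreateTimeData are the calls discussed in this issue.
from gandalf.analysis.facade import newsim, run_async, CreateTimeData

sim = newsim("tests/freefall.dat")   # set up the test simulation (hypothetical file)
p = run_async()                      # launch the run; snapshots are written to disk
p.wait()                             # block until the run has finished

# Collect the time evolution of some quantity from the snapshots and compare
# it against the reference solution via an error norm (exact signature and
# tolerance are test-specific).
data = CreateTimeData("rho")
# assert error_norm(data) < TOLERANCE
```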

rbooth200 commented 6 years ago

Hi David, have you changed the format to sf instead of su? That would be a mistake, because sf doesn't work with MPI.

rbooth200 commented 6 years ago

The other possibility is that the read/write of the data is broken: run_async runs the code and then loads the data back from file, whereas run does not.

dhubber commented 6 years ago

Yes, I did change it actually. I knew the column/ascii format did not work, but why doesn't sf work? I thought that was working perfectly fine. Anything we know doesn't work like this should definitely throw an exception if we don't have time to fix it!

But actually that can't be the (only) reason, since when I reproduced this on my laptop I was compiling with OpenMP only. The improve_openmp2 branch does fail only with MPI runs, so perhaps this is the cause there, but not for the other branch. I'll look into it now.

dhubber commented 6 years ago

ok, on my laptop, despite only compiling with OpenMP and no MPI, switching back to su format fixed it for the improve_ics3 branch! So if there's a problem with the sf format, it's not just with MPI it seems. Very strange!

rbooth200 commented 6 years ago

It would be great if you could fix this...

dhubber commented 6 years ago

OK, changing back to su format did not fix the problem on Travis, and it breaks for everything (as before): serial, OpenMP, MPI and hybrid, so I doubt it's a parallel read/write issue. Any async-python stuff is above my pay grade, so any ideas?? :-)

I can see now that when you say there's no MPI with sf, you mean there's no MPI sf write function at all, not that it doesn't work! I guess that one slipped under my radar. It's still weird that it all worked fine on my laptop (without MPI), because sf seems to work fine, both reading and writing! Could it be down to the difference in time taken to read sf compared to su? If the async is not waiting properly for the read, it could just be pure luck that su works, while sf takes too long and goes wrong? Clutching at straws perhaps!
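
If it helps to rule this timing idea in or out, one crude check would be to force a pause between the wait and the data collection (a sketch only, reusing the assumed setup from the sketch in the issue description; the sleep length is arbitrary):

```python
import time
# Same assumed setup as before: newsim() and the parameter file are placeholders.
from gandalf.analysis.facade import newsim, run_async, CreateTimeData

sim = newsim("tests/freefall.dat")
p = run_async()
p.wait()

# If adding this pause makes the sf runs pass, p.wait() is returning before the
# last snapshot has been fully written/flushed; if they still fail, the timing
# hypothesis can be discarded.
time.sleep(5.0)   # arbitrary, deliberately generous pause

data = CreateTimeData("rho")   # same collection step as in the tests
```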

rbooth200 commented 6 years ago

From what I can see on improve_openmp2, this has fixed everything except the rad_ws test, which is known to be fragile (the test value is super-sensitive to things like the number of steps taken).

Otherwise, the problems have gone away - I genuinely believe the bug is in the read/write.

dhubber commented 6 years ago

From what I can see on improve_openmp2, this has fixed everything except the rad_ws test, which is known to be fragile (the test value is super-sensitive to things like the number of steps taken).

Yes, this is what seems to be happening in the improve_openmp2 branch: it is only just failing the error norm, so MPI-induced fluctuations are enough to trigger it. But the improve_ics3 branch is different, and I don't know why this bug, apparently unrelated to the pull request commits, is affecting that one.

I genuinely believe the bug is in the read/write.

Do you mean a read/write bug in the C++ code or the python code? I can't see any problem in the C++ code (or when I use run() instead of run_async()), so it has to be a python issue, surely?

rbooth200 commented 6 years ago

In the C++ code for the sf format. Perhaps in the read, if the output looks correct in splash? The python code is agnostic about the snapshot format, and using run avoids the pure C++ write-to-disk and read cycle that is used in run_async.
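
Schematically, the difference I mean is something like this (a summary only, not the actual source):

```python
# run():       runs the C++ integrator and the python analysis sees the
#              snapshot data without the write-to-disk / read-from-disk cycle,
#              so the sf/su file reader is never exercised.
#
# run_async(): runs the C++ code, which writes snapshots to disk in the chosen
#              format (sf or su); after p.wait() the snapshots are read back
#              from file before the python analysis sees them.
#              A broken sf writer or reader only bites on this path.
```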

dhubber commented 6 years ago

In the C++ code for the sf format

But the read/write for sf runs fine. If I run a simulation with sf and then restart, it works fine, and splash displays the output fine too. I could be wrong of course, but if there's a fire here I can't even see the smoke :-)

using run avoids the pure c++ write to disk and read cycle that is used in run_aysnc.

What do you mean exactly? If it's not using the C++ code then what is it using??

rbooth200 commented 6 years ago

I don't have time to dig into the python right now, but it's hard to explain this from the python side: run_async uses the standard SimBuffer.load_snapshots that is used for reading completed simulations.

Could you test whether reading data from snapshots created by simulations that have already run is also failing? I would expect problems in both cases if it is a read/write issue. If this works with su but not sf then my hypothesis is correct; if it doesn't work with either, then the problem is SphSnapshot- or python-related.
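
Something along these lines would do it (a sketch only; loadsim, the run name and the "rho" quantity are placeholders for however you load an existing run, the point being just to exercise the same reading path without run_async involved):

```python
# Diagnostic sketch: read snapshots back from a simulation that has already
# finished, bypassing run_async entirely.  loadsim, the run name and the "rho"
# quantity are placeholders, not the exact calls.
from gandalf.analysis.facade import loadsim, CreateTimeData

sim = loadsim("FREEFALL1")          # load the existing snapshots from disk
data = CreateTimeData("rho")        # same collection step the unit tests use

# Works with su snapshots but not sf  -> the sf read/write is the culprit.
# Fails for both                      -> SphSnapshot or the python side.
```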

dhubber commented 6 years ago

Could you test whether reading data from snapshots created by simulations that have already run is also failing? I would expect problems in both cases if it is a read/write issue. If this works with su but not sf then my hypothesis is correct; if it doesn't work with either, then the problem is SphSnapshot- or python-related.

Okay, I'm a bit confused now. I ran a whole variety of tests reading and writing with both sf and su and confirmed that they all worked fine, so python must be the problem. Then I re-ran the python and it failed with the same error for BOTH run() and run_async(). So the original reason for opening this issue has now disappeared (I won't say fixed itself, since technically it's even more broken)!

However, I also understand the true source of the errors better now, and it seems to be this 'fickleness' in the rad_ws test that's the main cause, as you suggested, Richard. I found a bug in the hexagonal-array generation for periodic boundaries, which I fixed, but the particles now sit in slightly different positions, leading to very slightly different results (and hence test failure). I've just pushed with the rad_ws assertions commented out for now, to see if everything else is working fine. We'll need to decide how to deal with these fickle tests, since any slight change or even 'improvement' (e.g. to the ICs, physics or integration scheme) could lead to test failure.
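
One option would be a looser, tolerance-based check rather than a tightly pinned reference value (just a sketch; the tolerance below is a placeholder and would need to come from how much the rad_ws error norm actually drifts between otherwise-valid runs):

```python
# Sketch of a tolerance-based assertion for the fragile tests.  The number
# below is a placeholder, not a measured value.
RAD_WS_L1_TOLERANCE = 2.0e-2

def check_rad_ws_error(l1_error):
    # Small, legitimate changes (ICs, physics, integration scheme) shift the
    # error norm slightly; only fail if it moves well outside the expected band.
    assert l1_error < RAD_WS_L1_TOLERANCE, (
        "rad_ws L1 error norm %.3e exceeds tolerance %.3e"
        % (l1_error, RAD_WS_L1_TOLERANCE))
```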

Since this issue is no longer an issue of any kind (somehow), I'll close it and move any relevant discussion over to the appropriate pull requests. There was definitely something funny with the async before, but whatever conditions caused it have now changed. I'll keep an eye on it in future in case the problem ever resurfaces (and just to prove I'm not insane!)