gandalfcode / gandalf

GANDALF (Graphical Astrophysics code for N-body Dynamics And Lagrangian Fluids)
GNU General Public License v2.0

Tests #70

Closed - giovanni-rosotti closed this pull request 7 years ago

giovanni-rosotti commented 7 years ago

This branch runs some of the tests on Travis instead of only compiling the code. The results are not inspected, so the code might still produce nonsense, but bugs that lead to a crash will be caught. The downside is that Travis will now take much longer to check the code. The reason this was never merged is that with MPI one very often runs out of memory (at the moment we have no facility for reallocating the particle arrays), leading to "spurious" crashes. At some point the code was modified to allocate much more memory than we need, and I think the tests will now pass (let's see what the Travis result is). But this is a poor man's solution and we should address the real problem instead of keeping a workaround. We already have an issue open for the reallocation, so in the meantime I think we can proceed with this pull request (provided the Travis check passes).

giovanni-rosotti commented 7 years ago

Ok, so we still have problems - as you can see, the tests with MPI failed. There seem to be two reasons: one is insufficient memory in the tree, which can be cured with a workaround for now; the other is seg faults in the MPI communications, which is more serious... I'll see what I find.

rbooth200 commented 7 years ago

Did you try rebasing onto master at all? I guess this branch is quite out of date at the moment. From eyeballing the Travis tests I think it would be useful to get gandalf to print out the test problem being run - it's not easy to tell which tests crash.

giovanni-rosotti commented 7 years ago

I didn't rebase (but shouldn't we merge instead?), but the second test that Travis runs should reflect the state of master after the pull request has been accepted. So it is actually quite worrying that we got seg faults... (unless it's related to insufficient memory not being checked properly). Agree about printing the name of the parameter file being run

rbooth200 commented 7 years ago

Agree about printing the name of the parameter file being run

We could also just print the IC routine that is called instead.

I have been rebasing before merges recently, which seems to work well, but a merge here might also be fine.

As for the seg faults, there are so many possibilities. From the error message with gcc I'd be tempted to suggest that it's a bounds error somewhere.

giovanni-rosotti commented 7 years ago

I've implemented a poor man's solution by multiplying the memory allocations by the number of MPI processes. This should guarantee that there is always enough memory (I think; the only exception might be pathological situations with ghosts). Everything now runs fine on my laptop; let's see whether Travis agrees. We should then be able to merge this into master, but I don't think we want to keep such a huge memory allocation for long.
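
For concreteness, a minimal sketch of what this workaround amounts to (hypothetical names and structure, not the actual GANDALF allocation code): the per-process allocation is simply scaled by the number of MPI ranks.

```cpp
// Sketch only: scale the particle allocation by the number of MPI ranks so
// any single rank can hold far more particles than its nominal share.
// Particle, AllocateParticleArray and safety_factor are placeholder names.
#include <mpi.h>
#include <cstddef>
#include <vector>

struct Particle { double r[3], v[3], m; };

std::vector<Particle> AllocateParticleArray(int Nhydro, double safety_factor)
{
  int Nmpi = 1;
  MPI_Comm_size(MPI_COMM_WORLD, &Nmpi);

  // Over-allocate: wasteful, but avoids any need to reallocate mid-run.
  const std::size_t Nmax =
      static_cast<std::size_t>(safety_factor * Nhydro) * Nmpi;
  return std::vector<Particle>(Nmax);
}
```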

giovanni-rosotti commented 7 years ago

As usual I have spoken too soon - we still have a problem with DUSTYBOX when using MPI... I'll look into it

giovanni-rosotti commented 7 years ago

So the test actually runs fine on my laptop. From what I've found online, the error I am getting (one of the processes simply being killed with signal 9) is due to running out of memory. This is worrying given that the test uses only 8000 particles. Nevertheless, for the time being I've tried switching to other test machines provided by Travis that should have more memory.

rbooth200 commented 7 years ago

I wonder if that means there is a memory leak in the dust, although it would be strange for it to appear only under MPI.

rbooth200 commented 7 years ago

I can have a look when I get back on Friday if there is no progress before then

giovanni-rosotti commented 7 years ago

It's not a leak, as the memory usage is constant in time. But it is incredibly high: 2.5 GB per process! If I run DUSTYSEDOV, which uses the same number of particles, the memory usage remains sensible (50 MB per process). I wonder where the difference is coming from

rbooth200 commented 7 years ago

Strange, I can't see a good reason at a glance. The two tests are using different algorithms though, which might explain the difference, if not the problem

rbooth200 commented 7 years ago

Ok, so this issue has got nothing to do with the dust. I ran it through valgrind's heap analyser to see what's going on. The memory is being used 50/50 by the particle array and the tree. The reason we are using so much memory is that we are allocating space for far too many particles (600k * Nmpi), mostly because of the allocation for ghosts and the safety factors. At least we can be sure this won't get out of hand when increasing the simulation size, but I don't think we should be allocating so much memory for a small simulation.

We need to have a careful think about the memory model and what we are doing with ghosts...

giovanni-rosotti commented 7 years ago

I see. I am curious to know how we get from 8k to 600k... Anyway, I still don't get why the problem doesn't come up with DUSTYSEDOV, which also uses ghosts

rbooth200 commented 7 years ago

It's run in 2D though, which is the big difference - it allocates space for something like 30-50 times fewer ghosts

giovanni-rosotti commented 7 years ago

Almost there. DUSTYWAVE is failing, but only with hybrid parallelization, and only with gcc. Grrr...

giovanni-rosotti commented 7 years ago

Thinking about it, clang does not support OpenMP, so it is not surprising that only gcc fails. This points to a problem in the hybrid parallelisation of the dust. That is quite possible, as I don't remember ever testing it.

giovanni-rosotti commented 7 years ago

My last commit is not really meant to solve the problem, but to debug it I needed a valgrind-clean version of the code (which is now the case). While doing so, I noticed that there is duplicate code in ProcessParameters between the meshless and SPH. I actually think we should have a similar function in the base class - what do you think?

rbooth200 commented 7 years ago

I rebased the tests branch onto master now that we have reallocation set up. Most tests are now running fine, but there was a crash with the dustywave test and hybrid parallelization.

rbooth200 commented 7 years ago

I still can't reproduce this locally and memory checkers don't spot anything (except a ton of leaks).

giovanni-rosotti commented 7 years ago

Same here, I'm really at a loss. I do have a couple of commits that I'll push now, but they correct unrelated things. I need to find the moral strength to do it, but tomorrow I am going to install Ubuntu in a virtual machine and replicate the exact same environment as Travis.

giovanni-rosotti commented 7 years ago

So I finally did what I promised with the virtual machine and I was able to reproduce the error. I haven't yet understood what is going wrong, but I have some clues. During tree construction we call DivideTreeCell repeatedly until we arrive at a leaf cell. At that point we stock it; during stocking the number of particles in the cell is recomputed using inext, ifirst and ilast. The problem is that the number of particles before stocking differs from the number after! The only thing I can think of is that QuickSelect has a bug (maybe an obscure edge case) and is scrambling the ids array, which is used to set up ifirst, ilast and inext. Help welcome!

rbooth200 commented 7 years ago

I can have a look later. But in the meantime you could test it by fully sorting the particle ids. This could be done robustly using std::sort and a simple functor class. That way you should at least be able to rule the QuickSelect algorithm in or out...
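
Something along these lines, as a sketch (Particle and the member names are placeholders, not the real GANDALF arrays):

```cpp
// Sketch of the suggestion above: fully sort the particle ids along the
// split dimension with std::sort and a small functor, taking QuickSelect
// out of the picture entirely.
#include <algorithm>
#include <vector>

struct Particle { double r[3]; };

struct CompareAlongDimension {
  const Particle* partdata;   // particle array
  int k;                      // split dimension (0, 1 or 2)
  bool operator()(int i, int j) const {
    return partdata[i].r[k] < partdata[j].r[k];
  }
};

void SortIdsAlongDimension(std::vector<int>& ids, const Particle* partdata, int k)
{
  std::sort(ids.begin(), ids.end(), CompareAlongDimension{partdata, k});
}
```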

rbooth200 commented 7 years ago

There is also the std::nth_element function, which is O(n) and does what we want (I think). Why don't we just use that?
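
Roughly like this (again a sketch with placeholder names): std::nth_element leaves the chosen element in its sorted position and partitions everything else around it, which is exactly the left/right split the tree division needs.

```cpp
// Sketch only: split the ids array about its median along dimension k
// using std::nth_element.  Particle and the names here are placeholders.
#include <algorithm>
#include <cstddef>
#include <vector>

struct Particle { double r[3]; };

void SplitAtMedian(std::vector<int>& ids, const Particle* partdata, int k)
{
  const std::size_t mid = ids.size() / 2;
  std::nth_element(ids.begin(), ids.begin() + mid, ids.end(),
                   [partdata, k](int i, int j) {
                     return partdata[i].r[k] < partdata[j].r[k];
                   });
  // Elements before ids[mid] compare no greater than ids[mid] along
  // dimension k, elements after compare no less; no further ordering
  // within each half is guaranteed.
}
```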

dhubber commented 7 years ago

Oh, that's not good if it's a QuickSelect problem. And that could affect many things if true. I thought I'd figured out all the edge cases, but I guess not.

dhubber commented 7 years ago

What does that do exactly? Does it sort everything left and right of the selected number, or simply find the value of that element? We also need to check whether the left-right division is done correctly after the sort when setting up the pointers. Btw, what value of Nleafmax are you using?

rbooth200 commented 7 years ago

Does it sort everything left and right of the selected number, or simply find the value of that element?

I believe it does the former (it partitions everything around the selected element), probably using quickselect itself or a median-of-medians algorithm:

http://www.cplusplus.com/reference/algorithm/nth_element/ http://stackoverflow.com/questions/29145520/how-is-nth-element-implemented

dhubber commented 7 years ago

ok, I have some more leads on this problem. It seems the issue is not QuickSelect, because when the select/sort happens there is still the same (correct) number of particles on each side of the division. The immediate issue seems to be that some of the particles are now dead, i.e. their dead flags are set. So when the tree-stocking routine is called, it skips over the dead particles when computing the new cell.N value and therefore appears to have fewer particles, even though they are still there in the data structure. If I switch to 1 OpenMP thread, there are no dead particles any more and the code runs to completion.

So now the question becomes: why are there dead particles? There's no accretion or anything weird that should cause particles to die, right? Or is this simply some memory leak/corruption going on? Any ideas? I'll post updates if I get any further with this problem.
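
For reference, a minimal sketch of the behaviour just described (simplified, hypothetical names, not the real tree-stocking code): the stocking pass skips dead particles while the linked list still contains them, so the recomputed count falls below the one recorded at division time.

```cpp
// Sketch only: count the active particles in a leaf by walking the
// ifirst/inext linked list and skipping anything flagged dead.  The result
// can be smaller than the cell.N stored when the cell was divided, even
// though nothing was removed from the list.
#include <vector>

struct Particle { bool dead = false; };

int CountActiveInLeaf(int ifirst, int ilast,
                      const std::vector<int>& inext,
                      const std::vector<Particle>& part)
{
  int N = 0;
  for (int i = ifirst; i != -1; i = inext[i]) {
    if (!part[i].dead) ++N;     // dead particles are skipped here
    if (i == ilast) break;      // ilast terminates the cell's list
  }
  return N;
}
```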

giovanni-rosotti commented 7 years ago

I'm not sure it's memory corruption as valgrind didn't find anything (but on the other hand it didn't crash when I was running under valgrind, so it's not really a valid point). But yes, I don't see why there should be dead particles. This is clearly a bug.

Good to know that QuickSelect is fine at least...


giovanni-rosotti commented 7 years ago

I am an idiot - this is MPI, so of course there are going to be dead particles. But DeleteDeadParticles is called just before BuildTree, so there shouldn't be any dead particles at the time of tree construction.


giovanni-rosotti commented 7 years ago

I did some debugging and found that the crash is not happening when building the main tree, but when building the MPI ghost tree. The reason is that some of the MPI ghosts are dead. I'm not sure how this is possible, but when they are received over MPI they are already dead. I guess it means we created an MPI ghost replica of a dead real particle, although I don't see how that's possible given that BuildTree runs immediately before the search for MPI ghosts, and it is supposed to get rid of the dead particles.

dhubber commented 7 years ago

You just beat me to it! I was going to say that the problem is happening either in the periodic or MPI ghost tree build (but probably the MPI ghost tree build). I think the DeleteParticles routine only deletes local dead particles so wouldn't touch any dead MPI ghosts? Anyway, I guess we need to locate where the MPI ghosts become 'dead'!

giovanni-rosotti commented 7 years ago

Aha! Yes, I agree, DeleteParticles only deals with real particles. But my point is that no MPI ghost should be dead: the MPI ghost tree is built soon after the ghosts have been found, so they haven't had time to die. And the ghosts in turn are found soon after the tree for the real particles has been rebuilt, so dead particles should already have been removed from the array (hence no stillborn ghost particles). So I don't understand how we can end up with dead MPI ghost particles...

dhubber commented 7 years ago

Sure, I know that we should still not have any dead ghosts! I guess we can add some asserts where the MPI ghosts are imported and see whether it happens during the import or only later? Assuming they are not already dead when loaded, we then need to catch where they become dead! Let's see if it's that easy anyway!
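
Something like the following, as a sketch (hypothetical names and layout, not the actual import code): assert right after unpacking that no imported ghost already carries the dead flag, so we can tell "arrived dead" apart from "died later".

```cpp
// Sketch only: check imported MPI ghosts immediately after they are
// unpacked.  Particle, first_ghost and Nghost are placeholder names.
#include <cassert>
#include <vector>

struct Particle { bool dead = false; };

void CheckImportedGhosts(const std::vector<Particle>& part,
                         int first_ghost, int Nghost)
{
  for (int i = first_ghost; i < first_ghost + Nghost; ++i) {
    assert(!part[i].dead && "MPI ghost arrived already flagged dead");
  }
}
```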

dhubber commented 7 years ago

Okay, I hope I've fixed the problem (at least for the purposes of getting these tests to work). It seems the issue was in HydroTree::SearchMpiGhostParticles, where the local ghost tree should have contained 0 particles but still held particles from a previous step. The tree was perhaps not re-initialised correctly, or had some other issue. Either way, it was generating particles which did not exist, and these were then sent to the other processors as MPI ghosts (which were dead). I think this only happens with MPI because you can have one step with periodic ghosts and the next without, whereas with OpenMP-only you always have periodic ghosts every step. For now I've put in a simple hack where the periodic tree is ignored if there are no periodic ghosts, but ultimately it should be changed to ensure the tree data structure is always valid.

There was also another, smaller bug I found: the particle memory was being reallocated in SearchBoundaryGhostParticles (near the end), but if that happens the search should be repeated, so I've added a do-while loop that repeats the search whenever a reallocation is required. I think this should also be changed to avoid adding ghosts to the main arrays (so as not to overflow them), but I haven't done that for now.
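
The pattern is roughly the following (a sketch with placeholder names, not the actual SearchBoundaryGhostParticles code): if a pass of the search had to grow the particle array, the whole search is simply run again against the reallocated array.

```cpp
// Sketch only: repeat the ghost search whenever it triggered a
// reallocation, since anything gathered against the old array may be stale.
#include <cstddef>
#include <vector>

struct Particle { double r[3]; };

// Placeholder for the real search: appends ghosts to `part` and reports
// whether the underlying storage had to be reallocated (grown).
bool SearchGhostsOnce(std::vector<Particle>& part)
{
  const std::size_t old_capacity = part.capacity();
  // ... find boundary ghosts and append them to `part` ...
  return part.capacity() != old_capacity;
}

void SearchBoundaryGhosts(std::vector<Particle>& part)
{
  bool reallocated;
  do {
    reallocated = SearchGhostsOnce(part);
  } while (reallocated);   // redo the full search if the array moved
}
```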

I've pushed this and am just seeing what Travis thinks of it first. It now runs on the virtual machine anyway, so fingers crossed it's fixed.

rbooth200 commented 7 years ago

I think the changes to SearchBoundaryGhostParticles might be introducing bugs rather than removing them. The reallocate branch implemented reallocation in this function so that it is handled on the fly - CreateBoundaryGhostParticle does the reallocation before inserting the ghost particle.

I believe the reallocation in hydro tree is just for reallocating levelneibbuf, which can safely be done at the end of SearchBoundaryGhostParticles.

dhubber commented 7 years ago

Ahh, ok, I didn't realise the reallocation was hidden all the way down there. Yes, I think I confused the tree reallocation with the particle array reallocation. But no, it does not introduce any new bugs; it simply performs an unnecessary extra ghost particle search. I'll remove it then.

giovanni-rosotti commented 7 years ago

Ok, good - I agree with what Richard has said about reallocation. But I think we need to investigate why the periodic tree contains dead particles. We are looking for the ghosts and rebuilding the periodic ghost tree at every step by default (remember Ntreebuildstep=1), so I don't see how we can end up with a corrupted periodic ghost tree.

dhubber commented 7 years ago

Basically it's because we're sending the wrong information to BuildTree, which subsequently leads to the linked-list variables (e.g. ifirst and ilast) being given garbage values. This doesn't affect a tree that actually has particles, but if there are supposed to be no particles then it gets it wrong. And when you build the tree with particles on one step and with no particles on the next, it can lead to this kind of nonsense. I think I know how to fix this properly (i.e. by rewriting parts of BuildTree), but I first wanted to verify that my simple hack (ignoring the ghost tree if NPeriodicGhost = 0) works, so I know this really is the cause. Once we've verified that, I can think about implementing the better solution!
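
For clarity, the hack amounts to something like this (a sketch with placeholder names, not the actual BuildTree code): skip the periodic ghost tree entirely when there are no periodic ghosts, so a stale tree from a previous step cannot inject phantom particles.

```cpp
// Sketch only: guard against reusing a stale ghost tree when there are no
// periodic ghosts this step.  GhostTree and the names are placeholders.
struct GhostTree { int Ntot = 0; /* cells, linked lists, ... */ };

void BuildOrSkipGhostTree(GhostTree& tree, int NPeriodicGhost)
{
  if (NPeriodicGhost == 0) {
    tree.Ntot = 0;   // leave the tree explicitly empty; don't touch old cells
    return;          // proper fix: make the tree build handle N == 0 correctly
  }
  // ... normal periodic-ghost tree build/stock for NPeriodicGhost > 0 ...
}
```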

giovanni-rosotti commented 7 years ago

See #102. Closing this pull request.