Closed by sandroos, 9 years ago.
Unrelated, but for whatever reason I always get a core dump in transtest. The run completes, however.
It might be due to capitalism? (Excuse of the day.)
Is the core dump related to the pthread bug we've seen in recent weeks? What does totalview say? Seems strange that suddenly with only trivial and minor changes everything gets weird...
I think that the core dump has been there forever but there are no error messages.
After re-re-compiling everything and re-re-running, the master testpackage is going through again. I'll report if I encounter this issue again.
Looks weird, I will test myself to see what happens. Can you share the changes to the cfg file?
Maybe related or not, but now I'm seeing weird bugs on my laptop.
I ran test_trans project locally with MPI and some VLSV files have stuff in them after the footer. I've never seen this happen before, so I assume that there's a memory leak somewhere that happens to poke the footer contents while it is still in memory.
<VARIABLE arraysize="8000" datasize="8" datatype="uint" mesh="SpatialGrid" name="CellID" vectorsize="1">568</VARIABLE>
<VARIABLE arraysize="8000" datasize="4" datatype="uint" mesh="SpatialGrid" name="Blocks" vectorsize="1">213236</VARIABLE>
<VARIABLE arraysize="8000" datasize="8" datatype="float" mesh="SpatialGrid" name="rho" vectorsize="1">245236</VARIABLE>
</VLSV>
H�9����KH�9����KH�9����KH�9 <a lot more binary>
Never seen the latter problem either...
I've recompiled VLSV and Vlasiator a few times now and this is still reproducible. Basically when I do
export OMP_NUM_THREADS=1
mpirun -n 4 ./vlasiator --run_config=test_trans.cfg
on my laptop some of the vlsv files are corrupted (test_trans points to the config file in projects/test_trans).
This also does not appear every time I run test_trans, i.e., I may need to re-run it once or twice, which again hints at a memory leak or an out-of-bounds type of error.
Even on the occasions when it does work, the output is obviously not correct, so idk what's going on...
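To make re-running and checking less tedious, the corruption can be detected mechanically. Here is a minimal sketch (check_footer is a hypothetical helper, not part of Vlasiator or VLSV) that flags files whose last bytes are not the closing </VLSV> tag, assuming a clean file ends with its footer:

```shell
# Hypothetical helper: flag VLSV files that have extra bytes written
# after the closing </VLSV> footer tag. Just a sketch, not project code.
check_footer() {
  for f in "$@"; do
    # A clean file should end with the footer; inspect the last 16 bytes.
    if tail -c 16 "$f" | grep -q '</VLSV>'; then
      echo "OK: $f"
    else
      echo "CORRUPT: $f"
    fi
  done
}
```

Usage would be e.g. `check_footer bulk*.vlsv` after each test_trans run.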
So what specifically is wrong? Looks ok to me at this speed.
No it's not ok, the stuff is flowing out through a non-existing boundary. For some reason it behaves as the simulation box was much smaller (visit bounding box is correct).
ah indeed then...
I am trying to reproduce but cannot get it visualised properly...
VisIt hangs when you open one of the corrupt files.
tail *.vlsv | less
they weren't corrupt... at least nothing after the footers
One idea: could it be that neighbouring blocks are not created and then everything just escapes through velocity space boundaries? EDIT looks like it, setting a threshold of 0 gives larger blobs.
I have files which don't work: some open properly, some freeze VisIt.
analysator also complains about these files.
The footer is not opened in the corrupt files; the <VLSV> tag is missing at the top of the footer.
Does it get overwritten by something? It could actually be related: with no acceleration the velocity space is not updated properly (hence the bleeding out you got), some vector lengths are incorrect, and sometimes the opening of the footer gets overwritten?
It appears that there are multiple bugs.
The one causing the invisible "boundaries" is because velocity blocks do not get adjusted if acceleration is turned off, which is the case in test_trans (I fixed this in multipop but forgot to add to master). The fix is in https://github.com/fmihpc/vlasiator/tree/memory-bug-fixes
I still get corrupt VLSV files, however, so the hunt continues..
I suspect that there's a memory corruption somewhere, basically the code somewhere writes to a memory location that it is not supposed to touch, and quite often that location is in the footer and thus it gets corrupted.
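A typical way to hunt a stray write like that is to run under a memory checker. A sketch of the commands (standard valgrind/compiler options, nothing project-specific is assumed to exist):

```shell
# Run each MPI rank under valgrind; an out-of-bounds write into the
# footer buffer should show up as an "Invalid write" with a stack trace.
export OMP_NUM_THREADS=1
mpirun -n 4 valgrind --error-exitcode=1 --track-origins=yes \
    ./vlasiator --run_config=test_trans.cfg

# Alternatively, rebuild with AddressSanitizer (GCC/Clang) and run as usual:
#   CXXFLAGS += -fsanitize=address -g
```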
The test trans issue is clearly that adjust blocks is not done. I must have screwed up something in testing, since the testpack should have caught that. Can it be from before we again started using that regularly, probably...? Without actually looking at history and blame I am pretty sure it happened when implementing substepping in acceleration, that is when it was moved from main loop to acceleration. The good news is it mostly affects this one test case and not much else (= actual science runs we have done).
The vlsv one is more weird, since it has been very reliable for us at scale. On which platforms have these files been encountered? Has it only been seen in test_trans files?
Is it also happening with the vlsv version from before the optimized writing, where we started to write the footer with collective MPI instead of with serial POSIX?
Well it has been reliable up to the changes of the last few days...
Yep. There are at least two recent changes that might affect it: 1) the new vlsv write optimization, 2) the removal of the old writer interface. By testing before/after one should be able to pinpoint where it happened, or whether the reason is somewhere else.
Ok, on the third run I could reproduce the test_trans issue on Voima. Testing an older vlsv version.
VLSV version e1b3bb1c57956631f477660a557db1faa6374176 from Jan also exhibits this bug with the latest Vlasiator version. Testing older vlasiator version next.
The test trans issue is clearly that adjust blocks is not done. I must have screwed up something in testing, since the testpack should have caught that. Can it be from before we again started using that regularly, probably...?
Test package doesn't run test_trans project, it runs MultiPeak with multiple spatial cells (and that run is called test_trans). So that's why it wasn't caught.
Without actually looking at history and blame I am pretty sure it happened when implementing substepping in acceleration, that is when it was moved from main loop to acceleration. The good news is it mostly affects this one test case and not much else (= actual science runs we have done).
I think this one is on me, I had corrected it in multipop branch earlier.
@Sebastian, did you delete this message from github after posting it initially? I don't see it on the webpage, only in my mailbox.
On 19/05/15 17:46, Sebastian von Alfthan wrote:
It is not actually true that it has been reliable until the last few days. I tested a few versions, will continue tomorrow. Current results:
c88d5491bbf4eed7625a04462fe6ba11d35ddab5 Feb PASSED (well, 5 runs passed cleanly)
eee94f1832e18919b7b60e35770e6c16c8acd230 May 5 FAIL on third run (before merge of old vlsv removal, which was only in read)
72f29f90d2bfe7ec0c28275845974f3bed934293 Apr 9 FAIL on second run
Bug introduction not yet identified, but it seems to have happened before 72f29.
c88d5491bbf4eed76 broken for me EDIT ah but that's April, not Feb.
vlsv 42dd46d is the culprit.
I tested old vlsv versions against the latest Vlasiator master; the first one to break (in my test, i.e. grep "<VLSV>" bulk*vlsv) is "Optimized footer writing during close()": https://github.com/fmihpc/vlsv/commit/42dd46ddaf38603a704087c82de5e2e9f38af5e6.
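This manual version hunt could also be automated with git bisect in the vlsv checkout. A sketch (test.sh is a hypothetical script, not something in the repo, that rebuilds, reruns test_trans a few times, and greps the output files):

```shell
# In the vlsv checkout: let git bisect drive the search.
git bisect start
git bisect bad                 # current master produces corrupt files
git bisect good e1b3bb1c       # the January version believed clean
# test.sh (hypothetical): rebuild vlsv + vlasiator, run test_trans a few
# times, exit nonzero if any bulk*.vlsv lacks the "<VLSV>" footer tag.
git bisect run ./test.sh
```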
Well, I could also reproduce it with VLSV e1b3bb1c57956631f477660a557db1faa6374176 from January with the latest Vlasiator master...
That one is clean for me too. Does it make a difference that I don't compile with jemalloc?
I tested a few vlsv versions and got corrupt files, but that was before I fixed the velocity block adjustments. I should do more local testing.
For me, when it breaks, it breaks systematically, so no need to repeat runs. And it happens even if I have all blocks created.
For the record, I always get a core dump in test package 'test_trans' directory, may be related.
I think the core dump is related to the simulation closing which throws some pthread errors sometimes.
What is broken in /lustre/tmp/alfthan/vlasiator/testpackage/run_2015.05.20_10:23:28/transtest_2_maxw_500k_100k_20kms_20x20/fullf.0000001.vlsv? To me it is valid.
Ok, the break I had with the old vlsv version was actually related to testing too early.
So where is the actual test case (cfg and vlsv files?)
On my laptop I have one. vlasiator master projects/test_trans, nothing else. If you want the nominal behaviour without fixing the block thing, then change threshold to 0.
Should also copy paste the one-line fix from memory-bug-fixes (or something similar) branch, typing from my tablet :)
Happy traveling. The fix looks quite safe, and unless you see any reason not to I could merge it. Alternatively one could also move the adjust blocks to be after the goto label.
I ported by hand, still broken. So that fix is not the issue for the missing footer tag.
Ok, I tested the latest Vlasiator + vlsv and it is fine on Voima. So after all of this hullabaloo, this is then essentially only affecting the MPI version on laptops? Has this been shown on Voima?
Of course we still want it to work on all platforms. Could be an MPI library bug too.
I have Open MPI on my laptop and Voima has MPICH2, so it's always possible that they do things differently. However, in the past all weird-what-is-mpi-doing-now type bugs have been my own mess-ups.
Voima does memory allocation differently from Ubuntu on my laptop (or it might just behave differently). For example, on my laptop I'm very likely to get array elements initialized to zero when I create an array, while on Voima the elements have fairly random values. I've caught numerous bugs this way...
So I'm gonna do more bug hunting ...
@Arto, do you have jemalloc on your laptop? I now compiled without jemalloc on voima and get a similar bug as on my laptop (where I don't use jemalloc either).
@ykempf, where is your data?
/lustre/tmp/yann/debug/test_trans
This issue reports multiple problems related to VLSV writing and memory corruption:
OLD: I ran the test package with master and set a higher output frequency for bulk files in the Magnetosphere test, and this comes out (pay attention to the file time stamps):
The logfile shows (grep "Writing" logfile.txt) entries which basically correspond to the first five files. So this seems like a bug, somewhere...