VLSV write and memory corruption issues, related to jemalloc usage #149

Closed: sandroos closed this 9 years ago

sandroos commented 9 years ago

This issue collects multiple problems related to VLSV writing and memory corruption:

  1. First reported: the Magnetosphere output frequency bug
    • INVALID (?)
  2. test_trans (the project, not the testpackage run) has binary data after the footer
  3. test_trans footer is missing its start tag
  4. core file produced by the testpackage test_trans run
    • Has been reproduced: segmentation fault in je_free (right at the end).
    • Discussed further in #152

OLD: I ran the test package with master and set a higher output frequency for bulk files in the Magnetosphere test, and this comes out (pay attention to the file time stamps):

-rw-rw---- 1 sandroos ext-usr 1695250 May 18 11:50 bulk.0000000.vlsv
-rw-rw---- 1 sandroos ext-usr 2052862 May 18 11:51 bulk.0000001.vlsv
-rw-rw---- 1 sandroos ext-usr 2181310 May 18 11:52 bulk.0000002.vlsv
-rw-rw---- 1 sandroos ext-usr 2507242 May 18 11:53 bulk.0000003.vlsv
-rw-rw---- 1 sandroos ext-usr 1927480 May 18 11:49 bulk.0000004.vlsv
-rw-rw---- 1 sandroos ext-usr 1941412 May 18 11:49 bulk.0000005.vlsv
-rw-rw---- 1 sandroos ext-usr 1939168 May 18 11:49 bulk.0000006.vlsv
-rw-rw---- 1 sandroos ext-usr 1955932 May 18 11:49 bulk.0000007.vlsv
-rw-rw---- 1 sandroos ext-usr 1978408 May 18 11:49 bulk.0000008.vlsv
-rw-rw---- 1 sandroos ext-usr 2003032 May 18 11:49 bulk.0000009.vlsv
-rw-rw---- 1 sandroos ext-usr 2014768 May 18 11:49 bulk.0000010.vlsv
-rw-rw---- 1 sandroos ext-usr 2025916 May 18 11:49 bulk.0000011.vlsv
-rw-rw---- 1 sandroos ext-usr 2027200 May 18 11:49 bulk.0000012.vlsv
-rw-rw---- 1 sandroos ext-usr 2031880 May 18 11:49 bulk.0000013.vlsv
-rw-rw---- 1 sandroos ext-usr 2040244 May 18 11:49 bulk.0000014.vlsv
-rw-rw---- 1 sandroos ext-usr 2052172 May 18 11:49 bulk.0000015.vlsv
-rw-rw---- 1 sandroos ext-usr 2057476 May 18 11:49 bulk.0000016.vlsv
-rw-rw---- 1 sandroos ext-usr 2059612 May 18 11:49 bulk.0000017.vlsv
-rw-rw---- 1 sandroos ext-usr 2063620 May 18 11:49 bulk.0000018.vlsv
-rw-rw---- 1 sandroos ext-usr 2064544 May 18 11:49 bulk.0000019.vlsv
-rw-rw---- 1 sandroos ext-usr 2066512 May 18 11:49 bulk.0000020.vlsv

The logfile shows (grep "Writing" logfile.txt):

(IO): Writing spatial cell and reduced system data to disk, tstep = 0 t = 0
(IO): Writing spatial cell and reduced system data to disk, tstep = 360 t = 10.0175
(IO): Writing spatial cell and reduced system data to disk, tstep = 719 t = 20.0071
(IO): Writing spatial cell and reduced system data to disk, tstep = 1079 t = 30.0245
(IO): Writing spatial cell and reduced system data to disk, tstep = 1438 t = 40.0142
(IO): Writing spatial cell and reduced system data to disk, tstep = 1797 t = 50.0038

which basically correspond to the first five files. So this seems like a bug somewhere...

sandroos commented 9 years ago

Unrelated, but for whatever reason I always get a core dump in transtest. The run completes, however.

ykempf commented 9 years ago

It might be due to capitalism? (Excuse of the day.)

Is the core dump related to the pthread bug we've seen in recent weeks? What does TotalView say? It seems strange that suddenly, with only trivial and minor changes, everything gets weird...

sandroos commented 9 years ago

I think the core dump has been there forever, but there are no error messages.

After re-re-compiling everything and re-re-running, the master testpackage goes through again. I'll report if I encounter this issue again.

galfthan commented 9 years ago

Looks weird; I will test it myself to see what happens. Can you share the changes to the cfg file?

sandroos commented 9 years ago

Maybe related, maybe not, but now I'm seeing weird bugs on my laptop.

I ran the test_trans project locally with MPI and some VLSV files have data in them after the footer. I've never seen this happen before, so I assume there's a memory error somewhere that happens to poke the footer contents while they are still in memory.

sandroos commented 9 years ago

   <VARIABLE arraysize="8000" datasize="8" datatype="uint" mesh="SpatialGrid" name="CellID" vectorsize="1">568</VARIABLE>
   <VARIABLE arraysize="8000" datasize="4" datatype="uint" mesh="SpatialGrid" name="Blocks" vectorsize="1">213236</VARIABLE>
   <VARIABLE arraysize="8000" datasize="8" datatype="float" mesh="SpatialGrid" name="rho" vectorsize="1">245236</VARIABLE>
</VLSV>
H�9����KH�9����KH�9����KH�9 <a lot more binary>
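
A quick way to spot this kind of damage is to check that each file ends with the closing footer tag. The sketch below is illustrative only and not part of the repository; it assumes a healthy .vlsv file ends with "</VLSV>" followed by at most trailing whitespace (the helper name and tail size are made up):

    #include <algorithm>
    #include <cctype>
    #include <fstream>
    #include <iostream>
    #include <string>

    // Read the last bytes of a file and verify it ends with "</VLSV>".
    bool footerLooksIntact(const std::string& fname) {
        std::ifstream f(fname, std::ios::binary | std::ios::ate);
        if (!f) return false;
        const std::streamoff size = f.tellg();
        const std::streamoff n = std::min<std::streamoff>(size, 256);
        std::string tail(static_cast<size_t>(n), '\0');
        f.seekg(size - n);
        f.read(&tail[0], n);
        // Strip trailing whitespace before comparing against the tag.
        while (!tail.empty() && std::isspace(static_cast<unsigned char>(tail.back())))
            tail.pop_back();
        const std::string tag = "</VLSV>";
        return tail.size() >= tag.size() &&
               tail.compare(tail.size() - tag.size(), tag.size(), tag) == 0;
    }

    int main(int argc, char** argv) {
        for (int i = 1; i < argc; ++i)
            std::cout << argv[i] << ": "
                      << (footerLooksIntact(argv[i]) ? "ok" : "trailing garbage?") << "\n";
    }

Run e.g. as ./checkfooter bulk*.vlsv.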
galfthan commented 9 years ago

Never seen the latter problem either...

sandroos commented 9 years ago

I've recompiled VLSV and Vlasiator a few times now and this is still reproducible. Basically, when I do

export OMP_NUM_THREADS=1
mpirun -n 4 ./vlasiator --run_config=test_trans.cfg

on my laptop, some of the VLSV files are corrupted (test_trans here points to the config file in projects/test_trans).

sandroos commented 9 years ago

This also does not appear every time I run test_trans, i.e., I may need to re-run it once or twice, which again hints at a memory corruption or out-of-bounds type of error.

sandroos commented 9 years ago

Even on the occasions when it does work, the output is obviously not correct, so I don't know what's going on...

(movie attachment)

ykempf commented 9 years ago

So what specifically is wrong? It looks OK to me at this speed.

sandroos commented 9 years ago

No, it's not OK: the plasma is flowing out through a non-existent boundary. For some reason it behaves as if the simulation box were much smaller (the VisIt bounding box is correct).

ykempf commented 9 years ago

ah indeed then...

ykempf commented 9 years ago

I am trying to reproduce but cannot get it visualised properly...

sandroos commented 9 years ago

VisIt hangs when you open one of the corrupt files.

tail *.vlsv | less

ykempf commented 9 years ago

They weren't corrupt... at least there was nothing after the footers.

ykempf commented 9 years ago

One idea: could it be that neighbouring blocks are not created, and then everything just escapes through velocity space boundaries? EDIT: looks like it; setting a threshold of 0 gives larger blobs.

ykempf commented 9 years ago

I have files which don't work: some open properly, some freeze VisIt.

ykempf commented 9 years ago

analysator also complains about these files.

ykempf commented 9 years ago

The footer is not opened in the corrupt files: the <VLSV> tag is missing at the top of the footer.

ykempf commented 9 years ago

Does it get overwritten by something? It could actually all be related: without acceleration the velocity space is not updated properly (hence the bleeding out you got), some vector lengths are incorrect, and sometimes the opening of the footer gets overwritten?

sandroos commented 9 years ago

It appears that there are multiple bugs.

The one causing the invisible "boundaries" is that velocity blocks do not get adjusted if acceleration is turned off, which is the case in test_trans (I fixed this in the multipop branch but forgot to add it to master). The fix is in https://github.com/fmihpc/vlasiator/tree/memory-bug-fixes
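
Schematically, the control-flow bug looks like the sketch below (stub functions standing in for the real solver steps; this is an illustration of the described logic, not the actual Vlasiator source):

    #include <iostream>

    // Stubs standing in for the real solver steps.
    void translate()            { std::cout << "translate\n"; }
    void accelerate()           { std::cout << "accelerate\n"; }
    void adjustVelocityBlocks() { std::cout << "adjust blocks\n"; }

    // Buggy flow: block adjustment was moved inside the acceleration
    // step, so turning acceleration off (as test_trans does) also stops
    // velocity blocks from being created and removed.
    void timestepBuggy(bool accelerationEnabled) {
        translate();
        if (accelerationEnabled) {
            accelerate();
            adjustVelocityBlocks();  // skipped when acceleration is off!
        }
    }

    // Fixed flow, schematically: adjust blocks unconditionally.
    void timestepFixed(bool accelerationEnabled) {
        translate();
        if (accelerationEnabled) accelerate();
        adjustVelocityBlocks();
    }

    int main() {
        timestepBuggy(false);  // prints no "adjust blocks" line
        timestepFixed(false);
    }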

I still get corrupt VLSV files, however, so the hunt continues...

sandroos commented 9 years ago

I suspect that there's memory corruption somewhere: the code writes to a memory location that it is not supposed to touch, and quite often that location is in the footer, which thus gets corrupted.
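
For illustration only (this is not Vlasiator code): the class of bug suspected here is a heap write past the end of one allocation landing in a neighbouring one, e.g. the in-memory footer. Whether the overflow hits the footer, allocator metadata (cf. the je_free segfault in #152), or nothing visible depends on how the allocator laid out the heap, which is why the symptom is intermittent and allocator-dependent:

    #include <cstring>
    #include <iostream>

    int main() {
        // Two unrelated heap allocations; the allocator may place them
        // next to each other.
        char* data   = new char[16];
        char* footer = new char[16];
        std::strcpy(footer, "<VLSV>");

        // Buggy code: writes 32 bytes into a 16-byte buffer (undefined
        // behaviour). The overflow may clobber the footer buffer, hit
        // allocator bookkeeping, or do nothing visible.
        std::memset(data, 0xAB, 32);

        std::cout << footer << "\n";  // may no longer print "<VLSV>"
        delete[] footer;
        delete[] data;                // may abort inside the allocator
    }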

galfthan commented 9 years ago

The test_trans issue is clearly that the block adjustment is not done. I must have screwed up something in testing, since the testpackage should have caught that. Could it be from before we started using it regularly again? Probably... Without actually looking at the history and blame, I am pretty sure it happened when implementing substepping in acceleration, that is, when block adjustment was moved from the main loop into acceleration. The good news is that it mostly affects this one test case and not much else (= the actual science runs we have done).

The VLSV one is weirder, since it has been very reliable for us at scale. On which platforms have these files been encountered? Has it only been seen in test_trans files?

galfthan commented 9 years ago

Is it also happening with the VLSV version from before the optimized writing, where we started to write the footer with collective MPI instead of serial POSIX?
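
For context, a minimal sketch of what a collective MPI-IO footer write can look like; the function and variable names are illustrative assumptions, not the actual VLSV API:

    #include <mpi.h>
    #include <string>

    // All ranks must enter the collective call, but only rank 0
    // contributes the footer bytes; the others write zero bytes.
    void writeFooterCollective(MPI_File file, MPI_Offset footerOffset,
                               const std::string& footerXML, int myRank) {
        const char* buf = (myRank == 0) ? footerXML.data() : nullptr;
        const int count = (myRank == 0) ? static_cast<int>(footerXML.size()) : 0;
        MPI_File_write_at_all(file, footerOffset, buf, count,
                              MPI_CHAR, MPI_STATUS_IGNORE);
    }

If footerOffset were computed inconsistently across ranks in a scheme like this, a stray write could land on the footer's opening bytes, which would be consistent with the missing <VLSV> tag reported above.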

ykempf commented 9 years ago

Well, it has been reliable up to the changes of the last few days...

galfthan commented 9 years ago

Yep. There are at least two recent changes that might affect it: 1) the new VLSV write optimization, 2) the removal of the old writer interface. By testing before and after each one, we should be able to pinpoint where it happened, or whether the reason is somewhere else.

galfthan commented 9 years ago

OK, I could reproduce the test_trans issue on the third run on Voima. Testing an older VLSV version.

galfthan commented 9 years ago

VLSV version e1b3bb1c57956631f477660a557db1faa6374176 from January also exhibits this bug with the latest Vlasiator version. Testing an older Vlasiator version next.

sandroos commented 9 years ago

> The test_trans issue is clearly that the block adjustment is not done. I must have screwed up something in testing, since the testpackage should have caught that. Could it be from before we started using it regularly again? Probably...

The test package doesn't run the test_trans project; it runs MultiPeak with multiple spatial cells (and that run is called test_trans). That's why it wasn't caught.

> Without actually looking at the history and blame, I am pretty sure it happened when implementing substepping in acceleration, that is, when block adjustment was moved from the main loop into acceleration. The good news is that it mostly affects this one test case and not much else (= the actual science runs we have done).

I think this one is on me; I had corrected it in the multipop branch earlier.

ykempf commented 9 years ago

@Sebastian, did you delete this message from GitHub after posting it initially? I don't see it on the webpage, only in my mailbox.

On 19/05/15 17:46, Sebastian von Alfthan wrote:

It is not actually true that it has been reliable until the last few days. I tested a few versions, will continue tomorrow. Current results:

c88d5491bbf4eed7625a04462fe6ba11d35ddab5 (Feb) PASSED (well, 5 runs passed cleanly)
eee94f1832e18919b7b60e35770e6c16c8acd230 (May 5) FAIL on third run (before the merge of the old vlsv removal, which was only in read)
72f29f90d2bfe7ec0c28275845974f3bed934293 (Apr 9) FAIL on second run

Bug introduction not yet identified, but seems to have happened before 72f29


ykempf commented 9 years ago

c88d5491bbf4eed76 is broken for me. EDIT: ah, but that's April, not February.

ykempf commented 9 years ago

vlsv 42dd46d is the culprit. I tested old vlsv versions against the latest Vlasiator master; the first one to break (in my test, i.e. grep "<VLSV>" bulk*vlsv) is "Optimized footer writing during close()": https://github.com/fmihpc/vlsv/commit/42dd46ddaf38603a704087c82de5e2e9f38af5e6.

galfthan commented 9 years ago

Well, I could also reproduce it with VLSV e1b3bb1c57956631f477660a557db1faa6374176 from January with the latest Vlasiator master...

ykempf commented 9 years ago

That one is clean for me too. Does it make a difference that I don't compile with jemalloc?

sandroos commented 9 years ago

I tested a few VLSV versions and got corrupt files, but that was before I fixed the velocity block adjustments. I should do more local testing.

ykempf commented 9 years ago

For me it breaks systematically when it breaks; no need to repeat runs. And that is even when I have all blocks created.

sandroos commented 9 years ago

For the record, I always get a core dump in the test package 'test_trans' directory; it may be related.

ykempf commented 9 years ago

I think the core dump is related to the simulation shutdown, which sometimes throws pthread errors.

ykempf commented 9 years ago

What is broken in /lustre/tmp/alfthan/vlasiator/testpackage/run_2015.05.20_10:23:28/transtest_2_maxw_500k_100k_20kms_20x20/fullf.0000001.vlsv? To me it is valid.

galfthan commented 9 years ago

OK, the break I had with the old VLSV version was actually caused by testing too early.

So where is the actual test case (cfg and vlsv files)?

ykempf commented 9 years ago

On my laptop I have one: Vlasiator master, projects/test_trans, nothing else. If you want the nominal behaviour without fixing the block issue, then change the threshold to 0.

sandroos commented 9 years ago

You should also copy-paste the one-line fix from the memory-bug-fixes branch (or something similarly named); typing from my tablet :)

galfthan commented 9 years ago

Happy traveling. The fix looks quite safe, and unless you see any reason not to, I could merge it. Alternatively, one could move the block adjustment to after the goto label.

ykempf commented 9 years ago

I ported it by hand; still broken. So that fix is not the cause of the missing footer tag.

galfthan commented 9 years ago

OK, I tested the latest Vlasiator + VLSV and it is fine on Voima. So after all of this hullabaloo, is this essentially only affecting the MPI version on laptops? Has this been shown on Voima?

Of course we still want it to work on all platforms. It could be an MPI library bug too.

sandroos commented 9 years ago

I have Open MPI on my laptop and Voima has MPICH2, so it's always possible that they do things differently. However, in the past all weird what-is-MPI-doing-now type bugs have been my own mess-ups.

Voima does memory allocation differently from Ubuntu on my laptop (or it might just behave differently). For example, on my laptop newly created array elements are very likely to be initialized to zero, while on Voima the elements have fairly random values. I've caught numerous bugs this way...
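
(Illustrative aside, not from the thread: in C++ the contents of a plain new[] allocation are indeterminate. Fresh pages from the OS are often zeroed while recycled heap memory is not, so the same uninitialized-read bug can be invisible on one machine and obvious on another.)

    #include <iostream>

    int main() {
        double* a = new double[8];    // contents indeterminate: may look
                                      // zeroed on one system, garbage on
                                      // another
        double* b = new double[8]();  // value-initialized: guaranteed zeros

        for (int i = 0; i < 8; ++i)
            std::cout << a[i] << " " << b[i] << "\n";

        delete[] a;
        delete[] b;
    }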

So I'm gonna do more bug hunting ...

ykempf commented 9 years ago

@Arto, do you have jemalloc on your laptop? I have now compiled without jemalloc on Voima, and I get a similar bug as on my laptop (where I don't use jemalloc either).

galfthan commented 9 years ago

@ykempf, where is your data?

ykempf commented 9 years ago

/lustre/tmp/yann/debug/test_trans