enzo-project / enzo-e

A version of Enzo designed for exascale and built on charm++.

Circleci cosmo-bcg Timeout Issue in VLCT PR #61

Closed: mabruzzo closed this issue 3 years ago

mabruzzo commented 4 years ago

This documents a Circleci problem that initially manifested for PR #9 after merging in changes from PR #12. The issue first appeared in https://github.com/enzo-project/enzo-e/pull/9/commits/43341c9a140446cac153039b73f98a1ffe7e41f8, which was the first commit pushed to GitHub following the merge.

To briefly summarize the problem: the cosmo-bcg test, added by PR #12, failed with a timeout when run on Circleci. However, neither @jobordner nor I were able to replicate the issue; the test passed for both of us when run locally. @jobordner recently found a work-around for this issue in commit 86b76931f33a52e61718982e2ac52811931752d8.

This issue is largely being opened to archive the discussion that @jobordner and I had about it over Slack.

mabruzzo commented 4 years ago

On October 22nd, @jobordner wrote:

It looks like one of the tests was taking too long so circleci timed out. I'm looking into it.

mabruzzo commented 4 years ago

On October 23rd, @mabruzzo wrote:

@jobordner Do you have any thoughts on why the test takes so long on the testing server? I realized the timeout was a problem a couple of weeks back (right before our last telecon), but I got distracted and forgot to come back to it. When I run the test on my local machine, it finishes in a timely manner. I find it really hard to think of anything I changed that could have caused this problem. The only candidate I can think of is my tweak to the FieldFace::field_list_src_ and FieldFace::field_list_dst_ methods to address the handling of temporary fields, but I find it hard to imagine that this is really the source of our issues. I'll look into this more on Monday.

mabruzzo commented 4 years ago

On October 23rd, @jobordner wrote:

@mabruzzo I've been having difficulty replicating it as well. I reran it in circleci while logged in so I could view the output files, and it just seems to slow down when the mesh starts refining (the problem test is test_cosmo-bcg.unit). It may not be refresh-related; anything slowing the problem down enough that it times out after 10 minutes would produce the same result.

mabruzzo commented 4 years ago

On October 23rd, @mabruzzo wrote:

I'm not particularly familiar with the refinement machinery. Is the slowdown confined to the refinement step within a single cycle, or is it spread over multiple cycles? (I guess I'm asking whether the whole code is slower or whether I somehow slowed down the refinement machinery.)

mabruzzo commented 4 years ago

On October 26th, @jobordner wrote:

@mabruzzo It seems to be related to refinement, and I've only been able to verify it on circleci. Attached are times for the cosmo-bcg test on circleci for the current master and the vlct branch; refinement starts at around cycle 120. [Attachment: vlct-cosmo-bcg timing comparison]

mabruzzo commented 4 years ago

On November 2nd, @mabruzzo wrote:

@jobordner Okay, so I finally spent some time looking into what could be causing this timeout over the weekend.

I sshed into the circleci build server and found that this appears to be some kind of memory issue. Interestingly, while the test times out (it takes more than 10 minutes) during the regular build, within my ssh session I instead get a memory error after 4.5 minutes (when the total amount of allocated memory tries to exceed a gigabyte). For reference, this is the stack traceback:

    [0] Stack Traceback:
      [0:0]  enzo-p 0x55f21a340221
      [0:1]  enzo-p 0x55f21a6739d8 Memory::allocate(unsigned long)
      [0:2]  enzo-p 0x55f21a624213 FieldData::allocate_temporary(FieldDescr const, int)
      [0:3]  enzo-p 0x55f21a48ee3d EnzoSolverBiCgStab::apply(std::shared_ptr, Block)
      [0:4]  enzo-p 0x55f21a43be07 EnzoMethodGravity::compute(Block)
      [0:5]  enzo-p 0x55f21a5f1d90 Block::compute_continue()
      [0:6]  enzo-p 0x55f21a584aec CkIndex_Block::_call_p_compute_continue_void(void, void)
      [0:7]  enzo-p 0x55f21a23b88f CkDeliverMessageFree
      [0:8]  enzo-p 0x55f21a267da6 CkLocRec::invokeEntry(CkMigratable, void, int, bool)
      [0:9]  enzo-p 0x55f21a242589 _processHandler(void, CkCoreState*)
      [0:10] enzo-p 0x55f21a34b44c CsdScheduleForever
      [0:11] enzo-p 0x55f21a34b97d CsdScheduler
      [0:12] enzo-p 0x55f21a348b92 ConverseInit
      [0:13] enzo-p 0x55f21a234359 charm_main
      [0:14] libc.so.6 0x7f0c8f3bf2e1 __libc_start_main
      [0:15] enzo-p 0x55f21a1a6d0a _start

I've also made a plot comparing how the bytes-highest value changes with each cycle for both the vlct_integrator branch and master branch inside the ssh session (Note: It's a little concerning that circleci's memory limit is approached in the normal case). When I run both branches locally on my own machine, the resulting curves seem to exactly match the curve for the master branch on circleci.
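
For reference, here is a minimal sketch of how a plot like this could be generated, assuming the per-cycle bytes-highest values can be scraped from each run's stdout log (the regular expressions and log-file names below are placeholders, not the exact Cello output format):

    # Sketch: plot bytes-highest vs. cycle for two runs. The log format parsed
    # here is an assumption -- the real Cello/Enzo-E monitor output may label
    # the per-cycle "bytes-highest" counter differently.
    import re
    import matplotlib.pyplot as plt

    def bytes_highest_per_cycle(path):
        """Return (cycles, bytes_highest) parsed from a run's stdout log."""
        cycles, highs = [], []
        cycle = None
        cycle_re = re.compile(r'cycle\s+(\d+)')          # placeholder pattern
        mem_re   = re.compile(r'bytes-highest\s+(\d+)')  # placeholder pattern
        with open(path) as f:
            for line in f:
                m = cycle_re.search(line)
                if m:
                    cycle = int(m.group(1))
                m = mem_re.search(line)
                if m and cycle is not None:
                    cycles.append(cycle)
                    highs.append(int(m.group(1)))
        return cycles, highs

    # Placeholder log-file names for the two circleci runs being compared
    for label, log in [('master', 'master_circleci.out'),
                       ('vlct_integrator', 'vlct_circleci.out')]:
        x, y = bytes_highest_per_cycle(log)
        plt.plot(x, y, label=label)
    plt.axhline(1.0e9, linestyle='--', color='gray', label='~1 GB circleci limit')
    plt.xlabel('cycle')
    plt.ylabel('bytes-highest')
    plt.legend()
    plt.savefig('memory_usage.pdf')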

I did some digging to try to work out what could be causing the failing test. To be a little scientific about it, I used gcov and lcov --list to assemble a list of files containing code executed by test_cosmo-bcg.unit, and cross-matched that with the files modified between branches (git diff --name-only master). To compile the code for use with gcov, I used the flags_arch = '-g -fprofile-arcs -ftest-coverage' (I think I additionally specified -O0) and flags_link_charm = '-fprofile-arcs' settings from linux_gnu.py.
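
In other words, the coverage build amounts to settings along these lines (just a sketch following the flags_* convention in linux_gnu.py; everything else in that file is left alone):

    # Coverage-instrumented build settings, in the style of config/linux_gnu.py.
    flags_arch       = '-O0 -g -fprofile-arcs -ftest-coverage'  # no optimization + gcov instrumentation
    flags_link_charm = '-fprofile-arcs'                         # link in the gcov runtime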

Of the cross-matched files, only two contain changes that even slightly modify the behavior on a cycle-by-cycle basis:

  • src/Cello/data_FieldFace.cpp - It doesn't make sense for any of these changes to cause issues. The differences here include my previously mentioned tweak to FieldFace::field_list_dst_ and FieldFace::field_list_src_ to avoid refreshing temporary fields (when all fields are being refreshed). Additionally, I haven't merged the final bugfix commit from solver-dd that fixed the indexes in `FieldFace::copy`.
  • src/Enzo/enzo_EnzoComputePressure.cpp - I just added support for computing pressure when magnetic fields are present. (As an aside, I realized that I had meant to come back and clean this up slightly before the PR is reviewed; I'll get on that.) This shouldn't affect things, especially because the test uses the dual energy formalism.

To check whether these changes could possibly be to blame, I reran a modified version of the vlct_integrator branch in the circleci session (I started with the vlct_integrator branch, but checked out the master-branch versions of the above 2 files). I found no difference in memory usage compared to running the unaltered vlct_integrator branch on circleci.

The other overlapping files from my cross-match include:

  • src/Enzo/enzo-p.cpp and src/Enzo/enzo_EnzoMethodGravity.cpp - I don't think this should make much difference; I literally just moved the following set of lines from the latter to the former:
    #define CK_TEMPLATES_ONLY
    #include "enzo.def.h"
    #undef CK_TEMPLATES_ONLY
  • src/Enzo/enzo_EnzoConfig.cpp, src/Enzo/enzo_EnzoConfig.hpp, and src/Enzo/enzo_EnzoProblem.cpp - These are almost certainly unrelated. My only modifications were to add support for EnzoMethodMHDVlct and some new initializers.
  • src/Enzo/enzo_EnzoReconstructorPLM.hpp, src/Enzo/enzo_EnzoRiemannImpl.hpp - according to lcov, the test only invoked calls to PUPable_decl_template from within each of these files.

I do want to note that gcov/lcov had a little trouble tracing some of the coverage. They claimed some code was executed from a file called <stdout>. There were also some complaints about files called moduleinit102749.gcda and moduleinit102749.gcno, as well as about not being able to open parse.l or parse.y.

Anyway, at this point I feel like I've mostly exhausted the avenues I can explore, and I think I've largely demonstrated that none of my changes directly produced this problem (if you would like, I could write a script to reproduce the steps I took cross-matching git diff with the coverage reports). I'm not really sure what more I can do here, so I would definitely appreciate help with resolving this.
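
For concreteness, a minimal sketch of what that cross-matching script could look like (the coverage.info path, the lcov --list parsing, and the basename matching are simplifying assumptions rather than a record of the exact commands I ran):

    # Sketch: cross-match files covered by test_cosmo-bcg.unit (per lcov) with
    # files changed relative to master (per git diff). The trace-file path and
    # output parsing are assumptions, not the exact commands used.
    import os
    import subprocess

    def covered_files(tracefile='coverage.info'):
        """Basenames of source files listed by `lcov --list` for the captured trace."""
        out = subprocess.run(['lcov', '--list', tracefile],
                             capture_output=True, text=True, check=True).stdout
        names = set()
        for line in out.splitlines():
            first = line.split('|')[0].strip()
            if first.endswith(('.cpp', '.hpp', '.c', '.h', '.F')):
                names.add(os.path.basename(first))
        return names

    def changed_files(base='master'):
        """Paths modified on the current branch relative to `base`."""
        out = subprocess.run(['git', 'diff', '--name-only', base],
                             capture_output=True, text=True, check=True).stdout
        return {os.path.basename(p): p for p in out.splitlines() if p.strip()}

    if __name__ == '__main__':
        covered = covered_files()
        changed = changed_files()
        for name in sorted(set(changed) & covered):
            print('executed by the test AND modified on this branch:', changed[name])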

Is it possible that the charm++ scheduler is somehow affected, and that more asynchronous tasks (e.g. refresh) are being scheduled simultaneously? I don't totally understand the internals of charm++, but if there were a lot of context switching between tasks that allocate memory, could that possibly explain this? (As a shot in the dark, I tried shuffling the link order of object files to see if there were instruction cache misses due to the larger binary size, and found no discernible differences.)

It might also be worth trying to look at what is making the allocations, right? I'm not sure how to approach that. Is that something valgrind could be used for?

Here is the figure mentioned above: memory_usage.pdf

mabruzzo commented 4 years ago

On November 5th, @jobordner wrote:

@mabruzzo I've got a grant deadline coming up so I can't work on this much right now, but I can look at it more next week. I still haven't been able to duplicate it outside of circleci, but it does seem to be repeatable inside that environment. I suspect a compiler bug (g++ or gfortran) that is present in the circleci compiler versions but not in our test setups (I've tried Ubuntu 20.04 and 18.04, though the circleci compiler versions are still older). I've run into compiler optimization issues before, so this wouldn't be unprecedented (you can try setting the default optimization from O3 to O1 to see if that helps). Valgrind didn't show anything, though I haven't tried it within circleci yet.

Another thing to try would be setting memory = 0 in SConstruct to turn off memory tracing, to see if that is messing something up (it overloads new and delete, which could be an issue if circleci is also overloading them for its own use).
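
Concretely, the two experiments amount to build-configuration tweaks along these lines (a sketch only; the variable names follow the SConstruct and config/linux_gnu.py conventions mentioned in this thread, and the actual defaults in the repository may differ):

    # Sketch of the two suggested build-configuration experiments.
    # Variable names follow the SConstruct / config/linux_gnu.py conventions
    # quoted in this thread; the actual defaults in the repository may differ.

    # 1) Lower the default optimization level to test the compiler-bug hypothesis
    flags_arch = '-O1 -g'   # instead of the usual -O3

    # 2) Turn off Cello's memory tracing, which overloads operator new/delete
    memory = 0              # set in SConstruct; nonzero enables tracing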

mabruzzo commented 4 years ago

@jobordner, I guess the fact that you bypassed this issue in https://github.com/enzo-project/enzo-e/pull/9/commits/86b76931f33a52e61718982e2ac52811931752d8, by changing the default optimization from O3 to O1, means that this was a compiler bug.

I think this probably resolves the issue, but I'll let you have the last word on whether we should close the issue.

PS: Apologies for mentioning you so many times while moving the discussion thread to GitHub. In hindsight, I probably should have recorded things slightly differently.

jobordner commented 4 years ago

@mabruzzo Although I still don't know what specific code is triggering the compiler optimization error, I agree that this issue can be closed. I don't think we should change O3 to O1 to fix it, however; I did that mainly to test a hypothesis. I think a better approach would be to configure circleci to use a newer environment than the one it is currently using. I think that just involves replacing the image: circleci/python:3.7.2 lines in .circleci/config.yml with something else. I can continue to root out the source of the problem when I free up some time.

For the sake of further documenting the issue, I've found it's reproducible in Ubuntu 16.04, and the problem is the introduction of gross errors in accelerations along block boundaries starting as early as cycle 1. The increase in memory use does not appear to be due to a memory leak, but rather to the errors triggering adapt to create more refined grids sooner.

mabruzzo commented 4 years ago

Ok, sounds good. I'm going to hold off momentarily on closing it until we agree on how best to fix it.

Looking at the documentation, circleci has started to create newer images based on the most recent LTS release of Ubuntu (although it's somewhat unclear whether they are describing Ubuntu 18.04 or 20.04).

If I understand everything correctly, I think replacing image: circleci/python:3.7.2 with image: cimg/python:3.7.2 will use either Ubuntu 18.04 or 20.04. When I have a chance in the next few days, I'll try this out.

Since this is Circleci-related, I'm tagging @brittonsmith just in case he has anything to add.

jobordner commented 4 years ago

Additionally, the 3D flux correction tests crash in double precision on the main enzo-project/enzo-e master branch, but they don't get flagged as errors. This is also repeatable in Ubuntu 16.04, and they likewise run correctly when O3 optimization is lowered to O1. Single precision works ok (as does cosmo-bcg, incidentally). I suspect addressing this cosmo-bcg issue will address the failing 3D flux tests as well. (That these crashes don't cause the regression tests to fail is a separate issue that needs to be addressed; a sketch of one way to flag them follows the test output below.)

Tail of current enzo-project/enzo-e double-precision test results:

17:58:57 linux_gnu double       FAIL:  0      Incomplete: 15 Pass: 8817          
   CRASH: test/test_method_flux3-xm.unit
   CRASH: test/test_method_flux3-xp.unit
   CRASH: test/test_method_flux3-ym.unit
   CRASH: test/test_method_flux3-yp.unit
   CRASH: test/test_method_flux3-zm.unit
   CRASH: test/test_method_flux3-zp.unit
END   Enzo-P/Cello ./build.sh: arch = linux_gnu  prec = double  target = test time = 37.36 min
Test run summary

   Test runs attempted: 109
   Test runs started:   109
   Test runs completed: 103
   --------------------
   *** incomplete output: test/test_method_flux3-xm.unit
   *** incomplete output: test/test_method_flux3-xp.unit
   *** incomplete output: test/test_method_flux3-ym.unit
   *** incomplete output: test/test_method_flux3-yp.unit
   *** incomplete output: test/test_method_flux3-zm.unit
   *** incomplete output: test/test_method_flux3-zp.unit
   --------------------

Exiting testing with success
Done.
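
For reference, a minimal sketch of one way to flag these as failures, assuming only the summary format shown above (and not any existing Enzo-E tooling): a small post-processing check that turns the CRASH / incomplete-output lines into a nonzero exit status.

    # Sketch: scan a build.sh test log for CRASH / "incomplete output" lines and
    # exit nonzero if any are found, so crashes like the ones above fail the run.
    # Assumes only the summary format shown above, not any existing Enzo-E tooling.
    import sys

    def find_failures(path):
        """Collect lines that indicate a crashed or incomplete test run."""
        bad = []
        with open(path) as f:
            for line in f:
                if 'CRASH:' in line or 'incomplete output:' in line:
                    bad.append(line.strip())
        return bad

    if __name__ == '__main__':
        failures = find_failures(sys.argv[1] if len(sys.argv) > 1 else 'test.out')
        for line in failures:
            print(line)
        sys.exit(1 if failures else 0)
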
mabruzzo commented 3 years ago

I have now adopted the workaround we discussed in PR #62.