GridOPTICS / GridPACK

https://www.gridpack.org/

Dynamic simulation memory corruption #175

Closed. wperkins closed this issue 1 year ago.

wperkins commented 1 year ago

GridPACK dynamic simulation apps (dsf.x, wind.x, etc.) seem to have a memory corruption problem. This manifests in two ways. First, the simulation completes, but the OS (Ubuntu 20, in my case) reports a memory corruption error, as described here. Second, when dsf.x is built Release, a SEGV is reported at an odd place, as described here.

I think fixing this is key to #164 and #173.

bjpalmer commented 1 year ago

Do you have a specific input file that shows this error? And this is the dsf.x code in applications/dynamic_simulation_full_y?

wperkins commented 1 year ago

Using the fix/testing branch, I consistently see these failures with the 145-bus and 240-bus cases.

bjpalmer commented 1 year ago

I'm not seeing any problems with the 145 bus case on constance using the progress ranks runtime. Are you using the two-sided runtime? Is there an input file for the 200 bus case (the closest I see is input_240bus.xml) or did you create your own input?

wperkins commented 1 year ago

> I'm not seeing any problems with the 145 bus case on constance using the progress ranks runtime. Are you using the two-sided runtime? Is there an input file for the 200 bus case (the closest I see is input_240bus.xml) or did you create your own input?

Debug or Release? These cases mostly run if GridPACK is built Debug; I get seemingly random memory errors at exit on my Ubuntu system. RHEL may not report such errors. The problem changes when GridPACK is built Release. See this unit test summary. In our previous conversation, you were seeing exactly the same problem I was seeing with the 240-bus case and a Release build.

wperkins commented 1 year ago

> Using the fix/testing branch, I consistently see these failures with the 145-bus and 240-bus cases.

It's 240-bus, not 200.

abhyshr commented 1 year ago

For the 145-bus case, the memory error shows up when a matrix is not freed cleanly. Here's the backtrace.

```
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0xffffffff)
  * frame #0: 0x00000001a9c3a9e8 libsystem_malloc.dylib`tiny_free_no_lock + 1860
    frame #1: 0x00000001a9c3a120 libsystem_malloc.dylib`free_tiny + 496
    frame #2: 0x00000001011e1260 libmpi.40.dylib`mca_coll_base_comm_unselect + 10100
    frame #3: 0x00000001011472e0 libmpi.40.dylib`ompi_comm_destruct + 36
    frame #4: 0x0000000101149208 libmpi.40.dylib`ompi_comm_free + 508
    frame #5: 0x0000000101176c64 libmpi.40.dylib`MPI_Comm_free + 168
    frame #6: 0x00000001020bc51c libpetsc.3.020.dylib`Petsc_Counter_Attr_Delete_Fn(comm=<unavailable>, keyval=<unavailable>, count_val=0x000060000026ad80, extra_state=<unavailable>) at pinit.c:361:5 [opt]
    frame #7: 0x00000001011464a8 libmpi.40.dylib`ompi_attr_delete_impl + 612
    frame #8: 0x0000000101146844 libmpi.40.dylib`ompi_attr_delete_all + 232
    frame #9: 0x0000000101149048 libmpi.40.dylib`ompi_comm_free + 60
    frame #10: 0x0000000101176c64 libmpi.40.dylib`MPI_Comm_free + 168
    frame #11: 0x00000001020c5c04 libpetsc.3.020.dylib`PetscCommDestroy(comm=0x0000000104853018) at tagm.c:331:5 [opt]
    frame #12: 0x000000010209c718 libpetsc.3.020.dylib`PetscHeaderDestroy_Private(obj=0x0000000104853000, clear_for_reuse=PETSC_FALSE) at inherit.c:158:5 [opt]
    frame #13: 0x000000010209c39c libpetsc.3.020.dylib`PetscHeaderDestroy_Function(h=0x000060000026ad28) at inherit.c:93:3 [opt]
    frame #14: 0x000000010224d528 libpetsc.3.020.dylib`MatDestroy(A=0x000060000026ad28) at matrix.c:1418:3 [opt]
    frame #15: 0x00000001002f3dbc dsf.x`gridpack::math::PetscMatrixWrapper::~PetscMatrixWrapper(this=0x000060000026ad20) at petsc_matrix_wrapper.cpp:122:16
    frame #16: 0x00000001002f3e70 dsf.x`gridpack::math::PetscMatrixWrapper::~PetscMatrixWrapper(this=0x000060000026ad20) at petsc_matrix_wrapper.cpp:115:1
    frame #17: 0x00000001002e8f4c dsf.x`void boost::checked_delete<gridpack::math::PetscMatrixWrapper>(x=0x000060000026ad20) at checked_delete.hpp:36:5
    frame #18: 0x00000001002e8f0c dsf.x`boost::scoped_ptr<gridpack::math::PetscMatrixWrapper>::~scoped_ptr(this=0x00006000018772b0) at scoped_ptr.hpp:88:9
    frame #19: 0x00000001002e8edc dsf.x`boost::scoped_ptr<gridpack::math::PetscMatrixWrapper>::~scoped_ptr(this=0x00006000018772b0) at scoped_ptr.hpp:84:5
    frame #20: 0x00000001002e8e78 dsf.x`gridpack::math::PETScMatrixImplementation<std::__1::complex<double>, int>::~PETScMatrixImplementation(this=0x0000600001877280) at petsc_matrix_implementation.hpp:123:3
    frame #21: 0x00000001002e7048 dsf.x`gridpack::math::PETScMatrixImplementation<std::__1::complex<double>, int>::~PETScMatrixImplementation(this=0x0000600001877280) at petsc_matrix_implementation.hpp:122:3
    frame #22: 0x00000001002e7074 dsf.x`gridpack::math::PETScMatrixImplementation<std::__1::complex<double>, int>::~PETScMatrixImplementation(this=0x0000600001877280) at petsc_matrix_implementation.hpp:122:3
    frame #23: 0x00000001001648a4 dsf.x`void boost::checked_delete<gridpack::math::MatrixImplementation<std::__1::complex<double>, int>>(x=0x0000600001877280) at checked_delete.hpp:36:5
    frame #24: 0x000000010016485c dsf.x`boost::scoped_ptr<gridpack::math::MatrixImplementation<std::__1::complex<double>, int>>::~scoped_ptr(this=0x000060000026ad18) at scoped_ptr.hpp:88:9
    frame #25: 0x0000000100163cd0 dsf.x`boost::scoped_ptr<gridpack::math::MatrixImplementation<std::__1::complex<double>, int>>::~scoped_ptr(this=0x000060000026ad18) at scoped_ptr.hpp:84:5
    frame #26: 0x0000000100164900 dsf.x`gridpack::math::MatrixT<std::__1::complex<double>, int>::~MatrixT(this=0x000060000026ad00) at matrix.hpp:137:3
    frame #27: 0x0000000100163d28 dsf.x`gridpack::math::MatrixT<std::__1::complex<double>, int>::~MatrixT(this=0x000060000026ad00) at matrix.hpp:136:3
    frame #28: 0x0000000100163d54 dsf.x`gridpack::math::MatrixT<std::__1::complex<double>, int>::~MatrixT(this=0x000060000026ad00) at matrix.hpp:136:3
    frame #29: 0x00000001001650ec dsf.x`void boost::checked_delete<gridpack::math::MatrixT<std::__1::complex<double>, int>>(x=0x000060000026ad00) at checked_delete.hpp:36:5
    frame #30: 0x00000001001651ec dsf.x`boost::detail::sp_counted_impl_p<gridpack::math::MatrixT<std::__1::complex<double>, int>>::dispose(this=0x00006000002358c0) at sp_counted_impl.hpp:89:9
    frame #31: 0x000000010000acac dsf.x`boost::detail::sp_counted_base::release(this=0x00006000002358c0) at sp_counted_base_gcc_atomic.hpp:120:13
    frame #32: 0x000000010000ac58 dsf.x`boost::detail::shared_count::~shared_count(this=0x000000016fdfee58) at shared_count.hpp:432:29
    frame #33: 0x000000010000ac08 dsf.x`boost::detail::shared_count::~shared_count(this=0x000000016fdfee58) at shared_count.hpp:431:5
    frame #34: 0x0000000100049ba0 dsf.x`boost::shared_ptr<gridpack::math::MatrixT<std::__1::complex<double>, int>>::~shared_ptr(this=0x000000016fdfee50) at shared_ptr.hpp:335:25
    frame #35: 0x000000010001e098 dsf.x`boost::shared_ptr<gridpack::math::MatrixT<std::__1::complex<double>, int>>::~shared_ptr(this=0x000000016fdfee50) at shared_ptr.hpp:335:25
    frame #36: 0x000000010001f350 dsf.x`gridpack::dynamic_simulation::DSFullApp::~DSFullApp(this=0x000000016fdfe8e0) at dsf_app_module.cpp:101:1
    frame #37: 0x0000000100020134 dsf.x`gridpack::dynamic_simulation::DSFullApp::~DSFullApp(this=0x000000016fdfe8e0) at dsf_app_module.cpp:100:1
    frame #38: 0x000000010000a050 dsf.x`main(argc=2, argv=0x000000016fdff5c0) at dsf_main.cpp:124:3
    frame #39: 0x00000001a9abbf28 dyld`start + 2236
```

abhyshr commented 1 year ago

I get the same error for the 240-bus system as well.

abhyshr commented 1 year ago

Pushed a fix to the fix/testing branch that resolves the memory corruption issue in release mode. The issue is that the getAngle() method is broken. First, it is NOT defined for the newly added models such as the REGCA1 generator model. It is declared as a virtual method in the base generator class, so even if REGCA1 does not define getAngle(), the base class method should be picked up; that does not happen in release mode. Second, it is also not correctly implemented, since it only returns the angle of the first generator at a bus. I think the getAngle() method is used in a number of locations, so this should be fixed.
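
For readers following along, here is a minimal, self-contained C++ sketch of the dispatch behavior described above; the class and method names are illustrative stand-ins, not GridPACK's actual generator API:

```cpp
#include <iostream>

// Hypothetical stand-ins for the base generator model and a REGCA1-style
// model; the names and signatures are illustrative, not GridPACK's API.
class BaseGeneratorModel {
public:
  virtual ~BaseGeneratorModel() {}
  // Declared virtual with a default body: derived models that do not
  // override it are supposed to fall back to this implementation.
  virtual double getAngle() const { return 0.0; }
};

class Regca1Model : public BaseGeneratorModel {
  // No getAngle() override here, so a call through a base pointer should
  // dispatch to BaseGeneratorModel::getAngle().
};

int main() {
  BaseGeneratorModel *gen = new Regca1Model();
  std::cout << "angle = " << gen->getAngle() << "\n";  // prints "angle = 0"
  delete gen;
  return 0;
}
```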

To fix the issue, I have turned off the securityCheck method called in the dynamic simulation application, which is what calls the getAngle() method. In my opinion, securityCheck should be OFF by default and only called when the user explicitly requests it through an option.
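
A minimal sketch of the opt-in behavior proposed here, using a hypothetical flag name; how the flag would actually be read from the input file is not shown and would depend on GridPACK's configuration API:

```cpp
#include <iostream>

// Hypothetical stand-in for the dynamic simulation application object.
struct DemoApp {
  void securityCheck() { std::cout << "running security check\n"; }
};

int main() {
  DemoApp app;
  // Opt-in flag that defaults to false; in practice this would be read
  // from the application's input file (the name is purely illustrative).
  bool enableSecurityCheck = false;
  if (enableSecurityCheck) {
    app.securityCheck();  // getAngle() would only be exercised on request
  }
  return 0;
}
```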

@wperkins: Can you please retest and see if you get the same error?

wperkins commented 1 year ago

> @wperkins: Can you please retest and see if you get the same error?

I'll check it out. Thanks.

wperkins commented 1 year ago

> For the 145-bus case, the memory error shows up when a matrix is not freed cleanly. Here's the backtrace.

@abhyshr, on your Mac, did you build Debug or Release, and what PETSc version did you use? I'd like to try building on my Mac. Thanks.

abhyshr commented 1 year ago

Release version. Used PETSc 3.20.

wperkins commented 1 year ago

> Pushed a fix to the fix/testing branch that resolves the memory corruption issue in release mode. [...]
>
> @wperkins: Can you please retest and see if you get the same error?

These changes fixed the 240-bus smoke tests for me. The only test failing for me now is the parallel 145-bus case. I'll try to follow @abhyshr's clue above on my Mac.

wperkins commented 1 year ago

> Pushed a fix to the fix/testing branch that resolves the memory corruption issue in release mode. [...]
>
> @wperkins: Can you please retest and see if you get the same error?

> These changes fixed the 240-bus smoke tests for me. The only test failing for me now is the parallel 145-bus case. I'll try to follow @abhyshr's clue above on my Mac.

Currently, on Ubuntu with fix/testing, the 145-bus dynamic simulation smoke test fails for me seemingly at random, with both Debug and Release builds and with PETSc built for complex or real scalars. I can make the 145-bus case pass if I use the MUMPS solver instead of SuperLU_dist. This kind of thing may have been observed by other PETSc users. Using MUMPS as the GridPACK unit test parallel direct solver is supposed to happen automatically if PETSc is built without SuperLU_dist (and with MUMPS, of course).
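
For reference, these are the standard PETSc runtime options for selecting MUMPS (rather than SuperLU_dist) as the parallel direct solver; whether and how they map onto the GridPACK input file is an assumption here and depends on how the solver options are passed through to PETSc:

```
-ksp_type preonly
-pc_type lu
-pc_factor_mat_solver_type mumps
```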

wperkins commented 1 year ago

I think this has been addressed as best we can with #164. I'm sure we will run into it again, though. @jacksavage, I suggest that any CI (#173) use PETSc with MUMPS and without SuperLU_dist.
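
As a starting point for that CI configuration, a hedged sketch of a PETSc configure invocation that downloads MUMPS (plus its ScaLAPACK dependency) and omits SuperLU_dist; the exact set of flags will depend on the CI environment and on what else GridPACK needs from PETSc there:

```sh
# Build MUMPS (and its ScaLAPACK dependency) through PETSc's configure and
# leave SuperLU_dist out entirely; add or remove packages as the CI needs.
./configure \
  --with-mpi=1 \
  --download-metis --download-parmetis \
  --download-scalapack --download-mumps \
  --with-debugging=0
```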