geodynamics / pylith

PyLith is a finite element code for the solution of dynamic and quasi-static tectonic deformation problems.
Other
154 stars 98 forks source link

Output results in accumulation of memory use #280

Closed baagaard-usgs closed 3 years ago

baagaard-usgs commented 3 years ago

Our current implementation of output (specifically, extracting a subfield from a field with many subfields) results in accumulation of memory usage.

The appropriate solution is to clean up the output implementation.

  1. Get the local vector from the entire field.
  2. Follow the steps in Lines 133-146 of PETSc plexhdf5.c to create a global vector without constraints.
  3. Use DMCreateSubDM() to create a PetscDM and a PetscIS for the subfield we want to extract.
  4. Use VecISCopy() to get a copy of the vector for the subfield. By getting a copy we can dimensionalize it, project it, etc.
baagaard-usgs commented 3 years ago

TODO

baagaard-usgs commented 3 years ago

Resolved: Memory errors when using FieldFilterProject on boundary.

Getting the solution at a point results in NaN values. Valgrind reports invalid reads. The coefficients array at line 215 in plexproject.c from DMPlexVecGetClosure() has the NaN values at some locations in the array. I don't know if this is a PETSc error or a problem setting up the PetscDM and PetscVec objects for output.

Resolution

The error was the discretization for the domain was used for the boundary. We need to set the dimension of the discretization to the dimension of the boundary. The proper fix is to project from the domain to the desired basis order for the boundary.

How to reproduce

make install
cd tests/fullscale/linearelasticity/nofaults-2d
valgrind --trace-children=yes --suppressions=/opt/pylith/src/pylith/share/valgrind-python.supp pylith gravity.cfg gravity_quad.cfg
==8564== Invalid read of size 8
==8564==    at 0x87A9733: PetscSpaceEvaluate_Polynomial (spacepoly.c:217)
==8564==    by 0x87B149A: PetscSpaceEvaluate (space.c:514)
==8564==    by 0x88627DD: PetscFECreateTabulation_Basic (febasic.c:149)
==8564==    by 0x8874FF0: PetscFECreateTabulation (fe.c:986)
==8564==    by 0x8B62655: DMProjectLocal_Generic_Plex (plexproject.c:659)
==8564==    by 0x8B64EFE: DMProjectFieldLocal_Plex (plexproject.c:869)
==8564==    by 0x8C23D52: DMProjectFieldLocal (dm.c:8753)
==8564==    by 0x8FDE671: DMProjectField (dmproject.c:237)
==8564==    by 0xA410C28: pylith::meshio::FieldFilterProject::apply(_p_Vec**, _p_DM*, _p_Vec*) const (FieldFilterProject.cc:110)
==8564==    by 0xA45F248: pylith::meshio::OutputSubfield::extract(_p_Vec* const&) (OutputSubfield.cc:124)
==8564==    by 0xA477E19: pylith::meshio::OutputPhysics::_writeDataStep(double, int, pylith::topology::Field const&) (OutputPhysics.cc:342)
==8564==    by 0xA473BAC: pylith::meshio::OutputPhysics::update(double, int, pylith::topology::Field const&, bool) (OutputPhysics.cc:195)
==8564==  Address 0x186dac50 is 0 bytes after a block of size 16 alloc'd
==8564==    at 0x483E340: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==8564==    by 0x7F93CD0: PetscMallocAlign (mal.c:54)
==8564==    by 0x7F94E1A: PetscMallocA (mal.c:423)
==8564==    by 0x8B5D52F: PetscDualSpaceGetAllPointsUnion (plexproject.c:445)
==8564==    by 0x8B623EE: DMProjectLocal_Generic_Plex (plexproject.c:650)
==8564==    by 0x8B64EFE: DMProjectFieldLocal_Plex (plexproject.c:869)
==8564==    by 0x8C23D52: DMProjectFieldLocal (dm.c:8753)
==8564==    by 0x8FDE671: DMProjectField (dmproject.c:237)
==8564==    by 0xA410C28: pylith::meshio::FieldFilterProject::apply(_p_Vec**, _p_DM*, _p_Vec*) const (FieldFilterProject.cc:110)
==8564==    by 0xA45F248: pylith::meshio::OutputSubfield::extract(_p_Vec* const&) (OutputSubfield.cc:124)
==8564==    by 0xA477E19: pylith::meshio::OutputPhysics::_writeDataStep(double, int, pylith::topology::Field const&) (OutputPhysics.cc:342)
==8564==    by 0xA473BAC: pylith::meshio::OutputPhysics::update(double, int, pylith::topology::Field const&, bool) (OutputPhysics.cc:195)
==8564==
==8564== Invalid read of size 8
==8564==    at 0x887FFE8: PetscFEEvaluateFieldJets_Internal (fe.c:2051)
==8564==    by 0x8B5F6A4: DMProjectPoint_Field_Private (plexproject.c:267)
==8564==    by 0x8B5F6A4: DMProjectPoint_Private (plexproject.c:411)
==8564==    by 0x8B641CC: DMProjectLocal_Generic_Plex (plexproject.c:794)
==8564==    by 0x8B64EFE: DMProjectFieldLocal_Plex (plexproject.c:869)
==8564==    by 0x8C23D52: DMProjectFieldLocal (dm.c:8753)
==8564==    by 0x8FDE671: DMProjectField (dmproject.c:237)
==8564==    by 0xA410C28: pylith::meshio::FieldFilterProject::apply(_p_Vec**, _p_DM*, _p_Vec*) const (FieldFilterProject.cc:110)
==8564==    by 0xA45F248: pylith::meshio::OutputSubfield::extract(_p_Vec* const&) (OutputSubfield.cc:124)
==8564==    by 0xA477E19: pylith::meshio::OutputPhysics::_writeDataStep(double, int, pylith::topology::Field const&) (OutputPhysics.cc:342)
==8564==    by 0xA473BAC: pylith::meshio::OutputPhysics::update(double, int, pylith::topology::Field const&, bool) (OutputPhysics.cc:195)
==8564==    by 0xA4BC8EE: pylith::problems::ObserversPhysics::notifyObservers(double, int, pylith::topology::Field const&, bool) (ObserversPhysics.cc:152)
==8564==    by 0xA2DA4DB: pylith::feassemble::PhysicsImplementation::notifyObservers(double, int, pylith::topology::Field const&) (PhysicsImplementation.cc:95)
==8564==  Address 0x18982270 is 48 bytes inside a block of size 168 alloc'd
==8564==    at 0x483E340: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==8564==    by 0x7F93CD0: PetscMallocAlign (mal.c:54)
==8564==    by 0x8BF3B7A: DMGetWorkArray (dm.c:1484)
==8564==    by 0x89B285F: DMPlexGetTransitiveClosure (plex.c:2525)
==8564==    by 0x89BF651: DMPlexGetCompressedClosure (plex.c:4792)
==8564==    by 0x89C000E: DMPlexVecGetClosure (plex.c:4979)
==8564==    by 0x8B5E8F8: DMProjectPoint_Field_Private (plexproject.c:215)
==8564==    by 0x8B5E8F8: DMProjectPoint_Private (plexproject.c:411)
==8564==    by 0x8B641CC: DMProjectLocal_Generic_Plex (plexproject.c:794)
==8564==    by 0x8B64EFE: DMProjectFieldLocal_Plex (plexproject.c:869)
==8564==    by 0x8C23D52: DMProjectFieldLocal (dm.c:8753)
==8564==    by 0x8FDE671: DMProjectField (dmproject.c:237)
==8564==    by 0xA410C28: pylith::meshio::FieldFilterProject::apply(_p_Vec**, _p_DM*, _p_Vec*) const (FieldFilterProject.cc:110)
==8564==
==8564== Invalid read of size 8
==8564==    at 0x888001A: PetscFEEvaluateFieldJets_Internal (fe.c:2052)
==8564==    by 0x8B5F6A4: DMProjectPoint_Field_Private (plexproject.c:267)
==8564==    by 0x8B5F6A4: DMProjectPoint_Private (plexproject.c:411)
==8564==    by 0x8B641CC: DMProjectLocal_Generic_Plex (plexproject.c:794)
==8564==    by 0x8B64EFE: DMProjectFieldLocal_Plex (plexproject.c:869)
==8564==    by 0x8C23D52: DMProjectFieldLocal (dm.c:8753)
==8564==    by 0x8FDE671: DMProjectField (dmproject.c:237)
==8564==    by 0xA410C28: pylith::meshio::FieldFilterProject::apply(_p_Vec**, _p_DM*, _p_Vec*) const (FieldFilterProject.cc:110)
==8564==    by 0xA45F248: pylith::meshio::OutputSubfield::extract(_p_Vec* const&) (OutputSubfield.cc:124)
==8564==    by 0xA477E19: pylith::meshio::OutputPhysics::_writeDataStep(double, int, pylith::topology::Field const&) (OutputPhysics.cc:342)
==8564==    by 0xA473BAC: pylith::meshio::OutputPhysics::update(double, int, pylith::topology::Field const&, bool) (OutputPhysics.cc:195)
==8564==    by 0xA4BC8EE: pylith::problems::ObserversPhysics::notifyObservers(double, int, pylith::topology::Field const&, bool) (ObserversPhysics.cc:152)
==8564==    by 0xA2DA4DB: pylith::feassemble::PhysicsImplementation::notifyObservers(double, int, pylith::topology::Field const&) (PhysicsImplementation.cc:95)
==8564==  Address 0x18982270 is 48 bytes inside a block of size 168 alloc'd
==8564==    at 0x483E340: memalign (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==8564==    by 0x7F93CD0: PetscMallocAlign (mal.c:54)
==8564==    by 0x8BF3B7A: DMGetWorkArray (dm.c:1484)
==8564==    by 0x89B285F: DMPlexGetTransitiveClosure (plex.c:2525)
==8564==    by 0x89BF651: DMPlexGetCompressedClosure (plex.c:4792)
==8564==    by 0x89C000E: DMPlexVecGetClosure (plex.c:4979)
==8564==    by 0x8B5E8F8: DMProjectPoint_Field_Private (plexproject.c:215)
==8564==    by 0x8B5E8F8: DMProjectPoint_Private (plexproject.c:411)
==8564==    by 0x8B641CC: DMProjectLocal_Generic_Plex (plexproject.c:794)
==8564==    by 0x8B64EFE: DMProjectFieldLocal_Plex (plexproject.c:869)
==8564==    by 0x8C23D52: DMProjectFieldLocal (dm.c:8753)
==8564==    by 0x8FDE671: DMProjectField (dmproject.c:237)
==8564==    by 0xA410C28: pylith::meshio::FieldFilterProject::apply(_p_Vec**, _p_DM*, _p_Vec*) const (FieldFilterProject.cc:110)
baagaard-usgs commented 3 years ago

Resolved: PETSc memory error in fault MMS test

It looks like DMCreateDS() on the output DM in the case of a fault leads to a double free of the DS for hybrid cells. My guess is that there is a copy of the hybrid DS in the output DM without incrementing the reference counter.

Resolution

Delete DMCreateDS() after DMGetOutputDM() in Field::createOutputVector(). DS already exists and DMCreateDS() cannot be called multiple times.

How to reproduce

make install -C libsrc
make check -C tests/mmstests/faults

pylith::mmstests::TestFaultKin2D_RigidBlocksStatic_TriP1::testDiscretization [0]PETSC ERROR: --------------------- Error Message -------------------------------------------------------------- [0]PETSC ERROR: Corrupt argument: https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind [0]PETSC ERROR: Object already free: Parameter # 1 [0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting. [0]PETSC ERROR: Petsc Development GIT revision: v3.15.0-487-g55766101d4d GIT Date: 2021-05-13 10:56:00 -0400 [0]PETSC ERROR: .libs/test_faultkin on a arch-clang-3.6.0-py38_debug named igskcicgltgm067 by baagaard Wed May 26 21:17:54 2021 [0]PETSC ERROR: Configure options --PETSC_ARCH=arch-clang-3.6.0-py38_debug --with-debugging=1 --with-clanguage=c --with-mpi-compilers=1 --with-shared-libraries=1 --with-64-bit-points=1 --with-large-file-io=1 --with-lgrind=0 --download-chaco=1 --download-parmetis=1 --download-metis=1 --download-triangle --with-blas-lib=/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libBLAS.dylib --with-lapack-lib=/System/Library/Frameworks/Accelerate.framework/Frameworks/vecLib.framework/Versions/Current/libLAPACK.dylib --download-ml=1 --download-superlu=1 --with-fc=0 --with-hdf5=1 --with-hdf5-include=/Users/baagaard/tools/unix/hdf5-1.12.0/clang-3.6.0/include --with-hdf5-lib=/Users/baagaard/tools/unix/hdf5-1.12.0/clang-3.6.0/lib/libhdf5.dylib --with-zlib=1 [0]PETSC ERROR: #1 PetscDSGetHybrid() at /Users/baagaard/tools/unix/petsc-dev/src/dm/dt/interface/dtds.c:732 [0]PETSC ERROR: #2 DMCreateDS() at /Users/baagaard/tools/unix/petsc-dev/src/dm/interface/dm.c:5658 [0]PETSC ERROR: #3 void pylith::topology::Field::createOutputVector()() at /Users/baagaard/src/cig/pylith/libsrc/pylith/topology/Field.cc:601 : error[0]PETSC ERROR: #4 PetscDSDestroy() at /Users/baagaard/tools/unix/petsc-dev/src/dm/dt/interface/dtds.c:540 [0]PETSC ERROR: #5 DMClearDS() at /Users/baagaard/tools/unix/petsc-dev/src/dm/interface/dm.c:5198 [0]PETSC ERROR: #6 DMDestroy() at /Users/baagaard/tools/unix/petsc-dev/src/dm/interface/dm.c:770 [0]PETSC ERROR: #7 DMDestroy() at /Users/baagaard/tools/unix/petsc-dev/src/dm/interface/dm.c:771 [0]PETSC ERROR: #8 void pylith::topology::Field::deallocate()() at /Users/baagaard/src/cig/pylith/libsrc/pylith/topology/Field.cc:149

baagaard-usgs commented 3 years ago

Q2 boundary to Q1 boundary projection fails

Projecting the solution from Q2 on the boundary to Q1 on the boundary fails. P2 to P1 works, so we suspect this is specific to projection with Q2 and Q1. Note this involves DMProjectField().

How to reproduce

PETSc branch: knepley/pylith PyLith branch: baagaard/feature-output-subfield

make install
cd tests/fullscale/linearelasticity/nofaults-2d
nemesis TestGravity.py
test_bcdirichlet_info (__main__.TestQuad) ... ok
test_bcdirichlet_solution (__main__.TestQuad) ... Error in component 1 of field 'displacement' at time step 0.
Total # not okay: 9
Ratio (maskR), # not okay: 8
Expected values (not okay): [-11.03248134 -11.20760009 -10.50712509  -9.63153133  -8.40570007
  -6.82963131  -4.90332504  -2.62678127]
Computed values (not okay): [-11.1638204  -10.8135829  -10.1131079   -9.06239539  -7.66144538
  -5.91025786  -3.80883284  -1.35717032]
Ratio (not okay): [0.01190476 0.03515625 0.0375     0.05909091 0.08854167 0.13461538
 0.22321429 0.48333333]
Ratio Coordinates (not okay): [[-4000.  3000.]
 [-4000.  4000.]
 [-4000.  2000.]
 [-4000.  1000.]
 [-4000.     0.]
 [-4000. -1000.]
 [-4000. -2000.]
 [-4000. -3000.]]
Tolerance Absolute Mask: 0.1
Ratio Tolerance: 1e-05
Relative Diff (maskD), # not okay: 1
Expected values (not okay): [0.]
Computed values (not okay): [-11.03248134]
Relative diff (not okay): [1.35483871]
Relative diff Coordinates (not okay): [[-4000. -4000.]]
Diff Tolerance: 0.000010
Scale: 8.1430e+00
FAIL
test_domain_solution (__main__.TestQuad) ... ok
test_material_info (__main__.TestQuad) ... ok
test_material_solution (__main__.TestQuad) ... ok
test_bcdirichlet_info (__main__.TestTri) ... ok
test_bcdirichlet_solution (__main__.TestTri) ... ok
test_domain_solution (__main__.TestTri) ... ok
test_material_info (__main__.TestTri) ... ok
test_material_solution (__main__.TestTri) ... ok

======================================================================
FAIL: test_bcdirichlet_solution (__main__.TestQuad)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "TestGravity.py", line 83, in test_bcdirichlet_solution
    check_data(filename, self,
  File "/Users/baagaard/tools/unix/pylith-develop/clang-3.6.0-py38/lib/python3.8/site-packages/pylith/testing/FullTestApp.py", line 262, in check_data
    checker.checkVertexField(field)
  File "/Users/baagaard/tools/unix/pylith-develop/clang-3.6.0-py38/lib/python3.8/site-packages/pylith/testing/FullTestApp.py", line 161, in checkVertexField
    self._checkField(fieldName, fieldE, field, vertices, mask)
  File "/Users/baagaard/tools/unix/pylith-develop/clang-3.6.0-py38/lib/python3.8/site-packages/pylith/testing/FullTestApp.py", line 239, in _checkField
    self.testcase.assertEqual(npts, numpy.sum(okay))
AssertionError: 9 != 0

----------------------------------------------------------------------
Ran 10 tests in 8.293s

FAILED (failures=1)