TDycores-Project / TDycore

BSD 2-Clause "Simplified" License

Re-establishing code coverage, updating Docker image and PETSc version #236

Closed jeff-cohere closed 2 years ago

jeff-cohere commented 2 years ago

We recently observed a serious drop in code coverage. This PR seeks to restore code coverage testing in some demos that are missing it, which requires us to adopt new compilers to avoid a known internal compiler error (see comments). Updating compilers means updating our Docker image, and while we're at it, we're also moving to the latest release of PETSc, which contains enhancements to DMPlex.

jeff-cohere commented 2 years ago

Aha, we've rediscovered the internal compiler error that caused us to disable code coverage testing for the MPFA-O transient drivers a while back. Now that GCC 11 is out, maybe we can update the compilers we're using and see if it goes away.

jeff-cohere commented 2 years ago

Okay, I've managed to set up a Docker image with the new PETSc and newer compilers. Everything builds and runs, but there seem to be a few failures in some regression tests. I haven't looked at this closely yet, but I can reproduce at least one of these failures locally. Will investigate.

jeff-cohere commented 2 years ago

Below are the regression test failures I've reproduced. @bishtgautam , it's possible that the new PETSc requires some more code changes--for example, the DMPlexCreateFromFile function now has an additional argument.
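
For reference, here's the shape of that change as I read the 3.17 headers (a sketch in C; the file and object names are placeholders, not our actual code):

#include <petscdmplex.h>

/* PETSc 3.17: DMPlexCreateFromFile() takes a new third argument naming the
   created Plex; older releases took only (comm, filename, interpolate, dm).
   "mesh.exo" and "tdy" below are placeholders for illustration. */
static PetscErrorCode LoadMesh(MPI_Comm comm, DM *dm)
{
  PetscFunctionBegin;
  PetscCall(DMPlexCreateFromFile(comm, "mesh.exo", "tdy", PETSC_TRUE, dm));
  PetscFunctionReturn(0);
}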

Transient

$ cd demo/transient
$ ./transient_snes_mpfaof90 -mesh_filename ../../share/meshes/3x3_quad_surface_mesh.exo -dm_plex_extrude_layers 3 -dm_plex_extrude_normal 0.0,0.0,1.0 -tdy_mpfao_gmatrix_method  MPFAO_GMATRIX_TPF -tdy_mpfao_boundary_condition_type DIRICHLET_BC -tdy_water_density EXPONENTIAL -snes_monitor -tdy_regression_test -tdy_regression_test_num_cells_per_process 10 -tdy_regression_test_filename transient-snes-mpfaof90-dmplex-extrude
+++++++++++++++++ TDycore +++++++++++++++++
Creating TDycore
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Argument out of range
[0]PETSC ERROR: Number of thicknesses -1 must be positive
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.2, unknown 
[0]PETSC ERROR: ./transient_snes_mpfaof90 on a debug named teeny by jeff Tue Jun 28 18:31:04 2022
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --with-cxx-dialect=C++14 --with-mpiexec=mpiexec --with-debugging=1 --with-shared-libraries=1 --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-zlib --with-shared-libraries=1
[0]PETSC ERROR: #1 DMPlexTransformExtrudeSetThicknesses() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/transform/impls/extrude/plextrextrude.c:899
[0]PETSC ERROR: #2 DMPlexExtrude() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexextrude.c:67
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: The EXACT line numbers in the error traceback are not available.
[0]PETSC ERROR: instead the line number of the start of the function is given.
[0]PETSC ERROR: #1 TDySetFromOptions() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:611
[0]PETSC ERROR: #3 User provided function() at unknown file:0
[0]PETSC ERROR: Checking the memory for corruption.
No error traceback is available, the problem could be in the main program. 
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Richards

$ cd demo/richards
$ mpiexec -np 4 ./richards_driver -dm_plex_simplex 0 -dm_plex_dim 3 -dm_plex_box_faces 3,3,3 -dm_plex_box_lower 0,0,0 -dm_plex_box_upper 1,1,1 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename richards-driver-ts-prob1-np4 -tdy_final_time 3.1536e3 -tdy_dt_max 600. -tdy_dt_growth_factor 1.5 -tdy_timers -tdy_init_with_random_field -tdy_time_integration_method TS
+++++++++++++++++ TDycore +++++++++++++++++
Creating TDycore
Beginning Richards Driver simulation.
TDycore setup
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR:   
[0]PETSC ERROR: Did not find a corresponding cell below the given cell
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.2, unknown 
[0]PETSC ERROR: ./richards_driver on a debug named teeny by jeff Tue Jun 28 18:27:46 2022
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --with-cxx-dialect=C++14 --with-mpiexec=mpiexec --with-debugging=1 --with-shared-libraries=1 --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-zlib --with-shared-libraries=1
[0]PETSC ERROR: #1 RearrangeCellsInListAsNeighbors() at /home/jeff/projects/pnnl/TDycore/src/fv/share/tdymeshplex.c:2212
[0]PETSC ERROR: #2 RearrangeCellsInAntiClockwiseDir() at /home/jeff/projects/pnnl/TDycore/src/fv/share/tdymeshplex.c:2250
[0]PETSC ERROR: #3 DetermineUpwindFacesForSubcell_PlanarVerticalFaces() at /home/jeff/projects/pnnl/TDycore/src/fv/share/tdymeshplex.c:2653
[0]PETSC ERROR: #4 SetupSubcells() at /home/jeff/projects/pnnl/TDycore/src/fv/share/tdymeshplex.c:3050
[0]PETSC ERROR: #5 TDyMeshCreateFromPlex() at /home/jeff/projects/pnnl/TDycore/src/fv/share/tdymeshplex.c:3774
[0]PETSC ERROR: #6 ComputeAinvB() at /home/jeff/projects/pnnl/TDycore/src/fv/mpfao/tdympfao.c:1015
[0]PETSC ERROR: #7 ComputeTransmissibilityMatrix_ForNonCornerVertex() at /home/jeff/projects/pnnl/TDycore/src/fv/mpfao/tdympfao.c:1260
[0]PETSC ERROR: #8 ComputeTransmissibilityMatrix() at /home/jeff/projects/pnnl/TDycore/src/fv/mpfao/tdympfao.c:1728
[0]PETSC ERROR: #9 TDySetup_Richards_MPFAO() at /home/jeff/projects/pnnl/TDycore/src/fv/mpfao/tdympfao.c:2226
[0]PETSC ERROR: #10 TDySetup() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:735
[0]PETSC ERROR: #11 TDyDriverInitializeTDy() at /home/jeff/projects/pnnl/TDycore/src/tdydriver.c:83
[0]PETSC ERROR: #12 main() at /home/jeff/projects/pnnl/TDycore/demo/richards/richards_driver.c:33
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -dm_plex_box_faces 3,3,3
[0]PETSC ERROR: -dm_plex_box_lower 0,0,0
[0]PETSC ERROR: -dm_plex_box_upper 1,1,1
[0]PETSC ERROR: -dm_plex_dim 3
[0]PETSC ERROR: -dm_plex_simplex 0
[0]PETSC ERROR: -tdy_dt_growth_factor 1.5
[0]PETSC ERROR: -tdy_dt_max 600.
[0]PETSC ERROR: -tdy_final_time 3.1536e3
[0]PETSC ERROR: -tdy_init_with_random_field
[0]PETSC ERROR: -tdy_regression_test
[0]PETSC ERROR: -tdy_regression_test_filename richards-driver-ts-prob1-np4
[0]PETSC ERROR: -tdy_regression_test_num_cells_per_process 1
[0]PETSC ERROR: -tdy_time_integration_method TS
[0]PETSC ERROR: -tdy_timers
[0]PETSC ERROR: -tdy_water_density exponential
[0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint@mcs.anl.gov----------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 71.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

TH

$ cd demo/th
$ mpiexec -np 4 ./th_driver -dm_plex_simplex 0 -dm_plex_dim 3 -dm_plex_box_faces 2,2,2 -dm_plex_box_lower 0,0,0 -dm_plex_box_upper 1,1,1 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename th-driver-ts-prob1-np4 -tdy_final_time 3.1536e3 -tdy_dt_max 600. -tdy_dt_growth_factor 1.5 -tdy_timers -tdy_init_with_random_field -tdy_time_integration_method TS
+++++++++++++++++ TDycore +++++++++++++++++
Creating TDycore
Running TH mode.
Beginning TH Driver simulation.
TDycore setup
Using PETSc TS for time integration.
Creating Vectors
Creating Jacobian matrix
[1]PETSC ERROR: PetscTrFreeDefault() called from PetscSFLinkDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfpack.c:477
[1]PETSC ERROR: [2]PETSC ERROR: PetscTrFreeDefault() called from PetscSFLinkDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfpack.c:476
[2]PETSC ERROR: Block [id=23350(12)] at address 0x56001ff4c0a0 is corrupted (probably write past end of array)
[2]PETSC ERROR: Block allocated in PetscSFLinkCreate_MPI() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfmpi.c:189
[2]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[2]PETSC ERROR: Memory corruption: https://petsc.org/release/faq/#valgrind
[2]PETSC ERROR: Corrupted memory
[2]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[2]PETSC ERROR: Petsc Release Version 3.17.2, unknown 
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple MacOS to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: The EXACT line numbers in the error traceback are not available.
[0]PETSC ERROR: instead the line number of the start of the function is given.
Block [id=23440(10)] at address 0x55cf06d1c640 is corrupted (probably write past end of array)
[1]PETSC ERROR: Block allocated in PetscSFLinkCreate_MPI() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfmpi.c:200
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: Memory corruption: https://petsc.org/release/faq/#valgrind
[1]PETSC ERROR: Corrupted memory
[1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1]PETSC ERROR: Petsc Release Version 3.17.2, unknown 
[1]PETSC ERROR: ./th_driver on a debug named teeny by jeff Tue Jun 28 18:23:53 2022
[1]PETSC ERROR: [2]PETSC ERROR: ./th_driver on a debug named teeny by jeff Tue Jun 28 18:23:53 2022
[2]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --with-cxx-dialect=C++14 --with-mpiexec=mpiexec --with-debugging=1 --with-shared-libraries=1 --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-zlib --with-shared-libraries=1
[2]PETSC ERROR: #1 PetscTrFreeDefault() at /home/jeff/projects/pnnl/petsc/src/sys/memory/mtr.c:306
[0]PETSC ERROR: #1 PetscTrFreeDefault() at /home/jeff/projects/pnnl/petsc/src/sys/memory/mtr.c:275
[0]PETSC ERROR: #2 PetscSFLinkDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfpack.c:464
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --with-cxx-dialect=C++14 --with-mpiexec=mpiexec --with-debugging=1 --with-shared-libraries=1 --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-zlib --with-shared-libraries=1
[1]PETSC ERROR: #1 PetscTrFreeDefault() at /home/jeff/projects/pnnl/petsc/src/sys/memory/mtr.c:306
[1]PETSC ERROR: #2 PetscSFLinkDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfpack.c:477
[1]PETSC ERROR: #3 PetscSFReset_Basic() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfbasic.c:109
[1]PETSC ERROR: #4 PetscSFReset() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:115
[1]PETSC ERROR: #5 PetscSFDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:229
[1]PETSC ERROR: #6 PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:335
[1]PETSC ERROR: #7 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:753
[1]PETSC ERROR: #8 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2544
[1]PETSC ERROR: #9 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1432
[1]PETSC ERROR: #10 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:166
#3 PetscSFReset_Basic() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfbasic.c:96
[0]PETSC ERROR: #4 PetscSFReset() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:113
[0]PETSC ERROR: #5 PetscSFDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:225
[0]PETSC ERROR: #6 PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:320
[0]PETSC ERROR: #7 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:722
[0]PETSC ERROR: #8 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2492
[0]PETSC ERROR: #9 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1426
[0]PETSC ERROR: #10 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:158
[0]PETSC ERROR: #11 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2028
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: [2]PETSC ERROR: #2 PetscSFLinkDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfpack.c:476
[2]PETSC ERROR: #3 PetscSFReset_Basic() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfbasic.c:109
[2]PETSC ERROR: #4 PetscSFReset() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:115
[2]PETSC ERROR: #5 PetscSFDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:229
[2]PETSC ERROR: #6 PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:335
[2]PETSC ERROR: #7 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:753
[2]PETSC ERROR: #8 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2544
[2]PETSC ERROR: #9 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1432
[2]PETSC ERROR: #10 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:166
[2]PETSC ERROR: #11 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2032
[2]PETSC ERROR: #12 TDyDriverInitializeTDy() at /home/jeff/projects/pnnl/TDycore/src/tdydriver.c:87
[2]PETSC ERROR: #13 main() at /home/jeff/projects/pnnl/TDycore/demo/th/th_driver.c:33
[2]PETSC ERROR: Reached the main program with an out-of-range error code 1. This should never happen
[2]PETSC ERROR: PETSc Option Table entries:
[1]PETSC ERROR: #11 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2032
[1]PETSC ERROR: #12 TDyDriverInitializeTDy() at /home/jeff/projects/pnnl/TDycore/src/tdydriver.c:87
[1]PETSC ERROR: #13 main() at /home/jeff/projects/pnnl/TDycore/demo/th/th_driver.c:33
[1]PETSC ERROR: Reached the main program with an out-of-range error code 1. This should never happen
[1]PETSC ERROR: PETSc Option Table entries:
[1]PETSC ERROR: -dm_plex_box_faces 2,2,2
[1]PETSC ERROR: -dm_plex_box_lower 0,0,0
[1]PETSC ERROR: -dm_plex_box_upper 1,1,1
[1]PETSC ERROR: -dm_plex_dim 3
[1]PETSC ERROR: -dm_plex_simplex 0
[1]PETSC ERROR: -tdy_dt_growth_factor 1.5
[1]PETSC ERROR: -tdy_dt_max 600.
[1]PETSC ERROR: -tdy_final_time 3.1536e3
[1]PETSC ERROR: -tdy_init_with_random_field
[1]PETSC ERROR: [2]PETSC ERROR: -dm_plex_box_faces 2,2,2
[2]PETSC ERROR: -dm_plex_box_lower 0,0,0
[2]PETSC ERROR: -dm_plex_box_upper 1,1,1
[2]PETSC ERROR: -dm_plex_dim 3
[2]PETSC ERROR: -dm_plex_simplex 0
[2]PETSC ERROR: -tdy_dt_growth_factor 1.5
[2]PETSC ERROR: -tdy_dt_max 600.
[2]PETSC ERROR: -tdy_final_time 3.1536e3
[2]PETSC ERROR: -tdy_init_with_random_field
[2]PETSC ERROR: -tdy_regression_test
[2]PETSC ERROR: -tdy_regression_test_filename th-driver-ts-prob1-np4
[2]PETSC ERROR: -tdy_regression_test_num_cells_per_process 1
[2]PETSC ERROR: -tdy_time_integration_method TS
[2]PETSC ERROR: -tdy_timers
[2]PETSC ERROR: -tdy_water_density exponential
[2]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint@mcs.anl.gov----------
Signal received
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.2, unknown 
[0]PETSC ERROR: ./th_driver on a debug named teeny by jeff Tue Jun 28 18:23:53 2022
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --with-cxx-dialect=C++14 --with-mpiexec=mpiexec --with-debugging=1 --with-shared-libraries=1 --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-zlib --with-shared-libraries=1
[0]PETSC ERROR: #1 User provided function() at unknown file:0
[0]PETSC ERROR: Checking the memory for corruption.
The EXACT line numbers in the error traceback are not available.
Instead the line number of the start of the function is given.
[0] #1 PetscSFLinkDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfpack.c:464
[0] #2 PetscSFReset_Basic() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/impls/basic/sfbasic.c:96
[0] #3 PetscSFReset() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:113
[0] #4 PetscSFDestroy() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/interface/sf.c:225
[0] #5 PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:320
[0] #6 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:722
[0] #7 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2492
[0] #8 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1426
[0] #9 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:158
[0] #10 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2028
-tdy_regression_test
[1]PETSC ERROR: -tdy_regression_test_filename th-driver-ts-prob1-np4
[1]PETSC ERROR: -tdy_regression_test_num_cells_per_process 1
[1]PETSC ERROR: -tdy_time_integration_method TS
[1]PETSC ERROR: -tdy_timers
[1]PETSC ERROR: -tdy_water_density exponential
[1]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint@mcs.anl.gov----------
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 59.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[teeny:02277] 2 more processes have sent help message help-mpi-api.txt / mpi-abort
[teeny:02277] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
bishtgautam commented 2 years ago

@jeff-cohere How about switching to v3.16.2? It's the version PFLOTRAN currently uses, and it doesn't require code modifications.

knepley commented 2 years ago

@jeff-cohere I can at least solve the TRANSIENT error right away. The Fortran wrapper is autogenerated, and it does not check for a NULL pointer in the thicknesses position. We have to write a custom wrapper to do this.
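
Something along these lines, as a sketch only (untested; the wrapper that actually lands in PETSc may differ):

#include <petsc/private/fortranimpl.h>
#include <petscdmplextransform.h>

#if defined(PETSC_HAVE_FORTRAN_CAPS)
  #define dmplextransformextrudesetthicknesses_ DMPLEXTRANSFORMEXTRUDESETTHICKNESSES
#elif !defined(PETSC_HAVE_FORTRAN_UNDERSCORE)
  #define dmplextransformextrudesetthicknesses_ dmplextransformextrudesetthicknesses
#endif

/* Custom stub: map PETSC_NULL_REAL coming from Fortran to a C NULL before
   forwarding, which the autogenerated wrapper does not do. */
PETSC_EXTERN void dmplextransformextrudesetthicknesses_(DMPlexTransform *tr, PetscInt *Nth, PetscReal *thicknesses, PetscErrorCode *ierr)
{
  CHKFORTRANNULLREAL(thicknesses);
  *ierr = DMPlexTransformExtrudeSetThicknesses(*tr, *Nth, thicknesses);
}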

TH is a memory overwrite error. Can you run that in Valgrind and give me the output?

RICHARDS seems like a TDycore internal error.

bishtgautam commented 2 years ago

Okay, let's try to fix the code for PETSc v3.17.2.

jeff-cohere commented 2 years ago

Okay, I'll run Valgrind on the TH test, too. Gautam, should we be running tests for the FV-TPF implementation? Or is it too soon for that?

jeff-cohere commented 2 years ago

Also, @knepley : thanks for the tip on the F90 DMPlexExtrude call. It looks like the arguments have changed for that function, too, so that's an easy fix. Happy to do it if you're not already working on it.
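
For the record, here's my reading of the reordered 3.17 call in C (a sketch; the values are illustrative, so check them against the DMPlexExtrude man page):

#include <petscdmplex.h>

/* PETSc 3.17 DMPlexExtrude(): layers, total thickness, tensor cells?,
   symmetric?, extrusion normal[], per-layer thicknesses[] (NULL for uniform
   layers), extruded DM out. The values below are illustrative only. */
static PetscErrorCode ExtrudeSurfaceMesh(DM dm, DM *edm)
{
  const PetscReal normal[3] = {0.0, 0.0, 1.0};

  PetscFunctionBegin;
  PetscCall(DMPlexExtrude(dm, 3, 1.0, PETSC_TRUE, PETSC_FALSE, normal, NULL, edm));
  PetscFunctionReturn(0);
}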

jeff-cohere commented 2 years ago

Here's the relevant part of what Valgrind has to say about the TH memory corruption error:

...
==1475371== Invalid write of size 4
==1475371==    at 0x72512C2: PetscSectionSetDof (section.c:802)
==1475371==    by 0x7C2DA59: DMPlexCreateAdjacencySection_Static (plexpreallocate.c:472)
==1475371==    by 0x7C3144C: DMPlexPreallocateOperator (plexpreallocate.c:771)
==1475371==    by 0x7BA3AE5: DMCreateMatrix_Plex (plex.c:2544)
==1475371==    by 0x7E3BB30: DMCreateMatrix (dm.c:1432)
==1475371==    by 0x490EFC8: TDyDiscretizationCreateJacobianMatrix (tdydiscretization.c:166)
==1475371==    by 0x490E513: TDyCreateJacobian (tdycore.c:2032)
==1475371==    by 0x49112F3: TDyDriverInitializeTDy (tdydriver.c:87)
==1475371==    by 0x109AB5: main (th_driver.c:33)
==1475371==  Address 0x2194c93c is 12 bytes after a block of size 32 free'd
==1475371==    at 0x484B27F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1475371==    by 0x9B130D6: gk_free (memory.c:208)
==1475371==    by 0x9B3F3C9: libmetis__rpqDestroy (gklib.c:34)
==1475371==    by 0x9B398B3: libmetis__FM_2WayCutRefine (fm.c:197)
==1475371==    by 0x9B38300: libmetis__FM_2WayRefine (fm.c:20)
==1475371==    by 0x9B45CFD: libmetis__GrowBisection (initpart.c:298)
==1475371==    by 0x9B45282: libmetis__Init2WayPartition (initpart.c:48)
==1475371==    by 0x9B668F3: libmetis__MultilevelBisect (pmetis.c:243)
==1475371==    by 0x9B66535: libmetis__MlevelRecursiveBisection (pmetis.c:183)
==1475371==    by 0x9B66782: libmetis__MlevelRecursiveBisection (pmetis.c:209)
==1475371==    by 0x9B6632F: METIS_PartGraphRecursive (pmetis.c:133)
==1475371==    by 0x9B47496: libmetis__InitKWayPartitioning (kmetis.c:194)
==1475371==  Block was alloc'd at
==1475371==    at 0x4848899: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1475371==    by 0x9B12E16: gk_malloc (memory.c:147)
==1475371==    by 0x9B3F23C: libmetis__rpqCreate (gklib.c:34)
==1475371==    by 0x9B385B9: libmetis__FM_2WayCutRefine (fm.c:62)
==1475371==    by 0x9B38300: libmetis__FM_2WayRefine (fm.c:20)
==1475371==    by 0x9B45CFD: libmetis__GrowBisection (initpart.c:298)
==1475371==    by 0x9B45282: libmetis__Init2WayPartition (initpart.c:48)
==1475371==    by 0x9B668F3: libmetis__MultilevelBisect (pmetis.c:243)
==1475371==    by 0x9B66535: libmetis__MlevelRecursiveBisection (pmetis.c:183)
==1475371==    by 0x9B66782: libmetis__MlevelRecursiveBisection (pmetis.c:209)
==1475371==    by 0x9B6632F: METIS_PartGraphRecursive (pmetis.c:133)
==1475371==    by 0x9B47496: libmetis__InitKWayPartitioning (kmetis.c:194)
==1475371== 
==1475371== Invalid read of size 4
==1475371==    at 0x7251362: PetscSectionAddDof (section.c:826)
==1475371==    by 0x7C2DE85: DMPlexCreateAdjacencySection_Static (plexpreallocate.c:490)
==1475371==    by 0x7C3144C: DMPlexPreallocateOperator (plexpreallocate.c:771)
==1475371==    by 0x7BA3AE5: DMCreateMatrix_Plex (plex.c:2544)
==1475371==    by 0x7E3BB30: DMCreateMatrix (dm.c:1432)
==1475371==    by 0x490EFC8: TDyDiscretizationCreateJacobianMatrix (tdydiscretization.c:166)
==1475371==    by 0x490E513: TDyCreateJacobian (tdycore.c:2032)
==1475371==    by 0x49112F3: TDyDriverInitializeTDy (tdydriver.c:87)
==1475371==    by 0x109AB5: main (th_driver.c:33)
==1475371==  Address 0x2194c91c is 12 bytes inside a block of size 32 free'd
...
jeff-cohere commented 2 years ago

My fix to the transient extrusion problem got it running again, but the regression test still fails because of differences from our baselines:

----------------------------------------
transient-snes-mpfaof90-dmplex-extrude...
    cd /home/jeff/projects/pnnl/TDycore/demo/transient
    /home/jeff/projects/pnnl/TDycore/demo/transient/transient_snes_mpfaof90 -ma
    # transient-snes-mpfaof90-dmplex-extrude : run time : 1.20 seconds
    diff transient-snes-mpfaof90-dmplex-extrude.regression.gold transient-snes-
    FAIL: Liquid Pressure:Max : 0.05165348704928539 > 1e-12 [relative]
    FAIL: Liquid Pressure:Min : 0.0012106965185768362 > 1e-12 [relative]
    FAIL: Liquid Pressure:Mean : 0.016393590416747494 > 1e-12 [relative]
    FAIL: Liquid Pressure:0 : 0.0012106967574871913 > 1e-12 [relative]
    FAIL: Liquid Pressure:2 : 0.031121615192353066 > 1e-12 [relative]
    FAIL: Liquid Pressure:4 : 0.010385206661480422 > 1e-12 [relative]
    FAIL: Liquid Pressure:6 : 0.0012106966381568152 > 1e-12 [relative]
    FAIL: Liquid Pressure:8 : 0.03112161384536345 > 1e-12 [relative]
    FAIL: Liquid Pressure:10 : 0.010385206661422118 > 1e-12 [relative]
    FAIL: Liquid Pressure:12 : 0.0035453382620971004 > 1e-12 [relative]
    FAIL: Liquid Pressure:14 : 0.05165348704928539 > 1e-12 [relative]
    FAIL: Liquid Pressure:16 : 0.010385205777137793 > 1e-12 [relative]
    FAIL: Liquid Pressure:18 : 0.001210696637877151 > 1e-12 [relative]
transient-snes-mpfaof90-dmplex-extrude... failed.
----------------------------------------

I might not have transliterated the arguments correctly to DMPlexExtrude in that demo. @bishtgautam , can you take a look and see if it's doing what you think it's supposed to be doing? Here's the documentation for the new DMPlexExtrude function.

If this isn't an easy fix, I'll disable that test as well so we can check on coverage.

bishtgautam commented 2 years ago

@jeff-cohere I'm installing v3.17.2 on my machine and will work on debugging these failures later today.

knepley commented 2 years ago

> @jeff-cohere I'm installing v3.17.2 on my machine and will work on debugging these failures later today.

Thanks for fixing this Jeff. If you need me to push that into PETSc, let me know.

knepley commented 2 years ago

> Here's the relevant part of what Valgrind has to say about the TH memory corruption error: [...]

@jeff-cohere This definitely looks like a PETSc bug in the preallocation, but it is incredibly strange that it would happen there. There are explicit bounds checks in the code. Are you running in optimized mode?
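
To illustrate why the build mode matters here (a sketch, not PETSc's actual source): PetscAssert() compiles to a real check under --with-debugging=1 and to nothing in optimized builds, so a bad point number that a debug build traps as ARG_OUTOFRANGE can scribble silently in an optimized one.

#include <petscsys.h>

/* Sketch: a debug build errors out cleanly on a bad point; in an optimized
   build the assertion vanishes and the caller writes out of bounds. */
static PetscErrorCode CheckPoint(PetscInt p, PetscInt pStart, PetscInt pEnd)
{
  PetscFunctionBegin;
  PetscAssert(p >= pStart && p < pEnd, PETSC_COMM_SELF, PETSC_ERR_ARG_OUTOFRANGE,
              "Point %" PetscInt_FMT " not in [%" PetscInt_FMT ", %" PetscInt_FMT ")", p, pStart, pEnd);
  PetscFunctionReturn(0);
}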

jeff-cohere commented 2 years ago

> Thanks for fixing this Jeff. If you need me to push that into PETSc, let me know.

To be clear, I just reordered the parameters in the function call within the demo. I haven't looked at PETSc's generated Fortran code (but wouldn't be opposed to helping validate arguments if there's a mechanism for doing that).

> @jeff-cohere This definitely looks like a PETSc bug in the preallocation, but it is incredibly strange that it would happen there. There are explicit bounds checks in the code. Are you running in optimized mode?

I believe so. The PETSc I ran this against was configured thus:

#!/usr/bin/python3
if __name__ == '__main__':
  import sys
  import os
  sys.path.insert(0, os.path.abspath('config'))
  import configure
  configure_options = [
    '--CFLAGS=-g -O0',
    '--CXXFLAGS=-g -O0',
    '--FFLAGS=-g -O0 -Wno-unused-function',
    '--download-exodusii',
    '--download-fblaslapack',
    '--download-hdf5',
    '--download-metis',
    '--download-netcdf',
    '--download-parmetis',
    '--download-pnetcdf',
    '--with-mpich-pkg-config=/usr',
    '--with-clanguage=c',
    '--with-debugging=0',
    '--with-shared-libraries=1',
    '--with-zlib',
    'PETSC_ARCH=opt',
  ]
  configure.petsc_configure(configure_options)

I can also build a debug version and get you a traceback for that.

jeff-cohere commented 2 years ago

Also, I see a new release (v3.17.3) just came out. Should we be using that, to make sure we're getting the latest and greatest?

jeff-cohere commented 2 years ago

Here's the business end of Valgrind's report on the TH error (debug edition):

==1783749== Invalid write of size 4
==1783749==    at 0x75A4C79: UnpackAndInsert_PetscInt_1_1 (sfpack.c:374)
==1783749==    by 0x7714DB0: PetscSFLinkUnpackLeafData_Private (sfpack.c:1094)
==1783749==    by 0x7715C3A: PetscSFLinkUnpackLeafData (sfpack.c:1124)
==1783749==    by 0x759E1FC: PetscSFBcastEnd_Basic (sfbasic.c:212)
==1783749==    by 0x756A7E7: PetscSFBcastEnd (sf.c:1472)
==1783749==    by 0x7593811: PetscSFCreateRemoteOffsets (sfutils.c:334)
==1783749==    by 0x85124AD: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1783749==    by 0x842B769: DMCreateMatrix_Plex (plex.c:2544)
==1783749==    by 0x82E5D11: DMCreateMatrix (dm.c:1432)
==1783749==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1783749==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==1783749==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==1783749==  Address 0x25b4afac is 8 bytes after a block of size 1,652 alloc'd
==1783749==    at 0x484E120: memalign (in /usr/libexec/valgrind/vgpreload_memch
==1783749==    by 0x744055C: PetscMallocAlign (mal.c:48)
==1783749==    by 0x74446C9: PetscTrMallocDefault (mtr.c:183)
==1783749==    by 0x7442663: PetscMallocA (mal.c:414)
==1783749==    by 0x75936CB: PetscSFCreateRemoteOffsets (sfutils.c:332)
==1783749==    by 0x85124AD: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1783749==    by 0x842B769: DMCreateMatrix_Plex (plex.c:2544)
==1783749==    by 0x82E5D11: DMCreateMatrix (dm.c:1432)
==1783749==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1783749==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==1783749==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==1783749==    by 0x109B04: main (th_driver.c:33)
==1783749==
iver.c:33)
==1783748==
==1783748== Invalid write of size 8
==1783748==    at 0x4852990: memmove (in /usr/libexec/valgrind/vgpreload_memche
==1783748==    by 0x75A2229: PetscMemcpy (petscsys.h:1634)
==1783748==    by 0x75A4B1C: UnpackAndInsert_PetscInt_1_1 (sfpack.c:374)
==1783748==    by 0x7714DB0: PetscSFLinkUnpackLeafData_Private (sfpack.c:1094)
==1783748==    by 0x7715C3A: PetscSFLinkUnpackLeafData (sfpack.c:1124)
==1783748==    by 0x759E1FC: PetscSFBcastEnd_Basic (sfbasic.c:212)
==1783748==    by 0x756A7E7: PetscSFBcastEnd (sf.c:1472)
==1783748==    by 0x7593811: PetscSFCreateRemoteOffsets (sfutils.c:334)
==1783748==    by 0x85124AD: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1783748==    by 0x842B769: DMCreateMatrix_Plex (plex.c:2544)
==1783748==    by 0x82E5D11: DMCreateMatrix (dm.c:1432)
==1783748==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1783748==  Address 0x24a47df0 is 1,648 bytes inside a block of size 1,652 all
==1783748==    at 0x484E120: memalign (in /usr/libexec/valgrind/vgpreload_memch
==1783748==    by 0x744055C: PetscMallocAlign (mal.c:48)
==1783748==    by 0x74446C9: PetscTrMallocDefault (mtr.c:183)
==1783748==    by 0x7442663: PetscMallocA (mal.c:414)
==1783748==    by 0x75936CB: PetscSFCreateRemoteOffsets (sfutils.c:332)
==1783748==    by 0x85124AD: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1783748==    by 0x842B769: DMCreateMatrix_Plex (plex.c:2544)
==1783748==    by 0x82E5D11: DMCreateMatrix (dm.c:1432)
==1783748==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1783748==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==1783748==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==1783748==    by 0x109B04: main (th_driver.c:33)
==1783748==

And here's the error message I get from running the debuggable executable:

[0]PETSC ERROR: PetscTrFreeDefault() called from DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
[0]PETSC ERROR: [1]PETSC ERROR: PetscTrFreeDefault() called from DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
[1]PETSC ERROR: Block [id=23900(32)] at address 0x24a53a60 is corrupted (probably write past end of array)
[0]PETSC ERROR: Block allocated in PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:332
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Memory corruption: https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: Corrupted memory
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.2, unknown 
[0]PETSC ERROR: ./th_driver on a debug named crunchy by jeff Wed Jun 29 11:40:58 2022
Block [id=23425(32)] at address 0x24a47780 is corrupted (probably write past end of array)
[0]PETSC ERROR: Configure options --CFLAGS=-g --CXXFLAGS=-g --FFLAGS="-g -Wno-unused-function" --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-mpi-pkg-config=/usr --with-clanguage=c --with-debugging=1 --with-shared-libraries=1 --with-zlib PETSC_ARCH=debug
[1]PETSC ERROR: Block allocated in PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:332
[0]PETSC ERROR: #1 PetscTrFreeDefault() at /home/jeff/projects/pnnl/petsc/src/sys/memory/mtr.c:306
[0]PETSC ERROR: #2 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
[0]PETSC ERROR: #3 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2544
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: #4 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1432
[0]PETSC ERROR: #5 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:166
[0]PETSC ERROR: #6 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2032
[0]PETSC ERROR: #7 TDyDriverInitializeTDy() at /home/jeff/projects/pnnl/TDycore/src/tdydriver.c:87
[1]PETSC ERROR: Memory corruption: https://petsc.org/release/faq/#valgrind
[1]PETSC ERROR: Corrupted memory
[0]PETSC ERROR: [1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
#8 main() at /home/jeff/projects/pnnl/TDycore/demo/th/th_driver.c:33
[1]PETSC ERROR: Petsc Release Version 3.17.2, unknown 
[0]PETSC ERROR: Reached the main program with an out-of-range error code 1. This should never happen
[1]PETSC ERROR: ./th_driver on a debug named crunchy by jeff Wed Jun 29 11:40:58 2022
[1]PETSC ERROR: Configure options --CFLAGS=-g --CXXFLAGS=-g --FFLAGS="-g -Wno-unused-function" --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-mpi-pkg-config=/usr --with-clanguage=c --with-debugging=1 --with-shared-libraries=1 --with-zlib PETSC_ARCH=debug
[1]PETSC ERROR: #1 PetscTrFreeDefault() at /home/jeff/projects/pnnl/petsc/src/sys/memory/mtr.c:306
[0]PETSC ERROR: PETSc Option Table entries:
[0]PETSC ERROR: -dm_plex_box_faces 2,2,2
[0]PETSC ERROR: -dm_plex_box_lower 0,0,0
[0]PETSC ERROR: -dm_plex_box_upper 1,1,1
[0]PETSC ERROR: -dm_plex_dim 3
[0]PETSC ERROR: -dm_plex_simplex 0
[0]PETSC ERROR: -tdy_dt_growth_factor 1.5
[0]PETSC ERROR: -tdy_dt_max 600.
[0]PETSC ERROR: -tdy_final_time 3.1536e3
[0]PETSC ERROR: -tdy_init_with_random_field
[1]PETSC ERROR: [0]PETSC ERROR: #2 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
-tdy_regression_test
[0]PETSC ERROR: -tdy_regression_test_filename th-driver-ts-prob1-np4
[0]PETSC ERROR: -tdy_regression_test_num_cells_per_process 1
[0]PETSC ERROR: -tdy_time_integration_method TS
[0]PETSC ERROR: -tdy_timers
[0]PETSC ERROR: -tdy_water_density exponential
[1]PETSC ERROR: #3 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2544
[1]PETSC ERROR: #4 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1432
[0]PETSC ERROR: ----------------End of Error Message -------send entire error message to petsc-maint@mcs.anl.gov----------
[1]PETSC ERROR: #5 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:166
[1]PETSC ERROR: #6 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2032
[1]PETSC ERROR: #7 TDyDriverInitializeTDy() at /home/jeff/projects/pnnl/TDycore/src/tdydriver.c:87
[1]PETSC ERROR: #8 main() at /home/jeff/projects/pnnl/TDycore/demo/th/th_driver.c:33
[1]PETSC ERROR: Reached the main program with an out-of-range error code 1. This should never happen
[1]PETSC ERROR: PETSc Option Table entries:
[1]PETSC ERROR: -dm_plex_box_faces 2,2,2
[1]PETSC ERROR: -dm_plex_box_lower 0,0,0
[1]PETSC ERROR: -dm_plex_box_upper 1,1,1
[1]PETSC ERROR: -dm_plex_dim 3
[1]PETSC ERROR: -dm_plex_simplex 0
[1]PETSC ERROR: -tdy_dt_growth_factor 1.5
[1]PETSC ERROR: -tdy_dt_max 600.
[1]PETSC ERROR: -tdy_final_time 3.1536e3
Abort(1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0
[1]PETSC ERROR: -tdy_init_with_random_field
[1]PETSC ERROR: -tdy_regression_test
[1]PETSC ERROR: -tdy_regression_test_filename th-driver-ts-prob1-np4
[1]PETSC ERROR: -tdy_regression_test_num_cells_per_process 1
[1]PETSC ERROR: -tdy_time_integration_method TS
[1]PETSC ERROR: -tdy_timers
[1]PETSC ERROR: -tdy_water_density exponential
knepley commented 2 years ago

> I believe so. The PETSc I ran this against was configured thus: [...] I can also build a debug version and get you a traceback for that.

@jeff-cohere Okay, I think we should run a debugging version; these valgrind errors would be caught as logic errors in debugging mode. I don't know exactly what they are, but they look like out-of-bounds point numbers being handed in. We can Zoom sometime to track them down. I don't think it would take longer than an hour.

jeff-cohere commented 2 years ago

Thanks, Matt. I added the error message generated by the Valgrind run in the message just above your most recent comment. Looks like PETSc found a bad write with a bounds check, just as you suggested. If this isn't enough to go on, let's try to set up a session.

knepley commented 2 years ago

> Thanks, Matt. I added the error message generated by the Valgrind run in the message just above your most recent comment. Looks like PETSc found a bad write with a bounds check, just as you suggested. If this isn't enough to go on, let's try to set up a session.

I can see where it says the memory overwrite occurs, but it should be impossible. Can you run with -dm_view_preallocation so we can get more information about the preallocation step?

jeff-cohere commented 2 years ago

> I can see where it says the memory overwrite occurs, but it should be impossible. Can you run with -dm_view_preallocation so we can get more information about the preallocation step?

Sure. But first, let me update to PETSc v3.17.3 to make sure we're not hitting something that's already been fixed (@bishtgautam's recommendation). Then I'll rerun with the -dm_view_preallocation flag.

jeff-cohere commented 2 years ago

Okay. Here's what we get with PETSc v3.17.3:

jeff@crunchy:~/projects/pnnl/TDycore/demo/th$ mpirun -np 4 valgrind --log-file=poo ./th_driver -dm_plex_simplex 0 -dm_plex_dim 3 -dm_plex_box_faces 2,2,2 -dm_plex_box_lower 0,0,0 -dm_plex_box_upper 1,1,1 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename th-driver-ts-prob1-np4 -tdy_final_time 3.1536e3 -tdy_dt_max 600. -tdy_dt_growth_factor 1.5 -tdy_timers -tdy_init_with_random_field -tdy_time_integration_method TS -dm_view_preallocation
+++++++++++++++++ TDycore +++++++++++++++++
Creating TDycore
Running TH mode.
Beginning TH Driver simulation.
TDycore setup
Using PETSc TS for time integration.
Creating Vectors
Creating Jacobian matrix
Input Section for Preallocation:
PetscSection Object: 4 MPI processes
  type not yet set
2 fields
  field 0 with 1 components
Process 0:
  (   0) dim  1 offset   0
  (   1) dim  1 offset   2
  (   2) dim  1 offset   4
  (   3) dim  1 offset   6
  (   4) dim  1 offset   8
  (   5) dim  1 offset  10
  (   6) dim  1 offset  12
  (   7) dim  1 offset  14
Process 1:
  (   0) dim  1 offset   0
  (   1) dim  1 offset   2
  (   2) dim  1 offset   4
  (   3) dim  1 offset   6
  (   4) dim  1 offset   8
  (   5) dim  1 offset  10
  (   6) dim  1 offset  12
  (   7) dim  1 offset  14
Process 2:
  (   0) dim  1 offset   0
  (   1) dim  1 offset   2
  (   2) dim  1 offset   4
  (   3) dim  1 offset   6
  (   4) dim  1 offset   8
  (   5) dim  1 offset  10
  (   6) dim  1 offset  12
  (   7) dim  1 offset  14
Process 3:
  (   0) dim  1 offset   0
  (   1) dim  1 offset   2
  (   2) dim  1 offset   4
  (   3) dim  1 offset   6
  (   4) dim  1 offset   8
  (   5) dim  1 offset  10
  (   6) dim  1 offset  12
  (   7) dim  1 offset  14
  field 1 with 1 components
Process 0:
  (   0) dim  1 offset   1
  (   1) dim  1 offset   3
  (   2) dim  1 offset   5
  (   3) dim  1 offset   7
  (   4) dim  1 offset   9
  (   5) dim  1 offset  11
  (   6) dim  1 offset  13
  (   7) dim  1 offset  15
Process 1:
  (   0) dim  1 offset   1
  (   1) dim  1 offset   3
  (   2) dim  1 offset   5
  (   3) dim  1 offset   7
  (   4) dim  1 offset   9
  (   5) dim  1 offset  11
  (   6) dim  1 offset  13
  (   7) dim  1 offset  15
Process 2:
  (   0) dim  1 offset   1
  (   1) dim  1 offset   3
  (   2) dim  1 offset   5
  (   3) dim  1 offset   7
  (   4) dim  1 offset   9
  (   5) dim  1 offset  11
  (   6) dim  1 offset  13
  (   7) dim  1 offset  15
Process 3:
  (   0) dim  1 offset   1
  (   1) dim  1 offset   3
  (   2) dim  1 offset   5
  (   3) dim  1 offset   7
  (   4) dim  1 offset   9
  (   5) dim  1 offset  11
  (   6) dim  1 offset  13
  (   7) dim  1 offset  15
Input Global Section for Preallocation:
PetscSection Object: 4 MPI processes
  type not yet set
2 fields
  field 0 with 1 components
Process 0:
  (   0) dim  1 offset   0
  (   1) dim  1 offset   2
  (   2) dim -2 offset  -5
  (   3) dim -2 offset  -7
  (   4) dim -2 offset -13
  (   5) dim -2 offset -15
  (   6) dim -2 offset  -9
  (   7) dim -2 offset -11
Process 1:
  (   0) dim  1 offset   4
  (   1) dim  1 offset   6
  (   2) dim -2 offset  -1
  (   3) dim -2 offset  -3
  (   4) dim -2 offset -13
  (   5) dim -2 offset -15
  (   6) dim -2 offset  -9
  (   7) dim -2 offset -11
Process 2:
  (   0) dim  1 offset   8
  (   1) dim  1 offset  10
  (   2) dim -2 offset  -1
  (   3) dim -2 offset  -3
  (   4) dim -2 offset -13
  (   5) dim -2 offset -15
  (   6) dim -2 offset  -5
  (   7) dim -2 offset  -7
Process 3:
  (   0) dim  1 offset  12
  (   1) dim  1 offset  14
  (   2) dim -2 offset  -1
  (   3) dim -2 offset  -3
  (   4) dim -2 offset  -9
  (   5) dim -2 offset -11
  (   6) dim -2 offset  -5
  (   7) dim -2 offset  -7
  field 1 with 1 components
Process 0:
  (   0) dim  1 offset   1
  (   1) dim  1 offset   3
  (   2) dim -2 offset  -6
  (   3) dim -2 offset  -8
  (   4) dim -2 offset -14
  (   5) dim -2 offset -16
  (   6) dim -2 offset -10
  (   7) dim -2 offset -12
Process 1:
  (   0) dim  1 offset   5
  (   1) dim  1 offset   7
  (   2) dim -2 offset  -2
  (   3) dim -2 offset  -4
  (   4) dim -2 offset -14
  (   5) dim -2 offset -16
  (   6) dim -2 offset -10
  (   7) dim -2 offset -12
Process 2:
  (   0) dim  1 offset   9
  (   1) dim  1 offset  11
  (   2) dim -2 offset  -2
  (   3) dim -2 offset  -4
  (   4) dim -2 offset -14
  (   5) dim -2 offset -16
  (   6) dim -2 offset  -6
  (   7) dim -2 offset  -8
Process 3:
  (   0) dim  1 offset  13
  (   1) dim  1 offset  15
  (   2) dim -2 offset  -2
  (   3) dim -2 offset  -4
  (   4) dim -2 offset -10
  (   5) dim -2 offset -12
  (   6) dim -2 offset  -6
  (   7) dim -2 offset  -8
Input SF for Preallocation:
PetscSF Object: 4 MPI processes
  type: basic
  [0] Number of roots=45, leaves=25, remote ranks=3
  [0] 5 <- (3,10)
  [0] 6 <- (1,2)
  [0] 7 <- (3,12)
  [0] 8 <- (2,6)
  [0] 9 <- (3,2)
  [0] 10 <- (2,9)
  [0] 11 <- (3,8)
  [0] 12 <- (2,4)
  [0] 13 <- (2,5)
  [0] 21 <- (3,20)
  [0] 22 <- (1,14)
  [0] 23 <- (2,22)
  [0] 24 <- (2,18)
  [0] 33 <- (3,38)
  [0] 34 <- (3,40)
  [0] 35 <- (1,27)
  [0] 36 <- (1,29)
  [0] 37 <- (3,43)
  [0] 38 <- (2,33)
  [0] 39 <- (2,35)
  [0] 40 <- (2,37)
  [0] 41 <- (3,33)
  [0] 42 <- (2,30)
  [0] 43 <- (2,31)
  [0] 44 <- (2,32)
  [1] Number of roots=45, leaves=21, remote ranks=2
  [1] 4 <- (3,12)
  [1] 5 <- (3,13)
  [1] 6 <- (3,2)
  [1] 7 <- (3,3)
  [1] 8 <- (3,4)
  [1] 9 <- (3,5)
  [1] 10 <- (2,3)
  [1] 12 <- (2,5)
  [1] 17 <- (3,24)
  [1] 19 <- (3,14)
  [1] 20 <- (2,15)
  [1] 26 <- (3,42)
  [1] 31 <- (3,43)
  [1] 32 <- (3,44)
  [1] 33 <- (3,25)
  [1] 34 <- (3,26)
  [1] 35 <- (3,27)
  [1] 36 <- (3,28)
  [1] 38 <- (2,27)
  [1] 40 <- (2,29)
  [1] 43 <- (2,32)
  [2] Number of roots=45, leaves=9, remote ranks=1
  [2] 7 <- (3,2)
  [2] 11 <- (3,4)
  [2] 12 <- (3,6)
  [2] 13 <- (3,8)
  [2] 24 <- (3,15)
  [2] 41 <- (3,27)
  [2] 42 <- (3,31)
  [2] 43 <- (3,33)
  [2] 44 <- (3,35)
  [3] Number of roots=45, leaves=0, remote ranks=0
  MultiSF sort=rank-order
[0]PETSC ERROR: PetscTrFreeDefault() called from DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
[0]PETSC ERROR: [1]PETSC ERROR: PetscTrFreeDefault() called from DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
[1]PETSC ERROR: Block [id=24084(32)] at address 0xfcdc410 is corrupted (probably write past end of array)
[0]PETSC ERROR: Block allocated in PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:332
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Memory corruption: https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: Corrupted memory
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.17.3, unknown 
Block [id=23584(32)] at address 0x224766e0 is corrupted (probably write past end of array)
[0]PETSC ERROR: [1]PETSC ERROR: ./th_driver on a debug named crunchy by jeff Wed Jun 29 12:28:12 2022
Block allocated in PetscSFCreateRemoteOffsets() at /home/jeff/projects/pnnl/petsc/src/vec/is/sf/utils/sfutils.c:332
[0]PETSC ERROR: Configure options --CFLAGS=-g --CXXFLAGS=-g --FFLAGS="-g -Wno-unused-function" --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-mpi-pkg-config=/usr --with-clanguage=c --with-debugging=1 --with-shared-libraries=1 --with-zlib PETSC_ARCH=debug
[0]PETSC ERROR: #1 PetscTrFreeDefault() at /home/jeff/projects/pnnl/petsc/src/sys/memory/mtr.c:306
[0]PETSC ERROR: #2 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
[1]PETSC ERROR: [0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
#3 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2544
[0]PETSC ERROR: #4 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1432
[1]PETSC ERROR: Memory corruption: https://petsc.org/release/faq/#valgrind
[0]PETSC ERROR: [1]PETSC ERROR: Corrupted memory
#5 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:166
[1]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[1]PETSC ERROR: Petsc Release Version 3.17.3, unknown 
[1]PETSC ERROR: ./th_driver on a debug named crunchy by jeff Wed Jun 29 12:28:12 2022
[1]PETSC ERROR: Configure options --CFLAGS=-g --CXXFLAGS=-g --FFLAGS="-g -Wno-unused-function" --download-exodusii --download-fblaslapack --download-hdf5 --download-metis --download-netcdf --download-parmetis --download-pnetcdf --with-mpi-pkg-config=/usr --with-clanguage=c --with-debugging=1 --with-shared-libraries=1 --with-zlib PETSC_ARCH=debug
[0]PETSC ERROR: [1]PETSC ERROR: #6 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2032
#1 PetscTrFreeDefault() at /home/jeff/projects/pnnl/petsc/src/sys/memory/mtr.c:306
[0]PETSC ERROR: #7 TDyDriverInitializeTDy() at /home/jeff/projects/pnnl/TDycore/src/tdydriver.c:87
[1]PETSC ERROR: #2 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:755
[0]PETSC ERROR: #8 main() at /home/jeff/projects/pnnl/TDycore/demo/th/th_driver.c:33
[1]PETSC ERROR: #3 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2544
[0]PETSC ERROR: Reached the main program with an out-of-range error code 1. This should never happen
[1]PETSC ERROR: #4 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1432
[1]PETSC ERROR: [0]PETSC ERROR: #5 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:166
...

And here's the Valgrind log:

...
==1924118== Invalid write of size 4
==1924118==    at 0x76DA7BA: UnpackAndInsert_PetscInt_1_1 (sfpack.c:374)
==1924118==    by 0x784A8F1: PetscSFLinkUnpackLeafData_Private (sfpack.c:1094)
==1924118==    by 0x784B77B: PetscSFLinkUnpackLeafData (sfpack.c:1124)
==1924118==    by 0x76D3D3D: PetscSFBcastEnd_Basic (sfbasic.c:212)
==1924118==    by 0x7690ADA: PetscSFBcastEnd (sf.c:1472)
==1924118==    by 0x76B9B04: PetscSFCreateRemoteOffsets (sfutils.c:334)
==1924118==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1924118==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==1924118==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==1924118==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1924118==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==1924118==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==1924118==  Address 0x2192f5cc is 8 bytes after a block of size 1,652 alloc'd
==1924118==    at 0x484E120: memalign (in /usr/libexec/valgrind/vgpreload_memch
==1924118==    by 0x71417EA: PetscMallocAlign (mal.c:48)
==1924118==    by 0x7145957: PetscTrMallocDefault (mtr.c:183)
==1924118==    by 0x71438F1: PetscMallocA (mal.c:414)
==1924118==    by 0x76B99BE: PetscSFCreateRemoteOffsets (sfutils.c:332)
==1924118==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1924118==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==1924118==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==1924118==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1924118==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==1924118==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==1924118==    by 0x109B04: main (th_driver.c:33)
==1924118==
iver.c:33)
==1924117==
==1924117== Invalid write of size 8
==1924117==    at 0x4852990: memmove (in /usr/libexec/valgrind/vgpreload_memche
==1924117==    by 0x76D7D6A: PetscMemcpy (petscsys.h:1634)
==1924117==    by 0x76DA65D: UnpackAndInsert_PetscInt_1_1 (sfpack.c:374)
==1924117==    by 0x784A8F1: PetscSFLinkUnpackLeafData_Private (sfpack.c:1094)
==1924117==    by 0x784B77B: PetscSFLinkUnpackLeafData (sfpack.c:1124)
==1924117==    by 0x76D3D3D: PetscSFBcastEnd_Basic (sfbasic.c:212)
==1924117==    by 0x7690ADA: PetscSFBcastEnd (sf.c:1472)
==1924117==    by 0x76B9B04: PetscSFCreateRemoteOffsets (sfutils.c:334)
==1924117==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1924117==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==1924117==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==1924117==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1924117==  Address 0x22476d50 is 1,648 bytes inside a block of size 1,652 all
==1924117==    at 0x484E120: memalign (in /usr/libexec/valgrind/vgpreload_memch
==1924117==    by 0x71417EA: PetscMallocAlign (mal.c:48)
==1924117==    by 0x7145957: PetscTrMallocDefault (mtr.c:183)
==1924117==    by 0x71438F1: PetscMallocA (mal.c:414)
==1924117==    by 0x76B99BE: PetscSFCreateRemoteOffsets (sfutils.c:332)
==1924117==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==1924117==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==1924117==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==1924117==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscreti
==1924117==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==1924117==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==1924117==    by 0x109B04: main (th_driver.c:33)
...
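One observation on these traces: the TH and Richards failures implicate the same spot. PetscSFCreateRemoteOffsets allocates an offsets array at sfutils.c:332 and fills it with a star-forest broadcast at sfutils.c:334, and the invalid writes land just past that allocation, so one plausible reading is that the SF hands more leaves to the unpack than the allocation accounted for. Here's a sketch of the general call pattern involved (illustrative only; DMPlexPreallocateOperator actually calls this on an adjacency SF and adjacency sections, not the point SF as shown here):

#include <petscdmplex.h>

/* Sketch only: the remote-offsets pattern the Valgrind traces point at.
   The offsets array is allocated and filled inside the call; the
   reported out-of-bounds writes happen during that fill. */
static PetscErrorCode ShowRemoteOffsetsPattern(DM dm)
{
  PetscSF      sf;
  PetscSection local, global;
  PetscInt    *remoteOffsets = NULL;

  PetscFunctionBeginUser;
  PetscCall(DMGetPointSF(dm, &sf));
  PetscCall(DMGetLocalSection(dm, &local));
  PetscCall(DMGetGlobalSection(dm, &global));
  /* Allocates remoteOffsets (one entry per leaf) and fills it via an
     SF broadcast -- the step Valgrind flags above. */
  PetscCall(PetscSFCreateRemoteOffsets(sf, global, local, &remoteOffsets));
  PetscCall(PetscFree(remoteOffsets));
  PetscFunctionReturn(0);
}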
bishtgautam commented 2 years ago

@knepley For the -dm_plex_extrude_normal option used in the TRANSIENT test, I'm getting different cell volumes under v3.16.2 and v3.17.2. Is a difference in cell volumes between the two PETSc versions expected?
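For reference, here's a minimal sketch of one way to dump per-cell volumes so the output of a v3.16.2 build can be diffed against a v3.17.2 build. This is illustrative only (not the code our regression tests use), and it assumes dm is the interpolated, extruded DMPlex from the TRANSIENT test:

#include <petscdmplex.h>

/* Illustrative helper: print the FVM volume of every local cell so
   outputs from two PETSc builds can be compared side by side. */
static PetscErrorCode DumpCellVolumes(DM dm)
{
  PetscInt cStart, cEnd, c;

  PetscFunctionBeginUser;
  PetscCall(DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd)); /* height 0 = cells */
  for (c = cStart; c < cEnd; ++c) {
    PetscReal vol, centroid[3];
    PetscCall(DMPlexComputeCellGeometryFVM(dm, c, &vol, centroid, NULL));
    PetscCall(PetscPrintf(PETSC_COMM_SELF, "cell %" PetscInt_FMT ": volume = %g\n", c, (double)vol));
  }
  PetscFunctionReturn(0);
}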

bishtgautam commented 2 years ago

On my machine, for the RICHARDS test, I don't get a TDycore internal error, but I do get a PETSc error similar to the one from the TH test:

/home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/bin/mpiexec --oversubscribe -n 4 ./richards_driver -malloc 0 -successful_exit_code 0 -dm_plex_simplex 0 -dm_plex_dim 3 -dm_plex_box_faces 2,2,2 -dm_plex_box_lower 0,0,0 -dm_plex_box_upper 1,1,1 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename richards-driver-snes-prob1-np4 -tdy_final_time 3.1536e1 -tdy_dt_max 600. -tdy_dt_growth_factor 1.0 -tdy_init_with_random_field -tdy_time_integration_method SNES -tdy_dt_init 0.1

+++++++++++++++++ TDycore +++++++++++++++++
Creating TDycore
Beginning Richards Driver simulation.
TDycore setup
Using TDycore backward Euler (and PETSc SNES) for time integration.
Creating Vectors
Creating Jacobian matrix
double free or corruption (out)
[aqua:2483333] *** Process received signal ***
[aqua:2483333] Signal: Aborted (6)
[aqua:2483333] Associated errno: Unknown error 32765 (32765)
[aqua:2483333] Signal code: User function (kill, sigsend, abort, etc.) (0)
[aqua:2483333] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x43090)[0x7fa31f848090]
[aqua:2483333] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb)[0x7fa31f84800b]
[aqua:2483333] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b)[0x7fa31f827859]
[aqua:2483333] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x8d26e)[0x7fa31f89226e]
[aqua:2483333] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x952fc)[0x7fa31f89a2fc]
[aqua:2483333] [ 5] /lib/x86_64-linux-gnu/libc.so.6(+0x96fa0)[0x7fa31f89bfa0]
[aqua:2483333] [ 6] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(PetscFreeAlign+0x35)[0x7fa320407ef4]
[aqua:2483333] [ 7] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(+0x7ec8b7)[0x7fa3207c08b7]
[aqua:2483333] [ 8] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(+0x68923e)[0x7fa32065d23e]
[aqua:2483333] [ 9] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(PetscSFReset+0x2c5)[0x7fa320622c9d]
[aqua:2483333] [10] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(PetscSFDestroy+0x78d)[0x7fa320624d5a]
[aqua:2483333] [11] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(PetscSFCreateRemoteOffsets+0xa8d)[0x7fa320617511]
[aqua:2483333] [12] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(DMPlexPreallocateOperator+0x10ea)[0x7fa3217ad657]
[aqua:2483333] [13] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(+0x16ef826)[0x7fa3216c3826]
[aqua:2483333] [14] /home/gbisht/projects/petsc/petsc_v3.17.3/gcc10-openmpi/lib/libpetsc.so.3.17(DMCreateMatrix+0x56b)[0x7fa32136b2c8]
[aqua:2483333] [15] /home/gbisht/projects/tdycore/tdycore/gcc10-openmpi/lib/libtdycore.so(+0x11fbcb)[0x7fa323234bcb]
[aqua:2483333] [16] /home/gbisht/projects/tdycore/tdycore/gcc10-openmpi/lib/libtdycore.so(TDyCreateJacobian+0x24c)[0x7fa3232324d4]
[aqua:2483333] [17] /home/gbisht/projects/tdycore/tdycore/gcc10-openmpi/lib/libtdycore.so(TDyDriverInitializeTDy+0xb95)[0x7fa323237eec]
[aqua:2483333] [18] ./richards_driver(+0x2aa0)[0x55ec50108aa0]
[aqua:2483333] [19] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7fa31f829083]
[aqua:2483333] [20] ./richards_driver(+0x229e)[0x55ec5010829e]
[aqua:2483333] *** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 2 with PID 0 on node aqua exited on signal 6 (Aborted).
jeff-cohere commented 2 years ago

Here's what I get when I run the Richards test with a debug build of PETSc under Valgrind (I've only included the stack trace from rank 3, since it was the easiest one to pull out). As @bishtgautam says, it looks very much like the TH failure.

Terminal output

$ mpiexec -n 4 valgrind --log-file=poo ./richards_driver -malloc 0 -successful_exit_code 0 -dm_plex_simplex 0 -dm_plex_dim 3 -dm_plex_box_faces 2,2,2 -dm_plex_box_lower 0,0,0 -dm_plex_box_upper 1,1,1 -tdy_water_density exponential -tdy_regression_test -tdy_regression_test_num_cells_per_process 1 -tdy_regression_test_filename richards-driver-snes-prob1-np4 -tdy_final_time 3.1536e1 -tdy_dt_max 600. -tdy_dt_growth_factor 1.0 -tdy_init_with_random_field -tdy_time_integration_method SNES -tdy_dt_init 0.1
+++++++++++++++++ TDycore +++++++++++++++++
Creating TDycore
Beginning Richards Driver simulation.
TDycore setup
Using TDycore backward Euler (and PETSc SNES) for time integration.
Creating Vectors
Creating Jacobian matrix
[3]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[3]PETSC ERROR: Argument out of range
[3]PETSC ERROR: Section point -1 should be in [6, 8)
[3]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[3]PETSC ERROR: Petsc Release Version 3.17.3, unknown 
[3]PETSC ERROR: ./richards_driver on a debug named crunchy by jeff Thu Jun 30 09:40:42 2022
[3]PETSC ERROR: #1 PetscSectionSetDof() at /home/jeff/projects/pnnl/petsc/src/vec/is/section/interface/section.c:801
[3]PETSC ERROR: #2 DMPlexCreateAdjacencySection_Static() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:472
[3]PETSC ERROR: #3 DMPlexPreallocateOperator() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plexpreallocate.c:777
[3]PETSC ERROR: #4 DMCreateMatrix_Plex() at /home/jeff/projects/pnnl/petsc/src/dm/impls/plex/plex.c:2544
[3]PETSC ERROR: #5 DMCreateMatrix() at /home/jeff/projects/pnnl/petsc/src/dm/interface/dm.c:1432
[3]PETSC ERROR: #6 TDyDiscretizationCreateJacobianMatrix() at /home/jeff/projects/pnnl/TDycore/src/tdydiscretization.c:166
[3]PETSC ERROR: #7 TDyCreateJacobian() at /home/jeff/projects/pnnl/TDycore/src/tdycore.c:2032
[3]PETSC ERROR: #8 TDyDriverInitializeTDy() at /home/jeff/projects/pnnl/TDycore/src/tdydriver.c:87
[3]PETSC ERROR: #9 main() at /home/jeff/projects/pnnl/TDycore/demo/richards/richards_driver.c:33
[3]PETSC ERROR: PETSc Option Table entries:
[3]PETSC ERROR:  -dm_plex_box_faces 2,2,2
[3]PETSC ERROR: -dm_plex_box_lower 0,0,0
[3]PETSC ERROR: -dm_plex_box_upper 1,1,1
[3]PETSC ERROR: -dm_plex_dim 3
[3]PETSC ERROR: -dm_plex_simplex 0
[3]PETSC ERROR:  -malloc 0
[3]PETSC ERROR: -successful_exit_code 0
[3]PETSC ERROR: -tdy_dt_growth_factor 1.0
[3]PETSC ERROR: -tdy_dt_init 0.1
[3]PETSC ERROR: -tdy_dt_max 600.
[3]PETSC ERROR: -tdy_final_time 3.1536e1
[3]PETSC ERROR: -tdy_init_with_random_field
[3]PETSC ERROR: -tdy_regression_test
[3]PETSC ERROR: -tdy_regression_test_filename richards-driver-snes-prob1-np4
[3]PETSC ERROR: -tdy_regression_test_num_cells_per_process 1
[3]PETSC ERROR: -tdy_time_integration_method SNES
[3]PETSC ERROR: -tdy_water_density exponential
[3]PETSC ERROR:  ----------------End of Error Message -------send entire error message to petsc-maint@mcs.anl.gov---------

Valgrind report

...
==2251883== Invalid write of size 2
==2251883==    at 0x48529E3: memmove (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2251883==    by 0x76D7D6A: PetscMemcpy (petscsys.h:1634)
==2251883==    by 0x76DA65D: UnpackAndInsert_PetscInt_1_1 (sfpack.c:374)
==2251883==    by 0x784A8F1: PetscSFLinkUnpackLeafData_Private (sfpack.c:1094)
==2251883==    by 0x784B77B: PetscSFLinkUnpackLeafData (sfpack.c:1124)
==2251883==    by 0x76D3D3D: PetscSFBcastEnd_Basic (sfbasic.c:212)
==2251883==    by 0x7690ADA: PetscSFBcastEnd (sf.c:1472)
==2251883==    by 0x76B9B04: PetscSFCreateRemoteOffsets (sfutils.c:334)
==2251883==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==2251883==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==2251883==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==2251883==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscretization.c:166)
==2251883==  Address 0x22a10808 is 8 bytes after a block of size 32 alloc'd
==2251883==    at 0x484E120: memalign (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2251883==    by 0x71417EA: PetscMallocAlign (mal.c:48)
==2251883==    by 0x71438F1: PetscMallocA (mal.c:414)
==2251883==    by 0x76B99BE: PetscSFCreateRemoteOffsets (sfutils.c:332)
==2251883==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==2251883==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==2251883==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==2251883==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscretization.c:166)
==2251883==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==2251883==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==2251883==    by 0x109B04: main (richards_driver.c:33)
==2251883== 
==2251883== Invalid write of size 8
==2251883==    at 0x4852990: memmove (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2251883==    by 0x76D7D6A: PetscMemcpy (petscsys.h:1634)
==2251883==    by 0x76DA65D: UnpackAndInsert_PetscInt_1_1 (sfpack.c:374)
==2251883==    by 0x784A8F1: PetscSFLinkUnpackLeafData_Private (sfpack.c:1094)
==2251883==    by 0x784B77B: PetscSFLinkUnpackLeafData (sfpack.c:1124)
==2251883==    by 0x76D3D3D: PetscSFBcastEnd_Basic (sfbasic.c:212)
==2251883==    by 0x7690ADA: PetscSFBcastEnd (sf.c:1472)
==2251883==    by 0x76B9B04: PetscSFCreateRemoteOffsets (sfutils.c:334)
==2251883==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==2251883==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==2251883==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==2251883==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscretization.c:166)
==2251883==  Address 0x22a10800 is 0 bytes after a block of size 32 alloc'd
==2251883==    at 0x484E120: memalign (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==2251883==    by 0x71417EA: PetscMallocAlign (mal.c:48)
==2251883==    by 0x71438F1: PetscMallocA (mal.c:414)
==2251883==    by 0x76B99BE: PetscSFCreateRemoteOffsets (sfutils.c:332)
==2251883==    by 0x85E2662: DMPlexPreallocateOperator (plexpreallocate.c:753)
==2251883==    by 0x84FB91E: DMCreateMatrix_Plex (plex.c:2544)
==2251883==    by 0x82BB145: DMCreateMatrix (dm.c:1432)
==2251883==    by 0x49798CE: TDyDiscretizationCreateJacobianMatrix (tdydiscretization.c:166)
==2251883==    by 0x4977092: TDyCreateJacobian (tdycore.c:2032)
==2251883==    by 0x497CC0D: TDyDriverInitializeTDy (tdydriver.c:87)
==2251883==    by 0x109B04: main (richards_driver.c:33)
==2251883== 
codecov-commenter commented 2 years ago

Codecov Report

Merging #236 (2efc124) into master (7e85a58) will increase coverage by 5.94%. The diff coverage is 50.00%.

:exclamation: Current head 2efc124 differs from pull request most recent head 5f5416b. Consider uploading reports for the commit 5f5416b to get more accurate results

@@            Coverage Diff             @@
##           master     #236      +/-   ##
==========================================
+ Coverage   51.57%   57.51%   +5.94%     
==========================================
  Files           4        6       +2     
  Lines         764     1125     +361     
==========================================
+ Hits          394      647     +253     
- Misses        370      478     +108     
Impacted Files                               Coverage Δ
demo/transient/transient_snes_mpfaof90.F90    84.52% <50.00%> (ø)
demo/th/th_driver.c                            0.00% <0.00%> (-100.00%) :arrow_down:
demo/richards/richards_driver.c                0.00% <0.00%> (-100.00%) :arrow_down:
demo/transient/transient_mpfaof90.F90        100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Powered by Codecov. Last update 7e85a58...5f5416b.

jeff-cohere commented 2 years ago

Happy Friday, @knepley . I think @bishtgautam and I are in favor of bookmarking the issues that we've discovered and then merging this PR. How would you like to proceed?

If we suspect that the TH and Richards failures stem from some issue in DMPlex, would you prefer that we log this as an issue in PETSc's repo and/or come up with a reproducer for the PETSc team? Or would you prefer to work on this in the context of TDycore with my help?
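In case it's useful, here's an untested sketch of what a minimal standalone reproducer might look like, run on 4 ranks with the same -dm_plex_simplex 0 -dm_plex_dim 3 -dm_plex_box_faces 2,2,2 options as the Richards test. The section and adjacency setup is my guess at the relevant parts of our Jacobian setup (one unknown per cell, adjacency through faces), not a copy of TDycore code, and it assumes PETSc 3.17 or later for PetscCall:

#include <petscdmplex.h>

int main(int argc, char **argv)
{
  DM           dm, dmDist;
  PetscSection sec;
  Mat          J;
  PetscInt     cStart, cEnd, c;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  /* Build the box mesh from -dm_plex_* options, as in the test. */
  PetscCall(DMCreate(PETSC_COMM_WORLD, &dm));
  PetscCall(DMSetType(dm, DMPLEX));
  PetscCall(DMSetFromOptions(dm));
  /* Overlap of 1 is a guess at what the discretization needs. */
  PetscCall(DMPlexDistribute(dm, 1, NULL, &dmDist));
  if (dmDist) {
    PetscCall(DMDestroy(&dm));
    dm = dmDist;
  }
  /* FV-style adjacency: through cones (faces), not closures. */
  PetscCall(DMSetBasicAdjacency(dm, PETSC_TRUE, PETSC_FALSE));
  /* One dof per cell, like a cell-centered discretization. */
  PetscCall(DMPlexGetHeightStratum(dm, 0, &cStart, &cEnd));
  PetscCall(PetscSectionCreate(PetscObjectComm((PetscObject)dm), &sec));
  PetscCall(PetscSectionSetChart(sec, cStart, cEnd));
  for (c = cStart; c < cEnd; ++c) PetscCall(PetscSectionSetDof(sec, c, 1));
  PetscCall(PetscSectionSetUp(sec));
  PetscCall(DMSetLocalSection(dm, sec));
  PetscCall(PetscSectionDestroy(&sec));
  /* This is the call that blows up in the TH and Richards drivers. */
  PetscCall(DMCreateMatrix(dm, &J));
  PetscCall(MatDestroy(&J));
  PetscCall(DMDestroy(&dm));
  PetscCall(PetscFinalize());
  return 0;
}

If this doesn't trip the same PetscSFCreateRemoteOffsets failure, the adjacency and overlap settings are the first things I'd vary.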

knepley commented 2 years ago

> Happy Friday, @knepley . I think @bishtgautam and I are in favor of bookmarking the issues that we've discovered and then merging this PR. How would you like to proceed?
>
> If we suspect that the TH and Richards failures stem from some issue in DMPlex, would you prefer that we log this as an issue in PETSc's repo and/or come up with a reproducer for the PETSc team? Or would you prefer to work on this in the context of TDycore with my help?

My preference would also be to merge this. I think working with you on TDycore would probably be easier and more to the point. Whenever you have some time.

jeff-cohere commented 2 years ago

Sounds good to me. My next week is pretty open. Here's my availability, in case you want to schedule something (all Pacific Time):

I'll create an issue and move all the above debris into it.