TDycores-Project / TDycore

BSD 2-Clause "Simplified" License

richards_driver crashing for parallel run #93

Closed: bishtgautam closed this issue 3 years ago

bishtgautam commented 3 years ago

@jeff-cohere reported the following model failure:

$ mpirun -np 2 richards_driver -dim 3 -Nx 100 -Ny 100 -Nz 10 -tdy_timers -final_time 30
No protocol specified
Beginning Richards Driver simulation.
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR:
[0]PETSC ERROR: DetermineCellsAboveAndBelow: No. of cells above (=2) and below (=1) of the vertex_id 56780 are not same. Such a mesh is unsupported.

[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.12.4-1083-g1a6d72e33c  GIT Date: 2020-03-26 13:14:23 -0500
[0]PETSC ERROR: richards_driver on a debug named crunchy by jeff Wed Oct  7 09:16:40 2020
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --CFLAGS="-g -O0" --CXXFLAGS="-g -O0" --FFLAGS="-g -O0 -Wno-unused-function" --with-clanguage=c --with-debugging=1 --with-shared-libraries=0 --download-hdf5 --download-metis --download-parmetis --download-exodusii --download-netcdf --download-pnetcdf --download-zlib --download-fblaslapack
[0]PETSC ERROR: #1 DetermineCellsAboveAndBelow() line 1783 in /home/jeff/projects/pnnl/TDycore/src/mesh/tdycoremesh.c
[0]PETSC ERROR: #2 FindCellsAboveAndBelowAVertex() line 2523 in /home/jeff/projects/pnnl/TDycore/src/mesh/tdycoremesh.c
[0]PETSC ERROR: #3 FindCellsAboveAndBelowVertices() line 2615 in /home/jeff/projects/pnnl/TDycore/src/mesh/tdycoremesh.c
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[0]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[0]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[0]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[0]PETSC ERROR: likely location of problem given in stack below
[0]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[0]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[0]PETSC ERROR:       INSTEAD the line number of the start of the function
[0]PETSC ERROR:       is given.
[0]PETSC ERROR: [0] ComputeTransmissibilityMatrix_ForNonCornerVertex line 425 /home/jeff/projects/pnnl/TDycore/src/mpfao/3D/tdympfao3D_core.c
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Signal received
[0]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[0]PETSC ERROR: Petsc Development GIT revision: v3.12.4-1083-g1a6d72e33c  GIT Date: 2020-03-26 13:14:23 -0500
[0]PETSC ERROR: richards_driver on a debug named crunchy by jeff Wed Oct  7 09:16:40 2020
[0]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --CFLAGS="-g -O0" --CXXFLAGS="-g -O0" --FFLAGS="-g -O0 -Wno-unused-function" --with-clanguage=c --with-debugging=1 --with-shared-libraries=0 --download-hdf5 --download-metis --download-parmetis --download-exodusii --download-netcdf --download-pnetcdf --download-zlib --download-fblaslapack
[0]PETSC ERROR: #4 User provided function() line 0 in  unknown file
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 50152059.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[1]PETSC ERROR: ------------------------------------------------------------------------
[1]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end
[1]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[1]PETSC ERROR: or see https://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[1]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[1]PETSC ERROR: likely location of problem given in stack below
[1]PETSC ERROR: ---------------------  Stack Frames ------------------------------------
[1]PETSC ERROR: Note: The EXACT line numbers in the stack are not available,
[1]PETSC ERROR:       INSTEAD the line number of the start of the function
[1]PETSC ERROR:       is given.
[1]PETSC ERROR: [1] MatAssemblyEnd_SeqAIJ line 1049 /home/jeff/projects/pnnl/petsc/src/mat/impls/aij/seq/aij.c
[1]PETSC ERROR: [1] MatAssemblyEnd line 5335 /home/jeff/projects/pnnl/petsc/src/mat/interface/matrix.c
[1]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[1]PETSC ERROR: Signal received
[1]PETSC ERROR: See https://www.mcs.anl.gov/petsc/documentation/faq.html for trouble shooting.
[1]PETSC ERROR: Petsc Development GIT revision: v3.12.4-1083-g1a6d72e33c  GIT Date: 2020-03-26 13:14:23 -0500
[1]PETSC ERROR: richards_driver on a debug named crunchy by jeff Wed Oct  7 09:16:40 2020
[1]PETSC ERROR: Configure options --with-cc=mpicc --with-cxx=mpicxx --with-fc=mpif90 --CFLAGS="-g -O0" --CXXFLAGS="-g -O0" --FFLAGS="-g -O0 -Wno-unused-function" --with-clanguage=c --with-debugging=1 --with-shared-libraries=0 --download-hdf5 --download-metis --download-parmetis --download-exodusii --download-netcdf --download-pnetcdf --download-zlib --download-fblaslapack
[1]PETSC ERROR: #1 User provided function() line 0 in  unknown file
[crunchy:30100] 1 more process has sent help message help-mpi-api.txt / mpi-abort
[crunchy:30100] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
bishtgautam commented 3 years ago

The error can be reproduced on a smaller domain via:

nx=2;ny=5;nz=10; mpiexec -n 2 \
./richards_driver -dim 3 -Nx $nx -Ny $ny -Nz $nz -tdy_water_density exponential -final_time 1.e0
bishtgautam commented 3 years ago

The error is because DMPlex is using a star stencil, but we need to use a box stencil.

@knepley How can I tell DMPlex to use a box stencil instead of a star stencil? Here is the code that I'm using to set 1 DOF at cell centers:

  ierr = PetscSectionCreate(comm, &sec); CHKERRQ(ierr);
  ierr = PetscSectionSetNumFields(sec, 1); CHKERRQ(ierr);
  ierr = PetscSectionSetFieldName(sec, 0, "LiquidPressure"); CHKERRQ(ierr);
  ierr = PetscSectionSetFieldComponents(sec, 0, 1); CHKERRQ(ierr);

  ierr = DMPlexGetHeightStratum(dm,0,&pStart,&pEnd); CHKERRQ(ierr);
  ierr = PetscSectionSetChart(sec,pStart,pEnd); CHKERRQ(ierr);
  for(p=pStart; p<pEnd; p++) {
    ierr = PetscSectionSetFieldDof(sec,p,0,1); CHKERRQ(ierr);
    ierr = PetscSectionSetDof(sec,p,1); CHKERRQ(ierr);
  }
  ierr = PetscSectionSetUp(sec); CHKERRQ(ierr);
  ierr = DMSetSection(dm,sec); CHKERRQ(ierr);
  ierr = PetscSectionViewFromOptions(sec, NULL, "-layout_view"); CHKERRQ(ierr);
  ierr = PetscSectionDestroy(&sec); CHKERRQ(ierr);
  ierr = DMSetBasicAdjacency(dm,PETSC_TRUE,PETSC_TRUE); CHKERRQ(ierr);
knepley commented 3 years ago

On Thu, Oct 8, 2020 at 12:46 PM Gautam Bisht wrote:

The error is because DMPlex is using a star stencil, but we need to use a box stencil.

DMSetBasicAdjacency(dm, PETSC_TRUE, PETSC_TRUE) is a box stencil. A star would be (PETSC_TRUE, PETSC_FALSE).

https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/DM/DMSetBasicAdjacency.html

You can see the sparsity pattern using

-mat_view draw -draw_pause -1

Thanks,

 Matt
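
To make the star/box distinction concrete, here is a minimal sketch in plain C (no PETSc) that counts how many cells couple to a given cell on a structured 3D grid. The function name `stencil_size` and its arguments are hypothetical and exist only for this illustration. A star stencil is taken to mean the cell plus its face neighbors, and a box stencil the cell plus every cell sharing at least one vertex with it.

```c
#include <assert.h>
#include <stdlib.h>

/* Count the cells coupled to cell (i,j,k) on an nx x ny x nz structured grid.
 * box == 0: star stencil, i.e. the cell itself plus its face neighbors.
 * box != 0: box stencil, i.e. the cell itself plus every cell that shares
 *           at least one vertex with it.
 * Hypothetical helper for illustration only; not part of PETSc or TDycore. */
int stencil_size(int nx, int ny, int nz, int i, int j, int k, int box)
{
  assert(nx > 0 && ny > 0 && nz > 0);
  int count = 0;
  for (int dk = -1; dk <= 1; dk++) {
    for (int dj = -1; dj <= 1; dj++) {
      for (int di = -1; di <= 1; di++) {
        int ii = i + di, jj = j + dj, kk = k + dk;
        /* Skip offsets that fall outside the grid. */
        if (ii < 0 || ii >= nx || jj < 0 || jj >= ny || kk < 0 || kk >= nz) continue;
        /* Number of coordinate directions in which the neighbor is offset. */
        int offsets = abs(di) + abs(dj) + abs(dk);
        /* Star: drop edge and corner neighbors (offset in 2 or 3 directions). */
        if (!box && offsets > 1) continue;
        count++;
      }
    }
  }
  return count;
}
```

For an interior cell this gives 7 couplings for the star stencil and 27 for the box stencil, which is the difference one would expect to see in the row lengths of the matrix drawn by -mat_view draw.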

bishtgautam commented 3 years ago

Thanks @knepley. You were correct that the existing stencil was a box stencil. The actual cause of the error was that the code wasn't skipping non-local vertices.
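
The kind of check described above can be sketched in plain C as follows. This is an illustration under stated assumptions, not TDycore's actual code: the `VertexInfo` struct, its fields, and `first_unbalanced_vertex` are hypothetical names. One plausible reading of the failure is that, in a parallel run, a vertex not owned by the current rank may see only some of its attached cells, so its above/below counts can disagree (as in the "above (=2) and below (=1)" message) even though the global mesh is fine.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-vertex record; TDycore's real mesh structures differ. */
typedef struct {
  int is_local;        /* nonzero if this rank owns the vertex */
  int num_cells_above; /* cells attached above the vertex, as seen on this rank */
  int num_cells_below; /* cells attached below the vertex, as seen on this rank */
} VertexInfo;

/* Return the index of the first vertex whose above/below cell counts differ,
 * or -1 if there is none. When skip_non_local is nonzero, vertices that are
 * not owned by this rank are ignored, since their cell lists may be partial. */
int first_unbalanced_vertex(const VertexInfo *v, size_t n, int skip_non_local)
{
  assert(v != NULL || n == 0);
  for (size_t i = 0; i < n; i++) {
    if (skip_non_local && !v[i].is_local) continue;
    if (v[i].num_cells_above != v[i].num_cells_below) return (int)i;
  }
  return -1;
}
```

With a non-local vertex that has 2 cells above and 1 below, the check flags that vertex when skip_non_local is 0 and reports no problem when it is nonzero.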