TEAR-ERC / tandem

A HPC DG method for 2D and 3D SEAS problems
BSD 3-Clause "New" or "Revised" License

F(a) and F(b) must have different sign on first time step of BP5 #74

Closed Thomas-Ulrich closed 1 month ago

Thomas-Ulrich commented 3 months ago

Describe the bug

Using tandem (p2) with the BP5 example from the repository, I get an error at the first time step:

di73yeq4@login03:/hppfs/work/pn49ha/di73yeq4/tandem/examples/tandem/3d> head 3451988.tandem.out -n 200
num_nodes: 4 ntasks: 192

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version ee87ac9

                        stack size limit = 2048 MiB

                              Worker affinity
    0---------|----------|----------|----------|--------8-|----------|
    ----------|----------|----------|------

Multigrid P-levels: 1 2 
TS ts_checkpoint.storage_type limited
TS ts_checkpoint.save_directory checkpoint
TS ts_checkpoint.freq_step 1000
TS ts_checkpoint.freq_cputime 3.0000e+01
TS ts_checkpoint.freq_physical_time 1.0000e+10
TS ts_checkpoint.storage_limited_size 2
[checkpoint] directory created
DOFs (domain): 1891590
DOFs (fault): 167796
Mesh size: 71.6532
sigma_n = 11.0811
|tau| = 13525.3
psi = -0.220103
L = 0
U = 2924.74
F(L) = 13525.3
sigma_n = 196.612
|tau| = 26418.9
psi = -0.993655
L = 0
U = 5712.89
F(L) = 26418.9
F(U) = 1.61031e-12
sigma_n = 54.621
|tau| = 105097
psi = -6.47109
L = 0
U = 22726.5
F(L) = 105097
F(U) = 5.31919e-12
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 41.6383
|tau| = 13866.2
psi = -0.204948
L = 0
U = 2998.47
F(L) = 13866.2
F(U) = 7.89669e-14
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 19.8669
|tau| = 14586.5
psi = -0.25234
L = 0
U = 3154.22
F(L) = 14586.5
F(U) = 6.96785e-13
  what():  F(a) and F(b) must have different sign.
F(U) = 8.03797e-13
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 58.7748
|tau| = 16364
psi = -0.525257
L = 0
U = 3538.6
F(L) = 16364
F(U) = 7.50726e-13
terminate called after throwing an instance of 'std::logic_error'
sigma_n = 51.8792
|tau| = 15802.3
psi = -0.306186
L = 0
U = 3417.13
F(L) = 15802.3
F(U) = 1.0331e-12
  what():  F(a) and F(b) must have different sign.
terminate called after throwing an instance of 'std::logic_error'
  what():  F(a) and F(b) must have different sign.
terminate called after throwing an instance of 'std::logic_error'
  what():  F(a) and F(b) must have different sign.
terminate called after throwing an instance of 'std::logic_error'
  what():  F(a) and F(b) must have different sign.
  what():  F(a) and F(b) must have different sign.
  what():  F(a) and F(b) must have different sign.
srun: error: i01r01c05s07: task 134: Aborted (core dumped)
srun: launch/slurm: _step_signal: Terminating StepId=3451988.0
slurmstepd: error: *** STEP 3451988.0 ON i01r01c05s05 CANCELLED AT 2024-07-17T11:37:51 ***
[148]PETSC ERROR: ------------------------------------------------------------------------
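
In each block of the log above, F(L) and F(U) are both positive (for example F(L) = 13525.3 and F(U) = 1.61031e-12), so the interval [L, U] does not bracket a root, which matches the exception text. The following is a minimal sketch of such a bracket check, not tandem's actual implementation: `has_sign_change` and `require_bracket` are hypothetical names, and only the wording of the error message is taken from the log.

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical helper, not part of tandem: true if the two endpoint values
// bracket a root, i.e. one of them is exactly zero or they differ in sign.
inline bool has_sign_change(double Fa, double Fb) {
    if (std::isnan(Fa) || std::isnan(Fb)) {
        return false; // NaN endpoints can never form a valid bracket
    }
    if (Fa == 0.0 || Fb == 0.0) {
        return true; // an endpoint is already a root
    }
    return (Fa > 0.0) != (Fb > 0.0);
}

// Hypothetical wrapper: evaluate F at both endpoints and, if they do not
// bracket a root, throw an exception whose message carries the endpoint
// values, so the diagnostics travel with the error.
inline void require_bracket(std::function<double(double)> const& F, double a, double b) {
    assert(a < b);
    double const Fa = F(a);
    double const Fb = F(b);
    if (!has_sign_change(Fa, Fb)) {
        std::ostringstream msg;
        msg << "F(a) and F(b) must have different sign: a = " << a
            << ", b = " << b << ", F(a) = " << Fa << ", F(b) = " << Fb;
        throw std::logic_error(msg.str());
    }
}
```

With the values printed in the log, `has_sign_change(13525.3, 1.61031e-12)` returns false because both values are positive, which is the condition that leads to the exception.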

Expected behavior

No error.

To Reproduce

Steps to reproduce the behavior:

I'm running BP5.toml based on this branch https://github.com/TEAR-ERC/tandem/pull/72 (at commit ee87ac9) which is a few commits on top of https://github.com/TEAR-ERC/tandem/pull/59

tandem was installed with spack on SuperMUC-NG using:

spack install -j 30 tandem@tscp polynomial_degree=2 domain_dimension=3

Here is a list of the dependencies of tandem and their specs:

di73yeq4@login03:/hppfs/work/pn49ha/di73yeq4/tandem/examples/tandem/3d> spack spec -I  tandem@tscp polynomial_degree=2 domain_dimension=3

Input spec
--------------------------------
 -   tandem@tscp domain_dimension=3 polynomial_degree=2

Concretized
--------------------------------
 -   tandem@tscp%gcc@12.2.0~cuda~ipo~libxsmm~python~rocm build_system=cmake build_type=Release domain_dimension=3 generator=make min_quadrature_order=0 polynomial_degree=2 arch=linux-sles15-skylake_avx512
[^]      ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]          ^ncurses@6.4%gcc@12.2.0~symlinks+termlib abi=none build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^pkgconf@1.8.0%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]              ^ca-certificates-mozilla@2023-01-10%gcc@12.2.0 build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                  ^berkeley-db@18.1.40%gcc@12.2.0+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc arch=linux-sles15-skylake_avx512
[^]      ^eigen@3.4.0%gcc@12.2.0~ipo build_system=cmake build_type=RelWithDebInfo generator=make arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                      ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^intel-oneapi-mpi@2021.9.0%gcc@12.2.0+envmods~external-libfabric~generic-names~ilp64 build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^lua@5.4.4%gcc@12.2.0~pcfile+shared build_system=makefile fetcher=curl arch=linux-sles15-skylake_avx512
[^]          ^curl@8.0.1%gcc@12.2.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-sles15-skylake_avx512
[^]          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^unzip@6.0%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]      ^metis@5.1.0%gcc@12.2.0~gdb+int64~ipo~real64+shared build_system=cmake build_type=Release generator=make patches=4991da9,93a7903,b1225da arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^parmetis@4.0.3%gcc@12.2.0~gdb+int64~ipo+shared build_system=cmake build_type=Release generator=make patches=4f89253,50ed208,704b84f arch=linux-sles15-skylake_avx512
[+]      ^petsc@3.20.1%gcc@12.2.0~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre+int64~jpeg+knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi+mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws+scalapack+shared~strumpack~suite-sparse+superlu-dist~sycl~tetgen~trilinos~valgrind build_system=generic clanguage=C memalign=32 arch=linux-sles15-skylake_avx512
[^]          ^diffutils@3.9%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libiconv@1.17%gcc@12.2.0 build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[^]          ^hdf5@1.10.9%gcc@12.2.0+cxx+fortran+hl~ipo~java+mpi+shared+szip+threadsafe+tools api=default build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]              ^libaec@1.0.6%gcc@12.2.0~ipo+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[+]          ^hypre@develop%gcc@12.2.0~caliper~complex~cuda~debug+fortran~gptune+int64~internal-superlu~magma~mixedint+mpi~openmp~rocm+shared~superlu-dist~sycl~umpire~unified-memory build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^intel-oneapi-mkl@2023.1.0%gcc@12.2.0+cluster+envmods~ilp64+shared build_system=generic threads=none arch=linux-sles15-skylake_avx512
[^]              ^intel-oneapi-tbb@2021.9.0%gcc@12.2.0+envmods build_system=generic arch=linux-sles15-skylake_avx512
[+]          ^mumps@5.5.1%gcc@12.2.0~blr_mt+complex+double+float~incfort~int64+metis+mpi~openmp+parmetis~ptscotch~scotch+shared build_system=generic patches=373d736 arch=linux-sles15-skylake_avx512
[^]          ^python@3.10.10%gcc@12.2.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060 arch=linux-sles15-skylake_avx512
[^]              ^bzip2@1.0.8%gcc@12.2.0~debug~pic+shared build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^expat@2.5.0%gcc@12.2.0+libbsd build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libbsd@0.11.7%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                      ^libmd@1.0.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gettext@0.21.1%gcc@12.2.0+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libxml2@2.10.3%gcc@12.2.0~python build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^tar@1.30%gcc@12.2.0 build_system=autotools zip=pigz arch=linux-sles15-skylake_avx512
[^]                      ^pigz@2.7%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]              ^libffi@3.4.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libxcrypt@4.4.33%gcc@12.2.0~obsolete_api build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^sqlite@3.40.1%gcc@12.2.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^util-linux-uuid@2.38.1%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^xz@5.4.1%gcc@12.2.0~pic build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[+]          ^superlu-dist@develop%gcc@12.2.0~cuda+int64~ipo~openmp+parmetis~rocm+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]      ^zlib@1.2.13%gcc@12.2.0+optimize+pic+shared build_system=makefile arch=linux-sles15-skylake_avx512

launched with:

#!/bin/bash
# Job Name and Files (also --job-name)
#SBATCH -J tandem
#Output and error (also --output, --error):
#SBATCH -o ./%j.%x.out
#SBATCH -e ./%j.%x.out

#Initial working directory:
#SBATCH --chdir=./

#Notification and type
#SBATCH --mail-type=END
#SBATCH --mail-user=thomas.ulrich@lmu.de
#SBATCH --no-requeue

#Setup of execution environment
#SBATCH --export=ALL
#SBATCH --account=pn49ha

#SBATCH --ntasks-per-node=48
#SBATCH --cpus-per-task=1
#EAR may impact code performance
#SBATCH --ear=off

##SBATCH --nodes=20 --partition=general --time=00:35:00
#SBATCH --nodes=4 --partition=test --time=00:30:00 
#--exclude="i01r01c[01-02]s[01-12]"

module load slurm_setup

export MP_SINGLE_THREAD=yes
export OMP_NUM_THREADS=1
export MP_TASK_AFFINITY=core:$OMP_NUM_THREADS

echo 'num_nodes:' $SLURM_JOB_NUM_NODES 'ntasks:' $SLURM_NTASKS
ulimit -Ss 2097152

srun tandem bp5.toml --mg_strategy twolevel --mg_coarse_level 1 --petsc -ksp_max_it 400 -pc_type mg -mg_levels_ksp_max_it 4 -mg_levels_ksp_type cg -mg_levels_pc_type bjacobi -ksp_rtol 1.0e-6 -mg_coarse_pc_type gamg -mg_coarse_ksp_type cg -mg_coarse_ksp_rtol 1.0e-1 -ksp_type gcr -log_view

Thomas-Ulrich commented 2 months ago

I've added some additional error logging:

diff --git a/app/localoperator/DieterichRuinaAgeing.h b/app/localoperator/DieterichRuinaAgeing.h
index 5d4b5b6..019edf0 100644
--- a/app/localoperator/DieterichRuinaAgeing.h
+++ b/app/localoperator/DieterichRuinaAgeing.h
@@ -106,7 +106,11 @@ public:
                     V = zeroIn(a, b, fF);
                 } catch (std::exception const&) {
                     std::cout << "sigma_n = " << snAbs << std::endl
+                              << "-sn = " << -sn << std::endl
+                              << "SnPre = " << p_[index].get<SnPre>() << std::endl
                               << "|tau| = " << tauAbs << std::endl
+                              << "|tau_inc| = " << norm(tau) << std::endl
+                              << "|TauPre| = " << norm(p_[index].get<TauPre>()) << std::endl
                               << "psi = " << psi << std::endl
                               << "L = " << a << std::endl
                               << "U = " << b << std::endl

The extra output suggests that tau_ini is probably correct:

sigma_n = 28.5945
-sn = 3.59447
SnPre = 25
|tau| = 7012.44
|tau_inc| = 6991.29
|TauPre| = 21.1481
psi = -0.790723
L = 0
sigma_n = 80.8889
-sn = 55.8889
SnPre = 25
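
As a quick consistency check on the numbers printed above (plain arithmetic on the logged values; the symbol names are shorthand for the log labels, and the interpretation is an inference rather than something stated in the thread):

```latex
\begin{align*}
  -s_n + \mathrm{SnPre} &= 3.59447 + 25 = 28.59447 \approx 28.5945 = \sigma_n, \\
  -s_n + \mathrm{SnPre} &= 55.8889 + 25 = 80.8889 = \sigma_n, \\
  |\tau_{\mathrm{inc}}| + |\mathrm{TauPre}| &= 6991.29 + 21.1481 = 7012.4381 \approx 7012.44 = |\tau|.
\end{align*}
```

The first two lines agree with sigma_n being the sum of -sn and SnPre. The third would be consistent with the incremental and prescribed tractions being nearly parallel, since only then do the norms add; in this sample the incremental part, not TauPre, accounts for almost all of |tau|.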

I also tested v1.0 and see the same issue (with both p1 and p2). I also tested Nico's setup.

Thomas-Ulrich commented 1 month ago

The cause was that I was not setting the PETSc parameters for the TS file. Maybe we could catch this missing parameter in the future.