TEAR-ERC / tandem

A HPC DG method for 2D and 3D SEAS problems
BSD 3-Clause "New" or "Revised" License

Slow Green function checkpointing on large setups risks unusable gf file #73

Open Thomas-Ulrich opened 4 months ago

Thomas-Ulrich commented 4 months ago

Describe the bug: I'm running BP5.toml on this branch, https://github.com/TEAR-ERC/tandem/pull/72 (at commit ee87ac9), which is a few commits on top of https://github.com/TEAR-ERC/tandem/pull/59.

I changed res_f to 5 to get a very small test mesh. In BP5.toml, I added:

[gf_checkpoint]
prefix = "GreensFunctions/bp6_hf250"
freq_cputime = 0.01

This way a checkpoint is written after every newly computed Green's function. Generally it works, but several times the run was not able to restart from the checkpoint. For example, here a job was killed during GF generation:

num_nodes: 1 ntasks: 48

               ___          ___         _____         ___          ___
      ___     /  /\        /__/\       /  /::\       /  /\        /__/\
     /  /\   /  /::\       \  \:\     /  /:/\:\     /  /:/_      |  |::\
    /  /:/  /  /:/\:\       \  \:\   /  /:/  \:\   /  /:/ /\     |  |:|:\
   /  /:/  /  /:/~/::\  _____\__\:\ /__/:/ \__\:| /  /:/ /:/_  __|__|:|\:\
  /  /::\ /__/:/ /:/\:\/__/::::::::\\  \:\ /  /://__/:/ /:/ /\/__/::::| \:\
 /__/:/\:\\  \:\/:/__\/\  \:\~~\~~\/ \  \:\  /:/ \  \:\/:/ /:/\  \:\~~\__\/
 \__\/  \:\\  \::/      \  \:\  ~~~   \  \:\/:/   \  \::/ /:/  \  \:\
      \  \:\\  \:\       \  \:\        \  \::/     \  \:\/:/    \  \:\
       \__\/ \  \:\       \  \:\        \__\/       \  \::/      \  \:\
              \__\/        \__\/                     \__\/        \__\/

                          tandem version ee87ac9

                        stack size limit = 2048 MiB

                              Worker affinity
    0---------|----------|----------|----------|--------8-|----------|
    ----------|----------|----------|------

Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 8856 x 5904
GF loaded was created with commsize matching current (48).
load_discrete_greens_operator() 1.95e+00 (sec)
  status: loaded 7 / pending 5897
partial_assemble_discrete_greens_function() [7 , 5904)
Computing Green's function 7 / 5904
write_discrete_greens_operator():matrix 3.47e+00 (sec)
  status: computed 8 / pending 5896
write_discrete_greens_operator():facets 8.07e-03 (sec)
Computing Green's function 8 / 5904
write_discrete_greens_operator():matrix 3.39e+00 (sec)
  status: computed 9 / pending 5895
write_discrete_greens_operator():facets 6.91e-03 (sec)
Computing Green's function 9 / 5904
slurmstepd: error: *** STEP 3451849.0 ON i01r01c04s04 CANCELLED AT 2024-07-17T10:05:43 ***
[0]PETSC ERROR: ------------------------------------------------------------------------
[0]PETSC ERROR: Caught signal number 15 Terminate: Some process (or the batch system) has told this process to end

The next job then fails:

num_nodes: 1 ntasks: 48


Multigrid P-levels: 1 2 
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 8856 x 5904
GF loaded was created with commsize matching current (48).
[0]PETSC ERROR: --------------------- Error Message --------------------------------------------------------------
[0]PETSC ERROR: Read from file failed
[0]PETSC ERROR: Read past end of file
[0]PETSC ERROR: WARNING! There are unused option(s) set! Could be the program crashed before usage or a spelling mistake, etc!
[0]PETSC ERROR:   Option left: name:-mg_coarse_ksp_rtol value: 1.0e-1 source: command line
[0]PETSC ERROR:   Option left: name:-mg_coarse_ksp_type value: cg source: command line
[0]PETSC ERROR:   Option left: name:-mg_coarse_pc_type value: gamg source: command line
[0]PETSC ERROR:   Option left: name:-mg_levels_ksp_max_it value: 4 source: command line
[0]PETSC ERROR:   Option left: name:-mg_levels_ksp_type value: cg source: command line
[0]PETSC ERROR:   Option left: name:-mg_levels_pc_type value: bjacobi source: command line
[0]PETSC ERROR: See https://petsc.org/release/faq/ for trouble shooting.
[0]PETSC ERROR: Petsc Release Version 3.20.1, Oct 31, 2023 
[0]PETSC ERROR: --petsc on a  named i01r01c04s04 by di73yeq4 Wed Jul 17 10:07:21 2024
[0]PETSC ERROR: Configure options --prefix=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/petsc/3.20.1-gcc-12.2.0-vlbrevt --with-ssl=0 --download-c2html=0 --download-sowing=0 --download-hwloc=0 --with-make-exec=make --with-cc=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mpi/2021.9.0-gcc-xizuusf/mpi/2021.9.0/bin/mpiicc --with-cxx=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mpi/2021.9.0-gcc-xizuusf/mpi/2021.9.0/bin/mpiicpc --with-fc=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mpi/2021.9.0-gcc-xizuusf/mpi/2021.9.0/bin/mpiifort --with-precision=double --with-scalar-type=real --with-shared-libraries=1 --with-debugging=0 --with-openmp=0 --with-64-bit-indices=1 --with-blaslapack-lib="/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_cdft_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_intel_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_sequential.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so /usr/lib64/libpthread.so /usr/lib64/libm.so /usr/lib64/libdl.so" --with-avx-512-kernels --with-memalign=64 --with-x=0 --with-sycl=0 --with-clanguage=C --with-cuda=0 --with-hip=0 --with-metis=1 --with-metis-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/metis/5.1.0-gcc-kougmmh/include 
--with-metis-lib=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/metis/5.1.0-gcc-kougmmh/lib/libmetis.so --with-hypre=1 --with-hypre-include=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/hypre/develop-gcc-12.2.0-ngxzdup/include --with-hypre-lib=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/hypre/develop-gcc-12.2.0-ngxzdup/lib/libHYPRE.so --with-parmetis=1 --with-parmetis-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/parmetis/4.0.3-gcc-nypuwzn/include --with-parmetis-lib=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/parmetis/4.0.3-gcc-nypuwzn/lib/libparmetis.so --with-kokkos=0 --with-kokkos-kernels=0 --with-superlu_dist=1 --with-superlu_dist-include=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/superlu-dist/develop-gcc-12.2.0-z2v2xhr/include --with-superlu_dist-lib=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/superlu-dist/develop-gcc-12.2.0-z2v2xhr/lib/libsuperlu_dist.so --with-ptscotch=0 --with-suitesparse=0 --with-hdf5=1 --with-hdf5-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/include --with-hdf5-lib="/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_hl_fortran.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_hl_f90cstub.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_hl.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_fortran.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5_f90cstub.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/hdf5/1.10.9-gcc-hbsptk3/lib/libhdf5.so" --with-zlib=1 --with-zlib-include=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/zlib/1.2.13-gcc-p5ywc53/include 
--with-zlib-lib=/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/zlib/1.2.13-gcc-p5ywc53/lib/libz.so --with-mumps=1 --with-mumps-include=/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/include --with-mumps-lib="/hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libdmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libzmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libsmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libcmumps.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libmumps_common.so /hppfs/work/pn49ha/di73yeq4/user_spack23.1/linux-sles15-skylake_avx512/mumps/5.5.1-gcc-12.2.0-g6h6l34/lib/libpord.so" --with-trilinos=0 --with-fftw=0 --with-valgrind=0 --with-gmp=0 --with-libpng=0 --with-giflib=0 --with-mpfr=0 --with-netcdf=0 --with-pnetcdf=0 --with-moab=0 --with-random123=0 --with-exodusii=0 --with-cgns=0 --with-memkind=0 --with-p4est=0 --with-saws=0 --with-yaml=0 --with-hwloc=0 --with-libjpeg=0 --with-scalapack=1 --with-scalapack-lib="/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_scalapack_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_cdft_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_intel_lp64.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_sequential.so 
/dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_core.so /dss/lrzsys/sys/spack/release/23.1.0/opt/skylake_avx512/intel-oneapi-mkl/2023.1.0-gcc-3x7vfpz/mkl/2023.1.0/lib/intel64/libmkl_blacs_intelmpi_lp64.so /usr/lib64/libpthread.so /usr/lib64/libm.so /usr/lib64/libdl.so" --with-strumpack=0 --with-mmg=0 --with-parmmg=0 --with-tetgen=0
[0]PETSC ERROR: #1 PetscBinaryRead() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/sys/fileio/sysio.c:327
[0]PETSC ERROR: #2 PetscViewerBinaryWriteReadAll() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/sys/classes/viewer/impls/binary/binv.c:1076
[0]PETSC ERROR: #3 PetscViewerBinaryReadAll() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/sys/classes/viewer/impls/binary/binv.c:1118
[0]PETSC ERROR: #4 MatLoad_Dense_Binary() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/mat/impls/dense/seq/dense.c:1408
[0]PETSC ERROR: #5 MatLoad_MPIDense() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/mat/impls/dense/mpi/mpidense.c:1900
[0]PETSC ERROR: #6 MatLoad() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-petsc-3.20.1-vlbrevtepleszjprszhrmwuv5l6azakr/spack-src/src/mat/interface/matrix.c:1339
[0]PETSC ERROR: #7 load_discrete_greens_operator() at /hppfs/scratch/0A/di73yeq4/tmp/build_stage/spack-stage-tandem-tscp-omrqpkb5k5ca6s67eap67wcvpa5xijea/spack-src/app/form/SeasQDDiscreteGreenOperator.cpp:512
terminate called after throwing an instance of 'tndm::petsc_error'

I noticed similar issues on kernelpanic.

Expected behavior: the Green's function generation should have resumed from the checkpoint.

To Reproduce: tandem was installed with spack on SuperMUC-NG with: spack install -j 30 tandem@tscp polynomial_degree=2 domain_dimension=3

Here is the list of tandem's dependencies and their specs:

di73yeq4@login03:/hppfs/work/pn49ha/di73yeq4/tandem/examples/tandem/3d> spack spec -I  tandem@tscp polynomial_degree=2 domain_dimension=3

Input spec
--------------------------------
 -   tandem@tscp domain_dimension=3 polynomial_degree=2

Concretized
--------------------------------
 -   tandem@tscp%gcc@12.2.0~cuda~ipo~libxsmm~python~rocm build_system=cmake build_type=Release domain_dimension=3 generator=make min_quadrature_order=0 polynomial_degree=2 arch=linux-sles15-skylake_avx512
[^]      ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]          ^ncurses@6.4%gcc@12.2.0~symlinks+termlib abi=none build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^pkgconf@1.8.0%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]              ^ca-certificates-mozilla@2023-01-10%gcc@12.2.0 build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                  ^berkeley-db@18.1.40%gcc@12.2.0+cxx~docs+stl build_system=autotools patches=26090f4,b231fcc arch=linux-sles15-skylake_avx512
[^]      ^eigen@3.4.0%gcc@12.2.0~ipo build_system=cmake build_type=RelWithDebInfo generator=make arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]                      ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^gmake@4.4.1%gcc@12.2.0~guile build_system=autotools arch=linux-sles15-skylake_avx512
[^]      ^intel-oneapi-mpi@2021.9.0%gcc@12.2.0+envmods~external-libfabric~generic-names~ilp64 build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^lua@5.4.4%gcc@12.2.0~pcfile+shared build_system=makefile fetcher=curl arch=linux-sles15-skylake_avx512
[^]          ^curl@8.0.1%gcc@12.2.0~gssapi~ldap~libidn2~librtmp~libssh~libssh2~nghttp2 build_system=autotools libs=shared,static tls=openssl arch=linux-sles15-skylake_avx512
[^]          ^readline@8.2%gcc@12.2.0 build_system=autotools patches=bbf97f1 arch=linux-sles15-skylake_avx512
[^]          ^unzip@6.0%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]      ^metis@5.1.0%gcc@12.2.0~gdb+int64~ipo~real64+shared build_system=cmake build_type=Release generator=make patches=4991da9,93a7903,b1225da arch=linux-sles15-skylake_avx512
[^]          ^cmake@3.26.3%gcc@12.2.0~doc+ncurses+ownlibs~qt build_system=generic build_type=Release arch=linux-sles15-skylake_avx512
[^]              ^openssl@1.1.1t%gcc@12.2.0~docs~shared build_system=generic certs=mozilla arch=linux-sles15-skylake_avx512
[^]                  ^perl@5.36.0%gcc@12.2.0+cpanm+open+shared+threads build_system=generic arch=linux-sles15-skylake_avx512
[^]      ^parmetis@4.0.3%gcc@12.2.0~gdb+int64~ipo+shared build_system=cmake build_type=Release generator=make patches=4f89253,50ed208,704b84f arch=linux-sles15-skylake_avx512
[+]      ^petsc@3.20.1%gcc@12.2.0~X~batch~cgns~complex~cuda~debug+double~exodusii~fftw+fortran~giflib+hdf5~hpddm~hwloc+hypre+int64~jpeg+knl~kokkos~libpng~libyaml~memkind+metis~mkl-pardiso~mmg~moab~mpfr+mpi+mumps~openmp~p4est~parmmg~ptscotch~random123~rocm~saws+scalapack+shared~strumpack~suite-sparse+superlu-dist~sycl~tetgen~trilinos~valgrind build_system=generic clanguage=C memalign=32 arch=linux-sles15-skylake_avx512
[^]          ^diffutils@3.9%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libiconv@1.17%gcc@12.2.0 build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[^]          ^hdf5@1.10.9%gcc@12.2.0+cxx+fortran+hl~ipo~java+mpi+shared+szip+threadsafe+tools api=default build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]              ^libaec@1.0.6%gcc@12.2.0~ipo+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[+]          ^hypre@develop%gcc@12.2.0~caliper~complex~cuda~debug+fortran~gptune+int64~internal-superlu~magma~mixedint+mpi~openmp~rocm+shared~superlu-dist~sycl~umpire~unified-memory build_system=autotools arch=linux-sles15-skylake_avx512
[^]          ^intel-oneapi-mkl@2023.1.0%gcc@12.2.0+cluster+envmods~ilp64+shared build_system=generic threads=none arch=linux-sles15-skylake_avx512
[^]              ^intel-oneapi-tbb@2021.9.0%gcc@12.2.0+envmods build_system=generic arch=linux-sles15-skylake_avx512
[+]          ^mumps@5.5.1%gcc@12.2.0~blr_mt+complex+double+float~incfort~int64+metis+mpi~openmp+parmetis~ptscotch~scotch+shared build_system=generic patches=373d736 arch=linux-sles15-skylake_avx512
[^]          ^python@3.10.10%gcc@12.2.0+bz2+crypt+ctypes+dbm~debug+libxml2+lzma~nis~optimizations+pic+pyexpat+pythoncmd+readline+shared+sqlite3+ssl~tkinter+uuid+zlib build_system=generic patches=0d98e93,7d40923,f2fd060 arch=linux-sles15-skylake_avx512
[^]              ^bzip2@1.0.8%gcc@12.2.0~debug~pic+shared build_system=generic arch=linux-sles15-skylake_avx512
[^]              ^expat@2.5.0%gcc@12.2.0+libbsd build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libbsd@0.11.7%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]                      ^libmd@1.0.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gdbm@1.23%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^gettext@0.21.1%gcc@12.2.0+bzip2+curses+git~libunistring+libxml2+tar+xz build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^libxml2@2.10.3%gcc@12.2.0~python build_system=autotools arch=linux-sles15-skylake_avx512
[^]                  ^tar@1.30%gcc@12.2.0 build_system=autotools zip=pigz arch=linux-sles15-skylake_avx512
[^]                      ^pigz@2.7%gcc@12.2.0 build_system=makefile arch=linux-sles15-skylake_avx512
[^]              ^libffi@3.4.4%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^libxcrypt@4.4.33%gcc@12.2.0~obsolete_api build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^sqlite@3.40.1%gcc@12.2.0+column_metadata+dynamic_extensions+fts~functions+rtree build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^util-linux-uuid@2.38.1%gcc@12.2.0 build_system=autotools arch=linux-sles15-skylake_avx512
[^]              ^xz@5.4.1%gcc@12.2.0~pic build_system=autotools libs=shared,static arch=linux-sles15-skylake_avx512
[+]          ^superlu-dist@develop%gcc@12.2.0~cuda+int64~ipo~openmp+parmetis~rocm+shared build_system=cmake build_type=Release generator=make arch=linux-sles15-skylake_avx512
[^]      ^zlib@1.2.13%gcc@12.2.0+optimize+pic+shared build_system=makefile arch=linux-sles15-skylake_avx512

Thomas-Ulrich commented 4 months ago

I think the problem occurs when the job is killed while the Green's functions are being written (the write is probably overwriting the old file). Note that the mesh is tiny, yet write_discrete_greens_operator takes ages (several seconds per call):

num_nodes: 1 ntasks: 48


Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 8856 x 5904
partial_assemble_discrete_greens_function() [0 , 5904)
Computing Green's function 0 / 5904
write_discrete_greens_operator():matrix 3.54e+00 (sec)
  status: computed 1 / pending 5903
write_discrete_greens_operator():facets 1.59e-02 (sec)
Computing Green's function 1 / 5904
write_discrete_greens_operator():matrix 3.36e+00 (sec)
  status: computed 2 / pending 5902
write_discrete_greens_operator():facets 6.75e-03 (sec)
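
If the checkpoint really is overwritten in place, a job killed mid-write leaves a truncated file, which would match the "Read past end of file" error above. A common way to guard against that is to write to a temporary file and rename it over the old checkpoint only once the write has finished. Below is a minimal Python sketch of that pattern. It is not tandem's or PETSc's code, and `write_checkpoint_atomically` is a hypothetical helper name used only for illustration.

```python
import os
import tempfile


def write_checkpoint_atomically(path: str, payload: bytes) -> None:
    """Replace `path` with `payload` so a reader sees the old or the new file, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temporary file lives in the same directory so that the final
    # rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_ckpt_")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(payload)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # swap in the complete file
    except BaseException:
        # The previous checkpoint is untouched; only the partial file is removed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In an MPI run, the equivalent would be for all ranks to finish writing the temporary file and for a single rank to perform the rename afterwards. The cost is that two copies of the checkpoint exist on disk for a short time.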

Thomas-Ulrich commented 4 months ago

Example timing breakdown:

  Total time:       4.29e+00 sec
  Open file:        4.90e-05 sec
  Write commsize:   2.88e-01 sec
  Write current_gf: 2.17e-06 sec
  MatView:          3.69e+00 sec
  Close file:       3.16e-01 sec
  Print status:     5.60e-05 sec
  Write facet:      1.42e-03 sec

Thomas-Ulrich commented 4 months ago

OK, I guess the problem is that the full Green's function matrix (including the zeros) needs to be written at each call.
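
To illustrate the difference, here is a small Python sketch that stores a dense matrix column by column and writes only the newly computed column at its byte offset, so the cost of one checkpoint scales with one column rather than with the whole matrix. This is only an illustration under assumptions: the raw column-major layout and the helper names `write_column` and `read_column` are made up here, and this is neither the PETSc binary format nor the change made in the linked commits.

```python
import os
import struct

BYTES_PER_DOUBLE = 8  # little-endian IEEE double, as packed with "<d"


def write_column(path, column_index, column):
    """Write one column of a column-major dense matrix at its byte offset."""
    rows = len(column)
    # Open without truncating if the file already holds earlier columns.
    mode = "r+b" if os.path.exists(path) else "w+b"
    with open(path, mode) as f:
        f.seek(column_index * rows * BYTES_PER_DOUBLE)
        f.write(struct.pack(f"<{rows}d", *column))


def read_column(path, column_index, rows):
    """Read one column back from the same layout."""
    with open(path, "rb") as f:
        f.seek(column_index * rows * BYTES_PER_DOUBLE)
        data = f.read(rows * BYTES_PER_DOUBLE)
    return list(struct.unpack(f"<{rows}d", data))
```

With this layout the file only grows by one column per checkpoint, and the columns already on disk are never rewritten.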

Thomas-Ulrich commented 4 months ago

Here is an example of BP5 with the default mesh: checkpointing 152 GB takes 19 min!

num_nodes: 6 ntasks: 288


Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 167796 x 111864
partial_assemble_discrete_greens_function() [0 , 111864)
Computing Green's function 0 / 111864
write_discrete_greens_operator():matrix 1.14e+03 (sec)
  status: computed 1 / pending 111863
write_discrete_greens_operator():facets 1.18e-02 (sec)
Computing Green's function 1 / 111864
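
As a rough sanity check of these numbers, the size of the full dense operator and the implied write throughput can be estimated from the operator sizes and write times in the logs. The sketch below assumes 8-byte real scalars (PETSc is configured with --with-precision=double and --with-scalar-type=real) and ignores any file header, so it is an estimate only.

```python
BYTES_PER_DOUBLE = 8  # assumption: 8-byte real scalars, no header overhead


def dense_bytes(rows, cols):
    """Size in bytes of a dense rows x cols matrix of doubles."""
    return rows * cols * BYTES_PER_DOUBLE


# Operator sizes and write times taken from the logs in this issue.
small_bytes = dense_bytes(8856, 5904)       # tiny test mesh
large_bytes = dense_bytes(167796, 111864)   # default BP5 mesh

small_rate = small_bytes / 3.47 / 1e6       # MB/s, write took 3.47e+00 s
large_rate = large_bytes / 1.14e3 / 1e6     # MB/s, write took 1.14e+03 s

print(f"small: {small_bytes / 1e6:.0f} MB at {small_rate:.0f} MB/s")
print(f"large: {large_bytes / 1e9:.0f} GB at {large_rate:.0f} MB/s")
```

This gives about 418 MB for the test mesh and about 150 GB for the default mesh, close to the 152 GB quoted above. Both cases work out to roughly 120 to 130 MB/s, which is consistent with the whole dense matrix being written at every checkpoint.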

Thomas-Ulrich commented 4 months ago

OK, it seems I fixed one of the problems with this simple commit:

https://github.com/TEAR-ERC/tandem/pull/72/commits/739b36d465dd74a032f582112b151cb80cd2a59c

Now checkpointing is much faster!

Multigrid P-levels: 1 2 
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 167796 x 111864
partial_assemble_discrete_greens_function() [0 , 111864)
Computing Green's function 0 / 111864
write_discrete_greens_operator():matrix 1.55e+01 (sec)
  status: computed 1 / pending 111863
write_discrete_greens_operator():facets 8.62e-03 (sec)
Computing Green's function 1 / 111864
write_discrete_greens_operator():matrix 1.65e+01 (sec)
  status: computed 2 / pending 111862
write_discrete_greens_operator():facets 8.93e-03 (sec)
Computing Green's function 2 / 111864
write_discrete_greens_operator():matrix 1.56e+01 (sec)
  status: computed 3 / pending 111861
write_discrete_greens_operator():facets 1.10e-02 (sec)
Computing Green's function 3 / 111864
write_discrete_greens_operator():matrix 1.62e+01 (sec)

Thomas-Ulrich commented 4 months ago

And with https://github.com/TEAR-ERC/tandem/pull/72/commits/cfd7a258b9adb64e2ff8e47f8bad37b65732406a I fixed the rest of the issue.