Open Thomas-Ulrich opened 4 months ago
I think the problem occurs when the job crashes while writing the Green's functions (it is probably overwriting the old file). Note that the mesh is tiny, yet write_discrete_greens_operator takes ages (several seconds).
num_nodes: 1 ntasks: 48
___ ___ _____ ___ ___
___ / /\ /__/\ / /::\ / /\ /__/\
/ /\ / /::\ \ \:\ / /:/\:\ / /:/_ | |::\
/ /:/ / /:/\:\ \ \:\ / /:/ \:\ / /:/ /\ | |:|:\
/ /:/ / /:/~/::\ _____\__\:\ /__/:/ \__\:| / /:/ /:/_ __|__|:|\:\
/ /::\ /__/:/ /:/\:\/__/::::::::\\ \:\ / /://__/:/ /:/ /\/__/::::| \:\
/__/:/\:\\ \:\/:/__\/\ \:\~~\~~\/ \ \:\ /:/ \ \:\/:/ /:/\ \:\~~\__\/
\__\/ \:\\ \::/ \ \:\ ~~~ \ \:\/:/ \ \::/ /:/ \ \:\
\ \:\\ \:\ \ \:\ \ \::/ \ \:\/:/ \ \:\
\__\/ \ \:\ \ \:\ \__\/ \ \::/ \ \:\
\__\/ \__\/ \__\/ \__\/
tandem version ee87ac9
stack size limit = 2048 MiB
Worker affinity
0---------|----------|----------|----------|--------8-|----------|
----------|----------|----------|------
Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 8856 x 5904
partial_assemble_discrete_greens_function() [0 , 5904)
Computing Green's function 0 / 5904
write_discrete_greens_operator():matrix 3.54e+00 (sec)
status: computed 1 / pending 5903
write_discrete_greens_operator():facets 1.59e-02 (sec)
Computing Green's function 1 / 5904
write_discrete_greens_operator():matrix 3.36e+00 (sec)
status: computed 2 / pending 5902
write_discrete_greens_operator():facets 6.75e-03 (sec)
Example timing breakdown:
Total time: 4.29e+00 sec
Open file: 4.90e-05 sec
Write commsize: 2.88e-01 sec
Write current_gf:2.17e-06 sec
MatView: 3.69e+00 sec
Close file: 3.16e-01 sec
Print status: 5.60e-05 sec
Write facet: 1.42e-03 sec
OK, I guess the problem is that the full Green's function operator (including the zeros) is written at each call.
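One way to avoid rewriting the whole operator at every checkpoint is to append only the newly computed column. The following is a minimal sketch of that idea, not tandem's code and not necessarily what the commits linked in this thread do: the raw float64 file layout and the names `append_column` and `completed_columns` are hypothetical, chosen only for illustration.

```python
import os
import struct
import tempfile


def append_column(path, column):
    """Append one Green's function column as raw little-endian float64 values."""
    with open(path, "ab") as f:
        f.write(struct.pack(f"<{len(column)}d", *column))
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to disk


def completed_columns(path, n_rows):
    """Count complete columns on disk; a partial trailing column is ignored."""
    if not os.path.exists(path):
        return 0
    return os.path.getsize(path) // (8 * n_rows)


# Toy example: 3 columns of 4 rows each (hypothetical sizes).
n_rows = 4
path = os.path.join(tempfile.mkdtemp(), "gf_columns.bin")
for j in range(3):
    append_column(path, [float(j)] * n_rows)

print(completed_columns(path, n_rows))  # 3
```

With this layout each checkpoint costs one column of I/O rather than the full matrix, and a restart can work out how many columns are already done from the file size alone.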
Here is an example of BP5 with the default mesh: checkpointing 152 GB takes 19 min!
num_nodes: 6 ntasks: 288
tandem version ee87ac9
stack size limit = 2048 MiB
Worker affinity
0---------|----------|----------|----------|--------8-|----------|
----------|----------|----------|------
Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 167796 x 111864
partial_assemble_discrete_greens_function() [0 , 111864)
Computing Green's function 0 / 111864
write_discrete_greens_operator():matrix 1.14e+03 (sec)
status: computed 1 / pending 111863
write_discrete_greens_operator():facets 1.18e-02 (sec)
Computing Green's function 1 / 111864
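As a rough sanity check on the numbers in the log above, the size of the operator stored as a dense matrix can be estimated from its dimensions. This sketch assumes 8-byte double-precision entries (an assumption, not something the log states) and ignores any file header.

```python
rows, cols = 167796, 111864  # operator size reported in the log
bytes_per_entry = 8          # assumption: double-precision real entries
write_seconds = 1.14e3       # write_discrete_greens_operator():matrix

size_gb = rows * cols * bytes_per_entry / 1e9
throughput = size_gb / write_seconds

print(f"{size_gb:.1f} GB, {write_seconds / 60:.0f} min, {throughput:.2f} GB/s")
# prints: 150.2 GB, 19 min, 0.13 GB/s
```

Under that assumption the estimate is close to the checkpoint size quoted above, which is consistent with the whole dense operator, zeros included, being written after a single Green's function.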
OK, it seems I fixed one of the problems with this simple commit:
https://github.com/TEAR-ERC/tandem/pull/72/commits/739b36d465dd74a032f582112b151cb80cd2a59c
Now checkpointing is much faster!
Multigrid P-levels: 1 2
Using GF checkpoint path: GreensFunctions/bp6_hf250
create_discrete_greens_function()
Green's function operator size: 167796 x 111864
partial_assemble_discrete_greens_function() [0 , 111864)
Computing Green's function 0 / 111864
write_discrete_greens_operator():matrix 1.55e+01 (sec)
status: computed 1 / pending 111863
write_discrete_greens_operator():facets 8.62e-03 (sec)
Computing Green's function 1 / 111864
write_discrete_greens_operator():matrix 1.65e+01 (sec)
status: computed 2 / pending 111862
write_discrete_greens_operator():facets 8.93e-03 (sec)
Computing Green's function 2 / 111864
write_discrete_greens_operator():matrix 1.56e+01 (sec)
status: computed 3 / pending 111861
write_discrete_greens_operator():facets 1.10e-02 (sec)
Computing Green's function 3 / 111864
write_discrete_greens_operator():matrix 1.62e+01 (sec)
And with https://github.com/TEAR-ERC/tandem/pull/72/commits/cfd7a258b9adb64e2ff8e47f8bad37b65732406a I fixed the rest of the issue.
Describe the bug
I'm running BP5.toml on this branch: https://github.com/TEAR-ERC/tandem/pull/72 (at commit ee87ac9), which is a few commits on top of https://github.com/TEAR-ERC/tandem/pull/59.
I changed res_f to 5 to get a very small mesh for testing. In BP5.toml, I added:
This is so that the Green's functions are checkpointed after every new Green's function. This generally works, but several times the run was unable to restart. E.g. a job killed during generation of the Green's functions:
Next job failing:
I noticed similar issues on kernelpanic.
Expected behavior
The Green's function generation should have started again.
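A common way to keep a checkpoint restartable when a job can be killed mid-write is to write to a temporary file and rename it over the old checkpoint only once the write has completed. The sketch below illustrates that pattern in plain Python; it is not tandem's implementation, and the function name and file names are hypothetical.

```python
import os
import tempfile


def write_checkpoint_atomically(path, payload: bytes):
    """Write payload to a temp file, then atomically swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())    # make sure the data reached the disk
        os.replace(tmp_path, path)  # the old checkpoint stays valid until here
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)     # do not leave half-written files behind
        raise


workdir = tempfile.mkdtemp()
checkpoint = os.path.join(workdir, "gf_checkpoint.bin")
write_checkpoint_atomically(checkpoint, b"computed 1")
write_checkpoint_atomically(checkpoint, b"computed 2")

with open(checkpoint, "rb") as f:
    latest = f.read()
```

If the process dies before `os.replace`, the previous checkpoint is untouched and the next job can restart from it; only a leftover `.tmp` file needs cleaning up.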
To Reproduce
Steps to reproduce the behavior: tandem was installed with spack on SuperMUC-NG with:
spack install -j 30 tandem@tscp polynomial_degree=2 domain_dimension=3
Here is a list of the dependencies of tandem and their specs: