SGpp / DisCoTec

MPI-based code for distributed HPC simulations with the sparse grid combination technique. Docs: https://discotec.readthedocs.io/
https://sparsegrids.org/
GNU Lesser General Public License v3.0

[JOSS] SGppDistributedCombigridModule test failing #135

Closed by jakelangham 1 month ago

jakelangham commented 2 months ago

This issue relates to ongoing reviews at https://github.com/openjournals/joss-reviews/issues/7018.

I have been trying to get the tests working on a cluster I use. I've done a fresh install with Spack, following the documentation, and submitted the tests with the attached script.

#!/bin/bash
#SBATCH --job-name=discotest
#SBATCH --mem-per-cpu 8000M
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=9
#SBATCH --cpus-per-task=1
#SBATCH --time=01-00:00:0

cd $SLURM_SUBMIT_DIR

[path-to-mpiexec] -np 9 ./test_distributedcombigrid_boost

where path-to-mpiexec refers to the binary that Spack installed locally. (N.b. the install documentation assumes that mpiexec is already in $PATH, which I think would not be true by default when following the Spack route.)
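
For illustration, one way to pick up the Spack-provided mpiexec in such a job script (a sketch only; it assumes the MPI Spack built is OpenMPI, so substitute whatever spack find reports, e.g. mpich):

# list the packages Spack has installed, to see which MPI it built
spack find
# option 1: put the Spack MPI on PATH for this shell/job
spack load openmpi
mpiexec -np 9 ./test_distributedcombigrid_boost
# option 2: resolve the full path to its mpiexec explicitly
MPIEXEC="$(spack location -i openmpi)/bin/mpiexec"
"$MPIEXEC" -np 9 ./test_distributedcombigrid_boost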

While some tests seem to run ok, it eventually outputs some rather formidable error messages, purportedly related to the SGppDistributedCombigridModule. I attach the output file in the hope that you can make sense of it: slurm-10530966.txt. Perhaps this is a local problem with my install / setup, but I'm not exactly sure how to fix it.

freifrauvonbleifrei commented 2 months ago

Hi @jakelangham, thanks for testing the tests!

Yes, I agree that we should mention that mpiexec has to be the one belonging to the MPI installed via Spack.

Yes, this last set of tests! They relate to the widely-distributed combination technique, which is based on file exchange. I believe there is some kind of race condition that happens only sometimes, due to lacking synchronization through the file system. We also see these failures occasionally in the Jenkins CI. The error you linked sounds like an outdated file is read (one that is not large enough).

Right now, I am unsure how to go about debugging this (it could be quick and just need a barrier, or it could take ages).

Could you maybe delete the temporary files and run only ${mpiexec} -n 9 ./test_distributedcombigrid_boost --run_test=thirdLevel --log_level=message to see which exact test it is on your machine?

jakelangham commented 2 months ago

Absolutely - see the attached log

slurm-10531195.txt

freifrauvonbleifrei commented 2 months ago

Hm, this output looks like the test suite just didn't finish in time, which could be fixed by a longer test timeout ;)

But this test suite is really volatile because of the files, so I am considering disabling these tests by default in addition, while leaving them enabled in the CI so we know in case they REALLY break. What do you think @jakelangham?

jakelangham commented 2 months ago

I don't have a problem with that, except I'd like to verify that I can get them to work as expected first. Is there a header file or parameter somewhere I can tweak for this?

freifrauvonbleifrei commented 1 month ago

Sure! In this file https://github.com/SGpp/DisCoTec/blob/main/tests/test_thirdLevel.cpp you can search for boost::unit_test::timeout. The first one you find, on line 861, applies to the whole test suite; the others apply to the individual tests. You can also remove them entirely, if you remember to stop the process at some point in case it hangs ;)
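
To locate those decorators quickly, something like the following should work from the repository root (a sketch; the integer argument of each boost::unit_test::timeout decorator is the limit in seconds, so raising it relaxes the corresponding test or suite):

# print every timeout decorator in the test file, with line numbers
grep -n "boost::unit_test::timeout" tests/test_thirdLevel.cpp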

Before you run again, make sure to run git clean -n and delete all untracked files in the tests folder, to avoid interference from files left by previously aborted runs.
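
A minimal sketch of that cleanup step, run from the repository root (the dry run shows what would be deleted before anything is actually removed):

cd tests
# dry run: list the untracked files that would be removed
git clean -n
# remove them once the list looks right
git clean -f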

jakelangham commented 1 month ago

Hi @freifrauvonbleifrei. Thanks for this. I have been able to get the test to execute successfully; I had to increase those timeouts quite significantly in the end. Nevertheless, it seems to work, so that's good to see. I consider this resolved from my end now.

freifrauvonbleifrei commented 1 month ago

I increased the timeouts (in https://github.com/SGpp/DisCoTec/commit/b6241c2237e6e206b9ba13714bcd0eab5897b01c) and disabled the file-based tests by default. Thanks for pointing it out!