eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License

debug CI #1057

Closed rasolca closed 7 months ago

rasolca commented 10 months ago

Just trying to debug problems like: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5645709351

rasolca commented 10 months ago

cscs-ci run

rasolca commented 10 months ago

cscs-ci run

codecov-commenter commented 10 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Comparison is base (ab2bb6f) 94.02% compared to head (dbafede) 94.02%. Report is 1 commit behind head on master.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##           master    #1057   +/-   ##
=======================================
  Coverage   94.02%   94.02%
=======================================
  Files         145      145
  Lines        8955     8955
  Branches     1142     1142
=======================================
  Hits         8420     8420
  Misses        319      319
  Partials      216      216
```


rasolca commented 10 months ago

cscs-ci run

rasolca commented 10 months ago

cscs-ci run

rasolca commented 10 months ago

cscs-ci run

msimberg commented 10 months ago

I don't know if it helps, but it seems to also have happened on a non-codecov configuration: https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5676897631.

rasolca commented 7 months ago

As the timeout is showing up more often, I would try to upstream these changes.

rasolca commented 7 months ago

> @rasolca no objection to this. Do I understand correctly that setting SLURM_WAIT=0 just means wait forever for the job to finish (https://slurm.schedmd.com/srun.html#OPT_wait)? And then we rely on the gitlab job timeout to kill the job instead if it hangs?

Not fully correct. When one of the processes terminates, Slurm expects all other processes to terminate within the wait time; otherwise it kills them. SLURM_WAIT=0 just disables this behaviour. If the job reaches the time limit, it is still killed.
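For illustration, here is a minimal sketch of how a test launcher could apply this setting. It assumes a Python wrapper around `srun`; the helper names `srun_environment` and `run_under_srun` are hypothetical and are not part of the DLA-Future CI scripts. `SLURM_WAIT` is the environment equivalent of `srun --wait`, and a value of 0 disables the kill-after-first-exit timer. The job's time limit still applies.

```python
import os
import shutil
import subprocess


def srun_environment(base_env, wait_seconds=0):
    """Return a copy of base_env with SLURM_WAIT set.

    SLURM_WAIT mirrors `srun --wait`: after the first task exits, srun
    waits this many seconds before killing the remaining tasks.
    A value of 0 means no such limit is applied.
    """
    env = dict(base_env)
    env["SLURM_WAIT"] = str(wait_seconds)
    return env


def run_under_srun(command, wait_seconds=0):
    """Hypothetical helper: run `command` through srun if it is available.

    Returns the exit code, or None when srun is not installed.
    """
    if shutil.which("srun") is None:
        return None
    result = subprocess.run(
        ["srun", *command],
        env=srun_environment(os.environ, wait_seconds),
        check=False,
    )
    return result.returncode
```

With `wait_seconds=0`, ranks that finish late are not killed just because another rank exited first. A genuinely hung job is still ended by the Slurm time limit or the GitLab job timeout.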

rasolca commented 7 months ago

cscs-ci run