celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html

Set global execution timeout for automated testing on Jenkins #1186

Closed: dalg24 closed this 5 months ago

dalg24 commented 5 months ago

Set a timeout period after which the Jenkins server will abort the pipeline run. Without it, a job that somehow gets stuck may run for days before an admin kills it manually. Feel free to adjust the time value.
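For readers unfamiliar with the mechanism, a global timeout in a declarative Jenkinsfile might look roughly like the sketch below. It is not the actual diff from this PR: the stage name, agent label, and build script are hypothetical, and the 2-hour value is taken from the discussion that follows.

```groovy
// Minimal sketch of a pipeline-wide timeout (declarative syntax).
// Assumptions: 'hypothetical-label' and './hypothetical-build.sh' are
// placeholders, not names from the Celeritas repository.
pipeline {
    agent none
    options {
        // Abort the whole pipeline run once this much wall time has elapsed.
        timeout(time: 2, unit: 'HOURS')
    }
    stages {
        stage('Build') {
            agent { label 'hypothetical-label' }
            steps {
                sh './hypothetical-build.sh'
            }
        }
    }
}
```

With `agent none` at the top level and agents requested per stage, the clock starts before any agent is allocated, so time spent waiting for an executor counts against the limit.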

dalg24 commented 5 months ago

For reference, we use 6 hours on Kokkos and 3 hours on ArborX and Cabana.

sethrj commented 5 months ago

Cool. I've never seen a Celeritas job take more than half an hour once it's started.

sethrj commented 5 months ago

@dalg24 Can you explain why this CI job "took 2 hours" (and died) only 3 minutes into the build? https://cloud.cees.ornl.gov/jenkins-ci/job/celeritas/job/PR-1189/2/pipeline-console/?selected-node=14

Does 2 hours include the time the job spends waiting for Kokkos to do its multi-hour builds? 😅

dalg24 commented 5 months ago

The timeout includes the time spent waiting in the queue. I am not aware of a way to configure it to count only actual run time. In any case, we do want some upper limit for the whole process. Feel free to increase it again to match what other projects do.
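For illustration only, one arrangement that may separate the two concerns in a scripted pipeline is to nest a shorter `timeout` inside the `node` block, so that its clock starts after an executor has been allocated, while an outer `timeout` still caps the whole process. This is a sketch under assumptions: the label, script name, and both durations are hypothetical, and it has not been tried on this Jenkins instance.

```groovy
// Sketch: outer limit covers queue wait + run; inner limit covers run only.
// 'hypothetical-label' and './hypothetical-build.sh' are placeholders.
timeout(time: 6, unit: 'HOURS') {            // upper bound for everything
    node('hypothetical-label') {              // may wait in the queue here
        timeout(time: 1, unit: 'HOURS') {     // starts once the node is allocated
            checkout scm
            sh './hypothetical-build.sh'
        }
    }
}
```

The trade-off is that the outer limit has to be generous enough to absorb long queue waits, which is the same tuning question raised above.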

sethrj commented 5 months ago

Arg. That means whether our jobs succeed depends directly on Kokkos' run times. Is there a way to resubmit the jobs that failed because they got stuck behind one or more Kokkos CI sets? (Besides pull requests, the develop branch will also experience failures.)

dalg24 commented 5 months ago

I have no experience with it, but you can look at https://www.jenkins.io/doc/pipeline/steps/workflow-basic-steps/#retry-retry-the-body-up-to-n-times. Obviously, you'd need to come up with a condition that can clearly identify that the failure was a timeout.
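As a rough, untested sketch of such a condition in a scripted pipeline: catch the interruption raised by the `timeout` step and check whether its cause is the timeout itself. It uses a plain loop rather than the `retry` step so that only timeouts are retried and other failures propagate immediately. The assumptions here are that the `ExceededTimeout` cause class from the workflow-basic-steps plugin is reachable from the Jenkinsfile (in a sandboxed pipeline this may require script approval), and that the label, script name, and attempt count are hypothetical.

```groovy
import org.jenkinsci.plugins.workflow.steps.FlowInterruptedException
import org.jenkinsci.plugins.workflow.steps.TimeoutStepExecution

int maxAttempts = 2  // hypothetical value

for (int attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
        timeout(time: 2, unit: 'HOURS') {
            node('hypothetical-label') {
                checkout scm
                sh './hypothetical-build.sh'
            }
        }
        break  // the attempt succeeded, stop looping
    } catch (FlowInterruptedException e) {
        // True only when the interruption came from the timeout step,
        // not from a user abort or another cause.
        boolean timedOut = e.causes.any {
            it instanceof TimeoutStepExecution.ExceededTimeout
        }
        if (!timedOut || attempt == maxAttempts) {
            throw e  // not a timeout, or out of attempts: fail as usual
        }
        echo "Attempt ${attempt} timed out; retrying"
    }
}
```

Note that a retried run rejoins the queue, so if the original timeout was caused by a long wait behind other projects, the retry may hit the same limit.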