DUNE / dist-comp

Action items for DUNE distributed computing, and common scripts that are used.

JustIN apparently marking large fractions of jobs stalled when it should not be #156

Closed StevenCTimm closed 3 weeks ago

StevenCTimm commented 2 months ago

It's not clear whether this is an HTCondor issue or a JustIN issue, but investigation shows that 90-95% of the jobs that JustIN is marking as "stalled" are in fact running to completion. More investigation is needed to see why.

Andrew-McNab-UK commented 3 weeks ago

This was a transitory problem: the HTCondor schedds were overloaded, compounded by faulty checking of HTCondor return codes, which was fixed in 01.01.
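
For illustration only, here is a minimal Python sketch of the general failure mode described above. It is not JustIN's actual code; the function names, status labels, and the `command` parameter are all hypothetical. The point is that when a scheduler query exits non-zero (for example because a schedd is overloaded), the job's state is unknown, and that should not be treated as evidence that the job has disappeared.

```python
import subprocess
from typing import Sequence


def classify_job(returncode: int, listed_ids: Sequence[str], job_id: str) -> str:
    """Classify one job from the result of a scheduler query.

    The labels "unknown", "present" and "absent" are hypothetical.
    """
    if returncode != 0:
        # The query itself failed, so its output tells us nothing about
        # the job. Report "unknown" rather than concluding it is gone.
        return "unknown"
    if job_id in listed_ids:
        return "present"
    # Only a successful query that omits the job counts as evidence.
    return "absent"


def query_and_classify(command: Sequence[str], job_id: str,
                       timeout: float = 60.0) -> str:
    """Run a query command (supplied by the caller) and classify job_id.

    Assumes the command prints whitespace-separated job IDs on stdout.
    """
    try:
        result = subprocess.run(list(command), capture_output=True,
                                text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Could not run the query, or it hung: again, state is unknown.
        return "unknown"
    return classify_job(result.returncode, result.stdout.split(), job_id)
```

A caller following this pattern would retry on "unknown" and only consider marking a job stalled after a successful query reports it absent.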