Open giffels opened 3 years ago
The general problem in Grid clusters is distinguishing many starts on different nodes from many starts on the same node. Do the adapters have a means to expose such per-node statistics?
No, they do not (yet). However, the Drone lifetime seems to be a good measure to use instead.
Recently, a black-hole-like situation occurred on one of our HPC clusters. The automated configuration of HTCondor on the Drone stopped working because the system disk on the remote git server was full. TARDIS relentlessly tried to boot new Drones, which resulted in a sort of DDoS situation on the remote git server.
It would be nice to implement a mechanism that stops deploying new Drones if the lifetime of a Drone is too short or too many Drones are spawned in a defined interval.
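To make the request more concrete, here is a rough sketch of what such a guard could look like. It is not based on the existing TARDIS code: the class name `DroneSpawnGuard`, its methods, and all thresholds are hypothetical and only illustrate the two conditions (Drone lifetime too short, too many Drones spawned in a defined interval) with a sliding time window.

```python
import time
from collections import deque
from typing import Callable, Deque


class DroneSpawnGuard:
    """Hypothetical guard, not part of the TARDIS API.

    Blocks new deployments when too many Drones were spawned, or too many
    Drones died young, within a sliding time window.
    """

    def __init__(
        self,
        min_lifetime: float = 600.0,   # seconds; shorter lifetimes count as failures (assumed value)
        max_short_lived: int = 3,      # tolerated short-lived Drones per window (assumed value)
        max_spawns: int = 10,          # tolerated spawns per window (assumed value)
        interval: float = 3600.0,      # window length in seconds (assumed value)
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.min_lifetime = min_lifetime
        self.max_short_lived = max_short_lived
        self.max_spawns = max_spawns
        self.interval = interval
        self._clock = clock
        self._spawns: Deque[float] = deque()       # timestamps of recent spawns
        self._short_lived: Deque[float] = deque()  # timestamps of recent early exits

    def _expire(self, now: float) -> None:
        # Drop events that have fallen out of the sliding window.
        cutoff = now - self.interval
        for events in (self._spawns, self._short_lived):
            while events and events[0] <= cutoff:
                events.popleft()

    def record_spawn(self) -> None:
        """Call whenever a new Drone is deployed."""
        self._spawns.append(self._clock())

    def record_exit(self, lifetime: float) -> None:
        """Call when a Drone terminates; `lifetime` is its run time in seconds."""
        if lifetime < self.min_lifetime:
            self._short_lived.append(self._clock())

    def may_spawn(self) -> bool:
        """Return True if deploying another Drone is currently allowed."""
        self._expire(self._clock())
        return (
            len(self._spawns) < self.max_spawns
            and len(self._short_lived) < self.max_short_lived
        )
```

The deployment loop would call `may_spawn()` before requesting a new Drone and simply skip the request while it returns `False`. Because old events expire from the window, deployment resumes on its own once the failures stop, without manual intervention.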