MatterMiners / tardis

Transparent Adaptive Resource Dynamic Integration System
https://cobald-tardis.readthedocs.io
MIT License

Avoid black hole like situations #205

Open giffels opened 3 years ago

giffels commented 3 years ago

Recently, a black-hole-like situation occurred on one of our HPC clusters. The automated configuration of HTCondor on the Drones stopped working because the system disk on the remote git server was full. TARDIS relentlessly tried to boot new Drones, which resulted in a sort of DDoS on the remote git server.

It would be nice to implement a mechanism that stops deploying new Drones if the lifetime of a Drone is too short or if too many Drones are spawned within a defined interval.
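
The "too many Drones in a defined interval" condition could be sketched as a sliding-window guard. This is only an illustration under assumptions: `SpawnRateGuard`, `may_spawn`, `record_spawn`, and both parameters are hypothetical names and not part of the TARDIS API; where such a check would hook into the Drone life cycle is left open.

```python
import time
from collections import deque


class SpawnRateGuard:
    """Hypothetical sliding-window limiter; not an existing TARDIS class."""

    def __init__(self, max_spawns, interval, clock=time.monotonic):
        self.max_spawns = max_spawns  # allowed spawns per window (assumed parameter)
        self.interval = interval      # window length in seconds (assumed parameter)
        self._clock = clock           # injectable clock, mainly for testing
        self._spawn_times = deque()   # timestamps of recent spawns, oldest first

    def _prune(self, now):
        # Forget spawns that have fallen out of the sliding window.
        while self._spawn_times and now - self._spawn_times[0] >= self.interval:
            self._spawn_times.popleft()

    def may_spawn(self):
        """Return True if another Drone may be started right now."""
        self._prune(self._clock())
        return len(self._spawn_times) < self.max_spawns

    def record_spawn(self):
        """Call once for every Drone that is actually started."""
        self._spawn_times.append(self._clock())
```

A caller would check `may_spawn()` before requesting a new Drone and call `record_spawn()` afterwards. Note that this limits the overall spawn rate only; it cannot tell healthy bursts from a failure loop.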

maxfischer2781 commented 3 years ago

The general problem in Grid clusters is separating many starts on different nodes from many starts on the same node. Do the adapters have a means to expose such per-node statistics?

giffels commented 3 years ago

No, they do not (yet). However, Drone lifetime seems to be a good measure to use instead.
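
A minimal sketch of using Drone lifetime as the measure, assuming the lifetime of each terminated Drone can be reported to some central object. All names here (`ShortLifetimeBreaker`, `record_lifetime`, `may_spawn`, and the three thresholds) are hypothetical and not part of TARDIS; the idea is a simple circuit breaker that pauses spawning after several consecutive short-lived Drones.

```python
import time
from collections import deque


class ShortLifetimeBreaker:
    """Hypothetical circuit breaker keyed on Drone lifetime; not TARDIS API."""

    def __init__(self, min_lifetime, max_short_lived, cooldown,
                 clock=time.monotonic):
        self.min_lifetime = min_lifetime  # seconds below which a Drone counts as failed
        self.cooldown = cooldown          # seconds to pause spawning once tripped
        self._clock = clock               # injectable clock, mainly for testing
        # Keep only the most recent `max_short_lived` observations.
        self._recent = deque(maxlen=max_short_lived)
        self._paused_until = float("-inf")

    def record_lifetime(self, lifetime):
        """Report the lifetime (in seconds) of a Drone that just terminated."""
        self._recent.append(lifetime < self.min_lifetime)
        # Trip only if the window is full and every recent Drone was short-lived.
        if len(self._recent) == self._recent.maxlen and all(self._recent):
            self._paused_until = self._clock() + self.cooldown
            self._recent.clear()

    def may_spawn(self):
        """Return True unless the breaker is currently in its cooldown."""
        return self._clock() >= self._paused_until
```

Requiring consecutive short lifetimes, rather than a single one, is a design choice in this sketch so that one broken node does not pause the whole site; it does not replace the per-node statistics discussed above.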