Allow for condor collector failover

shreyb commented 7 months ago

A lot of our condor infrastructure has been taken out of service so that the machines can be reinstalled with Alma 9. That leads to various managed tokens failures when the proper collector or schedd can't be contacted. Since the schedd list is generally populated from the collector by running condor_status -pool <COLLECTOR> -schedd, we should implement a failover whereby we can specify in the config:

collectorHost: "collector1,collector2"

The code would then try each collector (perhaps in random order) to get the schedds, and return an error only if none of them can be contacted.

shreyb commented 3 months ago

Done. Unit tests all pass, and tomorrow I'll run end-to-end tests to make sure this works, including at some scale.

shreyb commented 3 months ago

Some race-condition issues came up in the end-to-end testing that I had to resolve. Now both unit and end-to-end tests pass.

fermitools / managed-tokens

Allow for condor collector failover #87