Closed shreyb closed 3 months ago
Done. All unit tests pass, and tomorrow I'll run end-to-end tests to make sure this works, including at some scale.
Some race-condition issues came up in the end-to-end testing that I had to resolve. Now both unit and end-to-end tests pass.
We've had a lot of condor infrastructure taken out of service so that those machines can be reinstalled as Alma 9. That leads to various managed tokens failures when the proper collector or schedd can't be contacted. Since the schedd list is generally populated from the collector by running `condor_status -pool <COLLECTOR> -schedd`, we should implement a failover whereby we can specify multiple collectors in the config, and the code will try each collector (perhaps in random order) to get the schedds before returning an error.
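A rough sketch of the failover idea, in Python purely for illustration (not the project's actual code). The function names `query_schedds` and `get_schedds_with_failover` are hypothetical, and the `-af Name` autoformat option is my addition to make the output easy to parse; the original command only uses `-pool` and `-schedd`.

```python
import random
import subprocess


def query_schedds(collector, timeout=30):
    """Ask one collector for its schedds (hypothetical helper).

    Assumes condor_status is on PATH; "-af Name" is added here so that
    each output line is a single schedd name.
    """
    result = subprocess.run(
        ["condor_status", "-pool", collector, "-schedd", "-af", "Name"],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # non-zero exit raises CalledProcessError
    )
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


def get_schedds_with_failover(collectors, query=query_schedds, shuffle=True):
    """Try each collector until one returns schedds; raise if all fail."""
    candidates = list(collectors)
    if shuffle:
        # Random order spreads the load and avoids always hitting a dead host first.
        random.shuffle(candidates)

    errors = {}
    for collector in candidates:
        try:
            schedds = query(collector)
        except (subprocess.SubprocessError, OSError) as exc:
            # Unreachable collector, timeout, or missing binary: remember and move on.
            errors[collector] = exc
            continue
        if schedds:
            return schedds
        errors[collector] = RuntimeError("no schedds returned")

    # Only error out after every configured collector has been tried.
    raise RuntimeError(f"all collectors failed: {errors}")
```

Passing `query` as a parameter keeps the failover loop testable without a live pool. Whether an empty schedd list should count as a failure, as it does here, is a judgment call.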