@wiene and I observe rare crashes with this trace (line numbers of the 0.5.0 version):
2021-02-20 01:41:11 [ERROR][cobald.runtime.runner.asyncio]: runner aborted: <cobald.daemon.runners.asyncio_runner.AsyncioRunner object at 0x7fbb5d697978>
Traceback (most recent call last):
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/base_runner.py", line 62, in run
    self._run()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/asyncio_runner.py", line 28, in _run
    self.event_loop.run_until_complete(self._run_payloads())
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/asyncio_runner.py", line 36, in _run_payloads
    await self._reap_payloads()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/asyncio_runner.py", line 58, in _reap_payloads
    raise task.exception()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/async_tools.py", line 7, in raise_return
    value = await payload()
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/resources/drone.py", line 94, in run
    await current_state.run(self)
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/resources/dronestates.py", line 171, in run
    drone_uuid=drone.resource_attributes["drone_uuid"]
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/agents/batchsystemagent.py", line 20, in get_allocation
    return await self._batch_system_adapter.get_allocation(drone_uuid)
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/adapters/batchsystems/htcondor.py", line 195, in get_allocation
    return max(await self.get_resource_ratios(drone_uuid), default=0.0)
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/adapters/batchsystems/htcondor.py", line 181, in <genexpr>
    if key in self.ratios.keys()
ValueError: could not convert string to float: 'error'
This always seems to happen close to, or at the same time as, the `schedd` negotiating with the LBS, i.e. when new drones are starting. Presumably, `Cpus` or `Memory` may briefly not be well-defined in HTCondor (maybe while the `pslot` is still being formed?).
Of course, this could be worked around with an `ifThenElse`, but catching an `error` string in TARDIS and retrying later in that case might be more reliable :smile:.
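
To illustrate the TARDIS-side idea, here is a minimal sketch. It is not the actual adapter code: `safe_ratios`, `allocation` and the key names `cpu_ratio` / `memory_ratio` are made up for this example, and it assumes the ratios arrive as a mapping of name to string. Values that cannot be parsed as a number, such as `error`, are skipped instead of aborting the runner.

```python
from typing import Dict, List


def safe_ratios(raw_ratios: Dict[str, str]) -> List[float]:
    """Convert ratio strings to floats, skipping values that are not numeric.

    Hypothetical helper: the input is assumed to map a ratio name to the
    string reported by the batch system.
    """
    ratios = []
    for value in raw_ratios.values():
        try:
            ratios.append(float(value))
        except ValueError:
            # e.g. 'error' while the slot is not fully set up yet;
            # skip it and let the next update cycle try again
            continue
    return ratios


def allocation(raw_ratios: Dict[str, str]) -> float:
    """Mirror the max(..., default=0.0) call seen in the traceback."""
    return max(safe_ratios(raw_ratios), default=0.0)


if __name__ == "__main__":
    # Key names are placeholders, not necessarily the configured ones.
    print(allocation({"cpu_ratio": "0.25", "memory_ratio": "error"}))
```

Whether falling back to `0.0` is the right behaviour when every ratio is unusable is a separate question; reporting "unknown" and keeping the drone in its current state until the next update might be safer.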
So that happens in here: https://github.com/MatterMiners/tardis/blob/1139be80d885305ef8a62fca7ca49d8e15ca334e/tardis/adapters/batchsystems/htcondor.py#L172-L182
We set the following `BatchSystem.ratios` for our HTCondor LBS: