MatterMiners / tardis

Transparent Adaptive Resource Dynamic Integration System
https://cobald-tardis.readthedocs.io
MIT License

Rare crash in `get_resource_ratios` when Slot variables are not defined yet #168

Closed (olifre closed this 3 years ago)

olifre commented 3 years ago

@wiene and I observe rare crashes with the following traceback (line numbers refer to version 0.5.0):

2021-02-20 01:41:11 [ERROR][cobald.runtime.runner.asyncio]: runner aborted: <cobald.daemon.runners.asyncio_runner.AsyncioRunner object at 0x7fbb5d697978>
Traceback (most recent call last):
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/base_runner.py", line 62, in run
    self._run()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/asyncio_runner.py", line 28, in _run
    self.event_loop.run_until_complete(self._run_payloads())
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 484, in run_until_complete
    return future.result()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/asyncio_runner.py", line 36, in _run_payloads
    await self._reap_payloads()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/asyncio_runner.py", line 58, in _reap_payloads
    raise task.exception()
  File "/opt/cobald/lib64/python3.6/site-packages/cobald/daemon/runners/async_tools.py", line 7, in raise_return
    value = await payload()
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/resources/drone.py", line 94, in run
    await current_state.run(self)
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/resources/dronestates.py", line 171, in run
    drone_uuid=drone.resource_attributes["drone_uuid"]
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/agents/batchsystemagent.py", line 20, in get_allocation
    return await self._batch_system_adapter.get_allocation(drone_uuid)
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/adapters/batchsystems/htcondor.py", line 195, in get_allocation
    return max(await self.get_resource_ratios(drone_uuid), default=0.0)
  File "/opt/cobald/lib64/python3.6/site-packages/tardis/adapters/batchsystems/htcondor.py", line 181, in <genexpr>
    if key in self.ratios.keys()
ValueError: could not convert string to float: 'error'

So the conversion fails here: https://github.com/MatterMiners/tardis/blob/1139be80d885305ef8a62fca7ca49d8e15ca334e/tardis/adapters/batchsystems/htcondor.py#L172-L182
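
To illustrate why the exception surfaces at the `max(...)` call in `get_allocation` rather than inside `get_resource_ratios` itself, here is a minimal sketch. It assumes, based only on the traceback above, that the ratios come back as a lazy generator applying `float()` to string values. The `status` dictionary, its keys, and the simplified function signature are made up for illustration; this is not the actual TARDIS code.

```python
# Hypothetical status record for one drone; values arrive as strings.
status = {"cpu_ratio": "error", "memory_ratio": "0.25", "Machine": "node01"}
# Hypothetical configured ratios (only the keys matter here).
ratios = {"cpu_ratio": "<expr>", "memory_ratio": "<expr>"}


def get_resource_ratios(status, ratios):
    # Generator expression: float() is only called once the result is consumed.
    return (float(value) for key, value in status.items() if key in ratios)


gen = get_resource_ratios(status, ratios)  # no exception raised yet

try:
    max(gen, default=0.0)  # consuming the generator triggers float("error")
    outcome = "ok"
except ValueError as err:
    outcome = str(err)

print(outcome)
```

Because the generator is lazy, a `try`/`except` placed only around its creation would not catch the error; the conversion has to be guarded where the values are consumed, or the values have to be converted eagerly.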

We set the following BatchSystem.ratios for our HTCondor LBS:

    cpu_ratio: Real(TotalSlotCpus-Cpus)/TotalSlotCpus
    memory_ratio: Real(TotalSlotMemory-Memory)/TotalSlotMemory
    cpu_usage: IfThenElse(AverageCPUsUsage=?=undefined, 0, Real(AverageCPUsUsage))

This always seems to happen close to, or at the same time as, the schedd negotiating with the LBS, i.e. when new drones are starting. Presumably, Cpus or Memory may be undefined for brief periods in HTCondor (maybe while the pslot is in the process of being formed?).

Of course, this could be worked around with an ifThenElse in each ratio expression, but catching the error string in TARDIS and retrying later might be more reliable :smile:.
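
A rough sketch of what catching the error string could look like, assuming the ratios are available as a mapping of strings. The helper names `safe_ratios` and `get_allocation`, and the choice to report "no ratios" when a value cannot be parsed, are assumptions for illustration and not necessarily how TARDIS handles it.

```python
from typing import Iterable, List, Mapping


def safe_ratios(status: Mapping[str, str], ratio_keys: Iterable[str]) -> List[float]:
    """Convert the configured ratios to floats; return [] if any value is unparsable."""
    wanted = set(ratio_keys)
    try:
        # Convert eagerly so a bad value (e.g. "error") is caught right here.
        return [float(value) for key, value in status.items() if key in wanted]
    except ValueError:
        # Treat the record as temporarily unavailable; the next update can retry.
        return []


def get_allocation(status: Mapping[str, str], ratio_keys: Iterable[str]) -> float:
    # Same shape as the call in the traceback: the maximum ratio, 0.0 if none.
    return max(safe_ratios(status, ratio_keys), default=0.0)


keys = ["cpu_ratio", "memory_ratio"]
print(get_allocation({"cpu_ratio": "error", "memory_ratio": "0.25"}, keys))
print(get_allocation({"cpu_ratio": "0.5", "memory_ratio": "0.25"}, keys))
```

Falling back to an empty result means a transient error is reported as an allocation of 0.0 for that cycle. Whether that is acceptable, or whether the previous value should be kept until the next successful update, depends on how the drone state machine reacts to a zero allocation.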

giffels commented 3 years ago

@olifre thanks for reporting this. We will handle this in TARDIS.