olifre opened this issue 3 years ago (status: Open)
Hi @olifre, thank you for this detailed report. We have thought about a potential solution, or at least a way to keep the impact as small as possible. We would propose the following.
Instead of removing drones in `BootingState`, we would propose to execute a `condor_drain` first. The `condor_drain` will fail if the resource is still in `BootingState`, and 60 s later a `condor_rm` is executed. If the `condor_status` is updated in the meantime, the drone will be drained again. Still, we do not catch cases where the drone is still in `BootingState` while executing `condor_drain` and the `condor_status` update does not happen before the `condor_rm` is executed (about 1 min. later).
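A minimal sketch of this drain-then-remove fallback, with illustrative names only (the `drain`/`remove` callables stand in for wrappers around `condor_drain`/`condor_rm`; this is not the actual TARDIS API):

```python
import asyncio

async def drain_or_remove(drone, drain, remove, grace=60):
    """Illustrative sketch, not TARDIS code: try a graceful drain first;
    if it fails because the resource is still booting, retry once after
    a grace period, then fall back to a hard removal."""
    try:
        await drain(drone)            # e.g. a wrapper around `condor_drain`
        return "drained"
    except RuntimeError:              # drain rejected: machine not in condor_status yet
        await asyncio.sleep(grace)    # give the collector time to catch up
        try:
            await drain(drone)        # status may have been updated meanwhile
            return "drained"
        except RuntimeError:
            await remove(drone)       # still unknown: hard removal, e.g. `condor_rm`
            return "removed"
```

The remaining race window discussed above corresponds to the second `drain` attempt also failing while the payload has in fact already started.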
In addition, we would like to implement a new feature in COBalD (see https://github.com/MatterMiners/cobald/issues/89). COBalD would then take the age of a drone into account during the process of releasing drones, i.e. COBalD would release the most recently spawned drones first.
Short form:

- Replace `BootingState` -> `CleanupState` by `BootingState` -> `DrainState`.
- Change COBalD to release the most recently spawned drones first.

@olifre, @wiene: What do you think about this proposal?
Pushing the discussion here instead of MatterMiners/cobald#89 since we're closer to the actual use-case here.
We seem to have two opposing age-based decisions to make:
Either one alone seems straightforward, but both at once is tricky. In the case we are facing here, I think gunning for the oldest drones would actually work out, since it means we just never kill a booting drone. Would that work for other cases?
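As a sketch of that oldest-first policy (the drone representation here is an illustrative stand-in, not COBalD's actual interface):

```python
def drones_to_release(drones, count):
    """Illustrative sketch: pick the `count` oldest drones for release,
    so freshly spawned (possibly still booting) drones are never targeted.
    `drones` is assumed to be an iterable of dicts with a `created`
    timestamp -- a made-up data model, not the COBalD/TARDIS API."""
    return sorted(drones, key=lambda drone: drone["created"])[:count]
```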
Hi together,
indeed, the first step (replacing `BootingState` -> `CleanupState` by `BootingState` -> `DrainState`) seems like a good idea to me, and I don't see any downsides; it definitely reduces the probability of a race :+1: .
Of course, @maxfischer2781 is right that for established drones, disabling older drones first seems the best age-based decision. This approach might indeed also work for new drones; at least off the top of my head, I don't see why it should not.
I have a third idea which might be interesting (it came to me only now after reading @giffels' nice explanation). However, I'm not sure if it matches the general structure sufficiently well to not cause a mess:
When `condor_drain` fails (i.e. `drone.batch_system_agent.drain_machine` fails), would it be useful to raise a dedicated exception, catch it in the state-changer, and then call out to `drone.site_agent` to trigger a "deactivation" of the job?
Now the interesting question is what I have in mind for the deactivation. In terms of commands, I'd think about the following: if `condor_drain` fails on the drone of the OBS, mark the drone's job in the LBS, e.g. in HTCondor via `condor_qedit`, in Slurm with `scontrol update JobID=...`. If the actual payload starts at any point and sees this "flag of death", it will either stay idle or terminate itself. So "deactivation" would be an inhibitor for the payload, and also not be harmful if the job actually is already running for some reason. The problem I see here is which property to change to have a portable solution (and keep the payload of the drone generic). Also, it seems to break with the idea of the state changer doing "one thing" in each state change (unless another state is introduced only for this). So I'm not sure if this idea is a good one, just wanted to throw it in here ;-).
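Purely as an illustration of the commands involved (the attribute name `DeactivatedByTardis` and the comment value are made up; which property to use is exactly the open question above):

```console
# If the LBS is HTCondor: mark the queued drone job with a "flag of death"
condor_qedit <cluster>.<proc> DeactivatedByTardis True

# If the LBS is Slurm: same idea via the job's comment field
scontrol update JobID=<jobid> Comment=tardis:deactivated
```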
Hi @olifre,
thanks for sharing your input. I have already thought about it the other way around: putting something in place that avoids starting new payloads on the drone until it has been "integrated" into the OBS. We could probably use the `IntegrationState` for that. For example, we introduce a new ClassAd `DroneReadyToGo = FALSE` and put `DroneReadyToGo` in the START expression:

```
START = $(START) && $(DroneReadyToGo)
```
Later on, we would set the value of `DroneReadyToGo` to `TRUE` via `condor_update_machine_ad` in the `IntegrationState`. However, I do not know how much overhead `condor_update_machine_ad` produces. We will probably lose scalability by going this way.
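For illustration, flipping the flag could look roughly like this (hostname and file name are made up):

```console
# ready.ad contains the updated ClassAd:
#   DroneReadyToGo = True
condor_update_machine_ad -name drone0001.example.org ready.ad
```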
Hi @giffels ,
thinking about this the other way around in fact seems more reasonable than my idea, especially since it eases the abstraction, because you don't need to talk to the LBS. Of course, that "centralizes" the scaling issue. My expectation would be (without elaborate research) that `condor_update_machine_ad` has an overhead similar to `condor_drain`, since it needs to reverse-connect to the `startd` via the central CCB and `shared_port_d`, authenticate, and then reconfigure. Most of that can be overcome with multiple CCBs and collectors, but since it modifies the machine ads, which must be synchronized between collectors, even extra resources may not counteract all of the scaling problem. Furthermore, it may indeed cause noticeable overhead since you'd need to run it for all drones and not just the draining ones.
Thinking about "drone-readiness", that also means that another workaround without a scaling problem (but costing a bit of efficiency) could be to modify the `START` expression this way:

```
START = $(START) && (time() - MY.DaemonStartTime) > 120
```
This should prevent the race, since a drone cannot accept jobs immediately: while we could still `condor_rm` drones which have started in the small time window, they would not yet be running actual job payload.
While that feels a bit like a "hack", costs two minutes of compute time per drone, and of course I did not test it yet, it might be a viable trick.
I think delaying the initial start would just push out the race condition, not fix it. There's still an inherent non-zero delay for any kind of information in the Condor information system (and any other similar system). Unless we put all TARDIS delays well above these inherent delays, we can always run into such issues; and if we do make the delays that large, we put a significant bound on what the system can do.
Going the route of graceful shutdown, e.g. via `condor_drain` before a hard `condor_rm`, seems more robust.
@maxfischer2781 That's a good point, indeed those two minutes might not be sufficient at all, given the collector also caches state, and going to more macroscopic numbers will cause noticeable issues on other ends as you pointed out.
Then I'm out of alternate useful ideas and am all for approaching an age-based gunning to reduce the racing issue (let's hope nobody takes this statement out of context :wink: ).
I also believe that gunning for the oldest drones may work out best (but probably only trying and observing will really tell).
I would propose that we implement the age-based releasing of drones and the replaced state transition, and you, @olifre, give it a try?
- Replace `BootingState` -> `CleanupState` by `BootingState` -> `DrainState`.
- Change COBalD to release the most recently spawned drones first.

Does that sound reasonable?
As for the second point, after some pondering I would go for deriving a FactoryPool specific to TARDIS drones (and putting it in the TARDIS package). That would give us some insight into drone lifetime and allow for use-case specific tweaks. Any objections?
No objections, sounds like a solid approach. Go ahead!
That sounds great, we'll be happy to test :-).
I'd like to motivate a more generic discussion about the concept that is currently implemented in TARDIS. I think the hot fix is a good start to mitigate the issue in the short term, but the underlying issue is still not addressed and can even get worse in the future (patch on top of patch on top of ...).

From my point of view, one of the underlying issues is assuming that the information available to TARDIS, i.e. the information in the database, is correct. Here we are introducing a caching layer on top of HTCondor. This is great for several queries, and I would still go for caching the most recent queries. However, I am currently a bit afraid that we are losing some of the advantages of HTCondor in this specific case due to our additional layer of caching. I must admit that I am not deep enough into the code to propose an actual solution, but I would really be happy if we could hold a small workshop (e.g. with a virtual whiteboard) around this topic to rethink some of our concepts.
Thanks @eileen-kuehn for suggesting a workshop about the concepts currently implemented in TARDIS, an idea that I fully support. In addition, I would like to already share two ideas that would completely solve this issue, before I forget them again. ;-)
To be on the same page: the only affected drone state is the `BootingState`. For all other states, the delay of information is not a problem at all.
The first approach is already mentioned above. A new ClassAd `DroneReadyToGo` is introduced and initially set to `FALSE`. By adding it to the START expression of the drone, no new payload is accepted. Once TARDIS realizes that the drone is running, it transitions from `BootingState` to `IntegratingState`, where we can update the ClassAd `DroneReadyToGo` to `TRUE` by calling `condor_update_machine_ad`. This would cause some overhead, since we have to call it once for every drone in the system, and might lead to reduced scalability.
The second approach is using a `HOOK_PREPARE_JOB`, invoked by the `condor_starter` on the drone itself before executing a payload. This hook can execute a command that reaches out to TARDIS and checks whether the information about the drone is up to date. For example: calling a REST API that returns the current state of the drone from the database. `BootingState` means wait a bit and try again; every other state means go ahead and execute the payload. The overhead introduced should be manageable: even with 100k job slots and an average job duration of 12 hours, we are talking about 2-3 req/s.
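A sketch of such a hook's core loop (illustrative only: `get_state` stands in for the actual HTTP call to the TARDIS REST API, and the state names mirror the description above):

```python
import time

def wait_until_integrated(get_state, poll_interval=30, timeout=1800):
    """Illustrative core of a HOOK_PREPARE_JOB helper (not TARDIS code):
    block the payload until TARDIS no longer reports BootingState.
    `get_state` would query the TARDIS REST API for this drone's state."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if get_state() != "BootingState":
            return True      # hook exits 0: the starter may run the payload
        time.sleep(poll_interval)
    return False             # give up: refuse the payload
```

A real hook script would be wired up via the startd's job-hook configuration and translate this return value into the exit code the `condor_starter` expects.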
Can COBalD@Remote help here?
We observe a timeline like this in an astonishingly large number of cases:

1. TARDIS: `BootingState`, `ResourceStatus.Booting: 1`.
2. The `shadow` of the LBS is born, i.e. the drone actually starts execution.
3. TARDIS: `CleanupState`, while TARDIS still thinks `ResourceStatus.Booting: 1`. Graceful removal of the job is requested via `condor_rm`.
4. At this point, the `starter` of the LBS confirms `ShutdownGraceful all jobs`. The payload is already running at this point, and is killed without any real grace.

Due to signal handling sometimes taking longer, especially with containers, we often have to wait for a bit afterwards until the job actually goes away, and still see this timeline (just for completeness, basically the "race" has already happened):

1. `CleanupState`, `ResourceStatus.Booting: 1`
2. `DrainState`, `ResourceStatus.Running: 2`
3. `DrainingState`, `ResourceStatus.Running: 2`
4. `startd` in LBS: `slot1_17: max vacate time expired. Escalating to a fast shutdown of the job.`
5. `DownState`, `ResourceStatus.Deleted: 4`
I think the main issue is the initial race: a transition from `BootingState` to `CleanupState` can happen while the resource status is outdated, causing a drone with running payload to be killed. This appears to become rather common when many short-lived drones are used and the drone count fluctuates significantly. It causes the job from the OBS to be re-started elsewhere, which some VOs don't like.
I don't have a good suggestion on how to solve this, but one working "trick" (only applicable to HTCondor LBSs) might be to add a constraint to `condor_rm` checking that the resource is still in the expected state, to prevent removal if it is not. However, that can't be copied to other batch systems as-is. So maybe an explicit update of the `ResourceStatus` for the single resource before killing it could work, but that of course adds some overhead.

(pinging also @wiene)
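As an illustration of that trick (the exact ClassAd expression is only a guess at the shape; it would need to match how drone jobs actually appear in the LBS queue):

```console
# Only remove the drone job if it is still idle (JobStatus == 1),
# i.e. it has not started executing yet.
condor_rm -constraint 'ClusterId == <cluster> && JobStatus == 1'
```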