MatterMiners / tardis

Transparent Adaptive Resource Dynamic Integration System
https://cobald-tardis.readthedocs.io
MIT License

Drones sometimes removed after payload has started running (race issue) #172

Open olifre opened 3 years ago

olifre commented 3 years ago

We observe a timeline like this in an astonishingly large number of cases:

| Time | State | Comment / interpretation |
| --- | --- | --- |
| 17:27:06 | BootingState, ResourceStatus.Booting: 1 | |
| 17:27:42 | shadow of the LBS is born, i.e. the drone actually starts execution | Note: TARDIS has not noticed this yet! |
| 17:28:06 | drone is put into CleanupState | TARDIS still thinks ResourceStatus.Booting: 1 |
| 17:28:07 | condor_shadow logs: Requesting graceful removal of job. | TARDIS has issued condor_rm |

At this point, the starter of the LBS confirms "ShutdownGraceful all jobs". The payload is already running and is killed without any real grace. Since signal handling sometimes takes longer, especially with containers, we often have to wait a bit afterwards until the job actually goes away, and still see this timeline (just for completeness; basically, the "race" has already happened):

| Time | State | Comment / interpretation |
| --- | --- | --- |
| 17:29:06 | CleanupState, ResourceStatus.Booting: 1 | |
| 17:30:06 | DrainState, ResourceStatus.Running: 2 | Now the status update seems to have happened, but the deadly bomb is already ticking... |
| 17:31:07 | DrainingState, ResourceStatus.Running: 2 | |
| 17:38:07 | startd in LBS: slot1_17: max vacate time expired. Escalating to a fast shutdown of the job. | Finally, the drone is killed hard, including the payload. |
| 17:40:07 | DownState, ResourceStatus.Deleted: 4 | |

I think the main issue is the initial race: a transition from BootingState to CleanupState can happen while the resource status is outdated, causing a drone with a running payload to be killed. This appears to be rather common when many short-lived drones are used and the drone count fluctuates significantly.

This causes the job from the OBS to be re-started elsewhere, which some VOs don't like.

I don't have a good suggestion on how to solve this, but one working "trick" (applying only to HTCondor LBSs) might be to add a constraint to condor_rm that checks that the resource is still in the expected state, preventing removal if it is not. However, that can't be copied to other batch systems as-is. Alternatively, an explicit update of the ResourceStatus for the single resource right before killing it could work, but that of course adds some overhead.
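
A minimal sketch of what such a constrained removal could look like for an HTCondor LBS (the cluster ID is a placeholder; JobStatus == 1 means the job is still idle, i.e. the drone has not started executing yet):

# remove the drone job only if it is still idle in the LBS,
# i.e. the payload cannot have started yet
condor_rm -constraint 'ClusterId == 1234 && JobStatus == 1'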

(pinging also @wiene )

giffels commented 3 years ago

Hi @olifre, thank you for this detailed report. We have thought about a potential solution, or at least a way to keep the impact as small as possible. We would propose the following.

Instead of removing drones in BootingState, we propose to execute a condor_drain first. The condor_drain will fail if the resource is still in BootingState, and 60 s later condor_rm is executed. If the condor_status is updated in the meantime, the drone will be drained again. Still, we do not catch cases where the drone is still in BootingState while condor_drain is executed and the condor_status update does not happen before condor_rm is executed (about 1 min later).
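
For reference, a hedged sketch of the command sequence this proposal roughly maps to on the HTCondor side (the machine name and job ID are placeholders; TARDIS would issue the equivalent calls through its batch system adapter):

# step 1: ask the startd to drain, letting a running payload finish gracefully
condor_drain drone-worker.example.org
# step 2, about 60 s later, only if draining failed (e.g. because the startd
# is not yet known to the collector): remove the drone job
condor_rm 1234.0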

In addition, we would like to implement a new feature in COBalD (see https://github.com/MatterMiners/cobald/issues/89): COBalD would take the age of a drone into account when releasing drones and would release the most recently spawned drones first.


@olifre, @wiene: What do you think about this proposal?

maxfischer2781 commented 3 years ago

Pushing the discussion here instead of MatterMiners/cobald#89 since we're closer to the actual use-case here.

We seem to have two opposing age-based decisions to make:

  1. For new drones, prefer disabling newer ones since they are less likely to have any (or much) payload
  2. For established drones, prefer disabling older drones since they are more likely to have payload finishing soon

Either one alone seems straightforward, but both at once is tricky. In the case we are facing here, I think gunning for the oldest drones would actually work out, since it means we just never kill a booting drone. Would that work for other cases?

olifre commented 3 years ago

Hi all, indeed, the first step (replacing BootingState -> CleanupState with BootingState -> DrainState) seems like a good idea to me, and I don't see any downsides; it definitely reduces the probability of a race :+1: .

Of course, @maxfischer2781 is right that for established drones, disabling older drones first seems the best age-based decision. This approach might indeed also work for new drones; at least off the top of my head, I don't see why it should not.

I have a third idea which might be interesting (it came to me only now after reading @giffels' nice explanation). However, I'm not sure whether it matches the general structure well enough to not cause a mess: when condor_drain fails (i.e. drone.batch_system_agent.drain_machine fails), would it be useful to raise a dedicated exception, catch it in the state changer, and then call out to drone.site_agent to trigger a "deactivation" of the job?

Now the interesting question is what I have in mind for the deactivation. In terms of commands, I'd think about the following:

  1. condor_drain fails on the drone of the OBS.
  2. In the LBS, the drone is "deactivated" by setting a property of the job. In HTCondor, this could be implemented with condor_qedit, in Slurm with scontrol update JobID= .... If the actual payload starts at any point and sees this "flag of death", it will either stay idle or terminate itself.

So "deactivation" would be an inhibitor for the payload, and also not be harmful if the job actually is already running for some reason. The problem I see here is which property to change to have a portable solution (and keep the payload of the drone generic). Also, it seems to break with the idea of the state changer doing "one thing" in each state change (unless another state is introduced only for this). So I'm not sure if this idea is a good one, just wanted to throw it here ;-).

giffels commented 3 years ago

Hi @olifre,

thanks for sharing your input. I have been thinking about it the other way around: putting something in place that avoids starting new payloads on the drone until it has been "integrated" into the OBS. We could probably use the IntegrationState for that. For example, we introduce a new ClassAd DroneReadyToGo = FALSE and put DroneReadyToGo into the START expression.

START = $(START) && $(DroneReadyToGo) 

Later on, we set the value of DroneReadyToGo to TRUE via condor_update_machine_ad in the IntegrationState. However, I do not know how much overhead condor_update_machine_ad produces; we will probably lose some scalability by going this way.
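
For illustration, a hedged sketch of that update step (the slot name is a placeholder; condor_update_machine_ad reads the new attribute values from a file):

# contents of ready.ad:
#   DroneReadyToGo = TRUE
# push the update to the drone's startd during the IntegrationState:
condor_update_machine_ad -name slot1@drone-worker.example.org ready.ad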

olifre commented 3 years ago

Hi @giffels ,

thinking about this the other way around indeed seems more reasonable than my idea, especially since it eases the abstraction: you don't need to talk to the LBS. Of course, that "centralizes" the scaling issue. My expectation (without elaborate research) would be that condor_update_machine_ad has an overhead similar to condor_drain, since it needs to reverse-connect to the startd via the central CCB and the shared_port daemon, authenticate, and then reconfigure. Most of that can be overcome with multiple CCBs and collectors, but since it modifies the machine ads, which must be synchronized between collectors, even extra resources may not counteract all of the scaling problem. Furthermore, it may indeed cause noticeable overhead since you'd need to run it for all drones and not just the draining ones.

Thinking about "drone readiness", another workaround without a scaling problem (but costing a bit of efficiency) could be to modify the START expression this way:

START = $(START) && (time()-MY.DaemonStartTime)>120

This should prevent the race since a drone cannot accept jobs immediately: while we could still condor_rm drones which started within that small time window, they would not yet be running an actual job payload. While this feels a bit like a "hack", costs two minutes of compute time per drone, and I have of course not tested it yet, it might be a viable trick.

maxfischer2781 commented 3 years ago

I think delaying the initial start would just push out the race condition, not fix it. There is still an inherent non-zero delay for any kind of information in the Condor information system (and any other similar system). Unless we put all TARDIS delays well above these inherent delays, we can always run into such issues; and if we do make the delays that large, we put a significant bound on what the system can do.

Going the route of a graceful shutdown, e.g. via condor_drain before a hard condor_rm, seems more robust.

olifre commented 3 years ago

@maxfischer2781 That's a good point, indeed those two minutes might not be sufficient at all, given that the collector also caches state, and going to much larger delays would cause noticeable issues on other ends, as you pointed out.

Then I'm out of alternate useful ideas and am all for approaching an age-based gunning to reduce the racing issue (let's hope nobody takes this statement out of context :wink: ).

I also believe that gunning for the oldest drones may work out best (but probably only trying and observing will really tell).

giffels commented 3 years ago

I would propose that we implement the age-based releasing of drones and the changed state transition, and you @olifre give it a try?

Does that sound reasonable?

maxfischer2781 commented 3 years ago

As for the second point, after some pondering I would go for deriving a FactoryPool specific to TARDIS drones (and putting it in the TARDIS package). That would give us some insight into drone lifetime and allow for use-case-specific tweaks. Any objections?

giffels commented 3 years ago

No objections, sounds like a solid approach. Go ahead!

olifre commented 3 years ago

That sounds great, we'll be happy to test :-).

eileen-kuehn commented 3 years ago

I'd like to motivate a more generic discussion about the concept that is currently implemented in TARDIS. I think the hot fix is a good start to mitigate the issue in the short term, but the underlying issue is still not addressed and can even get worse in the future (patch on top of patch on top of ...). From my point of view, one of the underlying issues is the assumption that the information available to TARDIS, i.e. the information in the database, is correct. Here we are introducing a caching layer on top of HTCondor. This is great for several queries, and I would still go for caching the most recent queries. However, I am currently a bit afraid that we are losing some of the advantages of HTCondor in this specific case due to our additional layer of caching. I must admit that I am not deep enough into the code to propose an actual solution, but I would really be happy if we could hold a small workshop (e.g. with a virtual whiteboard) around this topic to rethink some of our concepts.

giffels commented 3 years ago

Thanks @eileen-kuehn for suggesting a workshop about the concepts currently implemented in TARDIS, an idea that I fully support. In addition, before I forget them again, I would like to already share two ideas that would completely solve this issue. ;-)

To be on the same page, the only affected drone state is the BootingState. For all other states the delay of information is not a problem at all.

  1. The first approach is already mentioned above. A new ClassAd DroneReadyToGo is introduced and initially set to FALSE. By adding it to the START expression of the drone, no new payload is accepted. Once TARDIS realizes that the drone is running, it transitions from BootingState to IntegratingState, where we can update the ClassAd DroneReadyToGo to TRUE by calling condor_update_machine_ad. This would cause some overhead, since we have to call it once for every drone in the system, and it might lead to reduced scalability.

  2. The second approach is to use a HOOK_PREPARE_JOB invoked by the condor_starter on the drone itself before executing a payload. This hook can execute a command that reaches out to TARDIS and checks whether the information about the drone is up to date, for example by calling a REST API that returns the current state of the drone from the database. BootingState means wait a bit and try again; every other state means go ahead and execute the payload. The overhead introduced should be manageable: even with 100k job slots and an average job duration of 12 hours, we are talking about 2-3 requests per second. (See the sketch after this list.)

  3. Can COBalD@Remote help here?
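
Regarding the second approach, a hedged sketch of what such a prepare-job hook could look like (the hook keyword, script path, drone identifier, and TARDIS REST endpoint are assumptions, not existing interfaces):

# condor_config on the drone: register the prepare-job hook
STARTER_JOB_HOOK_KEYWORD = TARDIS
TARDIS_HOOK_PREPARE_JOB = /usr/local/libexec/tardis_prepare_job.sh

The hook script itself could then be as simple as:

#!/bin/bash
# Block the payload until TARDIS no longer reports BootingState for this drone.
# Fails open (starts the payload) if the hypothetical REST endpoint is unreachable.
DRONE_ID=$(hostname -s)
while true; do
    state=$(curl -sf "https://tardis.example.org/api/drones/${DRONE_ID}/state" || echo unknown)
    [ "$state" != "BootingState" ] && break
    sleep 30
done
exit 0  # exit status 0 tells the condor_starter to go ahead with the payload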