Closed: giffels closed this 9 months ago
All modified lines are covered by tests :white_check_mark:
Comparison is base (886255e) 98.90% compared to head (58fb461) 98.86%.
I am currently not really happy about these changes in the Slurm, Moab and HTCondor site adapters: https://github.com/MatterMiners/tardis/pull/307/commits/f2c9f1b7bbd75d236f73fe964fa863664cf63e63. Previously, the `created` timestamp was updated in the corresponding site adapters and we could simply check `(batch_system_last_status_update - created) < 0` to retry the `resource_status` call in case the asynchronously updated batch system status was last updated before the job was actually submitted. On the other hand, it can possibly end up in a race condition.

BTW: I checked that the delay between the `created` timestamp and the actual job submission is around 0.6 s in most cases, with some outliers at 2 s.

Is having a grace period enough, or should we introduce a further timestamp for that use case? Do you have any opinions on that? @MatterMiners/review
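To make the two options concrete, here is a minimal sketch of the check being discussed. The function and parameter names (`batch_system_last_status_update`, `created`, `submitted`, `GRACE_PERIOD`) are illustrative assumptions, not the actual TARDIS implementation:

```python
# Sketch of the retry check discussed above; names are assumptions.
GRACE_PERIOD = 2.0  # seconds; would cover the observed ~0.6 s delay, outliers at 2 s


def needs_retry(batch_system_last_status_update: float, created: float) -> bool:
    """Original check: retry resource_status if the asynchronously cached
    batch system status was last updated before the created timestamp."""
    return (batch_system_last_status_update - created) < 0


def needs_retry_with_grace(
    batch_system_last_status_update: float,
    submitted: float,
    grace_period: float = GRACE_PERIOD,
) -> bool:
    """Grace-period variant: also retry if the cached status was refreshed
    only shortly after submission, when the job may not yet have been
    visible to the batch system."""
    return batch_system_last_status_update < submitted + grace_period
```

The grace-period variant trades a few redundant `resource_status` retries for robustness against the submission delay; the extra-timestamp variant would require threading a dedicated submission timestamp through the adapters.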
I think the "true" solution is to stop using `asynccachemap` for the entire queue and instead do an `asyncbulkcall` for those jobs we actually care about. This would once and for all fix the issue of outdated data and should still scale well enough.
In return, I think it is totally fine to introduce an additional timestamp as a stopgap solution to keep the `asynccachemap` approach running for now. Just pick a name that makes it clear what the timestamp is there for, so that it doesn't get misused for something else.
Thanks a lot @maxfischer2781, that was the input I was looking for. 👍 Just to get the idea right: `asyncbulkcall` would in principle "block" all `resource_status` calls for a given time period, right?
Yes, `asyncbulkcall` will asynchronously block until a delay has passed or a specific number of calls is reached. Since drones should mostly update in bulk (they are all triggered by the Controller/FactoryPool at once or by an internal timer), I think we can go for a small delay in the order of 0.1 s to 1 s.
@maxfischer2781 and @mschnepf, I think we can continue reviewing this pull request. The code works at least with HoreKa.
The original idea was that the `created` and `updated` timestamps indicate a change of the `DroneState`. However, in the meantime they were also updated in some `SiteAdapters` when the resource status changed, e.g. through a `resource_status` call on certain `SiteAdapters`. Through that, the `drone_minimum_lifetime` setting seems to be ignored, because `resource_status` is called every minute, while `drone_minimum_lifetime` is usually in the order of hours.

- Remove the `created` and `updated` timestamps from the adapters' `resource_attributes` dictionary; this is only done in the `DroneStates`.
- `deploy_resource` and `resource_status` should return only translated dictionaries of values to eventually modify the `resource_attributes` dictionary.
- `stop_resource` and `terminate_resource` should return `None`.
- Use `AsyncBulkCall` for the `HTCondor`, `Moab` and `Slurm` site adapters as discussed below.

Fixes #296