MatterMiners / tardis

Transparent Adaptive Resource Dynamic Integration System
https://cobald-tardis.readthedocs.io
MIT License

Fix updated timestamp #307

Closed giffels closed 9 months ago

giffels commented 11 months ago

The original idea was that the created and updated timestamps indicate a change of the DroneState. However, in the meantime the updated timestamp was also being refreshed in some SiteAdapters whenever the resource status changed, e.g. through a resource_status call on certain SiteAdapters. As a result, the drone_minimum_lifetime setting seems to be ignored, because resource_status is called every minute, while drone_minimum_lifetime is usually on the order of hours.
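A minimal sketch of the effect described above, assuming the lifetime check compares the current time against the updated timestamp. The constants, the function names and the one-minute polling loop are illustrative assumptions, not code taken from tardis.

```python
from datetime import datetime, timedelta

# Hypothetical values: a lifetime of hours versus a status poll every minute.
DRONE_MINIMUM_LIFETIME = timedelta(hours=2)
POLL_INTERVAL = timedelta(minutes=1)


def lifetime_exceeded(updated: datetime, now: datetime) -> bool:
    """True once the drone has been in its current state for long enough."""
    return now - updated > DRONE_MINIMUM_LIFETIME


def lifetime_check_fires(refresh_on_poll: bool, duration: timedelta) -> bool:
    """Poll once per POLL_INTERVAL and report whether the check ever fires."""
    start = datetime(2023, 1, 1)
    updated = start
    now = start
    while now - start < duration:
        now += POLL_INTERVAL
        if lifetime_exceeded(updated, now):
            return True
        if refresh_on_poll:
            # Unintended behaviour: every status poll resets the timestamp,
            # so the elapsed time never grows beyond one poll interval.
            updated = now
    return False
```

With `refresh_on_poll=True` the elapsed time is reset on every poll and the check cannot fire, no matter how long the simulated drone runs. With `refresh_on_poll=False` it fires shortly after the configured lifetime has passed.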

Fixes #296

codecov-commenter commented 11 months ago

Codecov Report

All modified lines are covered by tests :white_check_mark:

Comparison is base (886255e) 98.90% compared to head (58fb461) 98.86%.

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master     #307      +/-   ##
==========================================
- Coverage   98.90%   98.86%   -0.04%
==========================================
  Files          55       55
  Lines        2277     2210      -67
==========================================
- Hits         2252     2185      -67
  Misses         25       25
```

| [Files](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners) | Coverage Δ | |
|---|---|---|
| [tardis/adapters/sites/cloudstack.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2FkYXB0ZXJzL3NpdGVzL2Nsb3Vkc3RhY2sucHk=) | `100.00% <ø> (ø)` | |
| [tardis/adapters/sites/fakesite.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2FkYXB0ZXJzL3NpdGVzL2Zha2VzaXRlLnB5) | `100.00% <100.00%> (ø)` | |
| [tardis/adapters/sites/htcondor.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2FkYXB0ZXJzL3NpdGVzL2h0Y29uZG9yLnB5) | `100.00% <100.00%> (ø)` | |
| [tardis/adapters/sites/kubernetes.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2FkYXB0ZXJzL3NpdGVzL2t1YmVybmV0ZXMucHk=) | `100.00% <100.00%> (ø)` | |
| [tardis/adapters/sites/moab.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2FkYXB0ZXJzL3NpdGVzL21vYWIucHk=) | `100.00% <100.00%> (ø)` | |
| [tardis/adapters/sites/openstack.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2FkYXB0ZXJzL3NpdGVzL29wZW5zdGFjay5weQ==) | `100.00% <100.00%> (ø)` | |
| [tardis/adapters/sites/slurm.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2FkYXB0ZXJzL3NpdGVzL3NsdXJtLnB5) | `100.00% <100.00%> (ø)` | |
| [tardis/interfaces/siteadapter.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL2ludGVyZmFjZXMvc2l0ZWFkYXB0ZXIucHk=) | `100.00% <100.00%> (ø)` | |
| [tardis/resources/drone.py](https://app.codecov.io/gh/MatterMiners/tardis/pull/307?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=MatterMiners#diff-dGFyZGlzL3Jlc291cmNlcy9kcm9uZS5weQ==) | `100.00% <ø> (ø)` | |


giffels commented 11 months ago

I am currently not really happy about these changes in the Slurm, Moab and HTCondor site adapters https://github.com/MatterMiners/tardis/pull/307/commits/f2c9f1b7bbd75d236f73fe964fa863664cf63e63.

Previously, the created timestamp was updated in the corresponding site adapters, and we could simply check (batch_system_last_status_update - created) < 0 to retry the resource_status call in case the asynchronously updated batch system status was last updated before the job was actually submitted.
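For illustration, the check described above could be written as follows. This is a sketch only: the function name is hypothetical, and both arguments are assumed to be `datetime` objects from the same clock.

```python
from datetime import datetime, timedelta


def status_predates_submission(
    batch_system_last_status_update: datetime, created: datetime
) -> bool:
    """True if the cached batch system status is older than the created timestamp.

    In that case the cache cannot know about the job yet, so the
    resource_status call should be retried instead of trusted.
    """
    # A timedelta cannot be compared with the integer 0 in Python 3,
    # hence the comparison against timedelta(0).
    return (batch_system_last_status_update - created) < timedelta(0)
```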

Is having a grace period enough, or should we introduce a further timestamp for that use case? Do you have any opinions on that? @MatterMiners/review

On the other hand it can possibly end up in a race condition.

BTW: I checked that the delay between the created timestamp and the actual job submission is around 0.6 s in most cases, with some outliers at 2 s.
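A grace period along these lines could be sketched as below. The 5 s value and all names are assumptions chosen for illustration; the only input taken from the measurement above is that the period has to exceed the observed outliers of about 2 s.

```python
from datetime import datetime, timedelta

# Hypothetical grace period, chosen to sit above the ~2 s outliers.
GRACE_PERIOD = timedelta(seconds=5)


def needs_status_retry(
    batch_system_last_status_update: datetime,
    created: datetime,
    grace_period: timedelta = GRACE_PERIOD,
) -> bool:
    """True if the cached status may have been taken before the job was submitted.

    The created timestamp is set slightly before the actual submission, so a
    status snapshot taken less than grace_period after created is treated as
    possibly outdated and the resource_status call is retried.
    """
    return (batch_system_last_status_update - created) < grace_period
```

The trade-off is that a status snapshot taken within the grace period is always discarded, even when it already contains the job, which costs one extra polling cycle at most.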

maxfischer2781 commented 11 months ago

> Is having a grace period enough, or should we introduce a further timestamp for that use case? Do you have any opinions on that? https://github.com/orgs/MatterMiners/teams/review

I think the "true" solution is to stop using asynccachemap for the entire queue and instead do an asyncbulkcall for those jobs we actually care about. This would once and for all fix the issue of outdated data and should still scale well enough.

In return, I think it is totally fine to introduce an additional timestamp as a stopgap solution to keep the asynccachemap approach running for now. Just pick a name that makes it clear what the timestamp is there for so that it doesn't get misused for something else.

giffels commented 11 months ago

Thanks a lot @maxfischer2781, that was the input I was looking for. 👍 Just to get the idea right, asyncbulkcall would in principle "block" all resource_status calls for a given time period, right?

maxfischer2781 commented 11 months ago

Yes, asyncbulkcall will asynchronously block until a delay or a specific number of calls is reached. Since drones should mostly update in bulk (since they are all triggered by the Controller/FactoryPool at once or an internal timer) I think we can go for a small delay on the order of 0.1 s to 1 s.
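The batching behaviour described here can be sketched with plain asyncio. This is not the asyncbulkcall utility itself: the class name `BulkCaller`, its parameters and the `fake_resource_status` command are hypothetical stand-ins used only to show the "dispatch on size or delay" idea.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Tuple


class BulkCaller:
    """Collect single calls and answer them with one bulk command (illustrative)."""

    def __init__(
        self,
        command: Callable[[List[str]], Awaitable[Dict[str, str]]],
        size: int,
        delay: float,
    ):
        self._command = command  # async callable: list of keys -> {key: result}
        self._size = size  # dispatch as soon as this many calls are queued
        self._delay = delay  # ... or this many seconds after the first call
        self._pending: List[Tuple[str, asyncio.Future]] = []
        self._timer = None
        self._tasks = set()  # keep references so tasks are not garbage collected

    async def __call__(self, key: str) -> str:
        loop = asyncio.get_running_loop()
        future = loop.create_future()
        self._pending.append((key, future))
        if len(self._pending) >= self._size:
            self._dispatch()  # size limit reached: run the bulk command now
        elif self._timer is None:
            # First call of a new batch: run after the delay at the latest.
            self._timer = loop.call_later(self._delay, self._dispatch)
        return await future  # each caller "blocks" asynchronously here

    def _dispatch(self) -> None:
        if self._timer is not None:
            self._timer.cancel()
            self._timer = None
        batch, self._pending = self._pending, []
        if batch:
            task = asyncio.ensure_future(self._run(batch))
            self._tasks.add(task)
            task.add_done_callback(self._tasks.discard)

    async def _run(self, batch: List[Tuple[str, asyncio.Future]]) -> None:
        try:
            results = await self._command([key for key, _ in batch])
        except Exception as err:
            for _, future in batch:
                if not future.done():
                    future.set_exception(err)
        else:
            for key, future in batch:
                if not future.done():
                    future.set_result(results.get(key))


async def fake_resource_status(keys: List[str]) -> Dict[str, str]:
    """Stand-in for one batch system query covering several jobs."""
    return {key: f"status-of-{key}" for key in keys}
```

Each drone would then await something like `bulk("job-id")`, where `bulk = BulkCaller(fake_resource_status, size=100, delay=0.5)`. Calls that arrive within the same delay window share a single query, so the result is never older than the request that triggered it.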

giffels commented 9 months ago

@maxfischer2781 and @mschnepf, I think we can continue reviewing this request. The code works at least with HoreKa.