Implement ECS task concurrency prevention for registry-sweepers

NASA-PDS / planetary-data-cloud

PDS Cloud Migration documentation, issue, tracking and simple tools for assisting in the PDS hybrid cloud study and migration efforts.

Apache License 2.0


Implement ECS task concurrency prevention for registry-sweepers #105

Open alexdunnjpl opened 2 months ago

alexdunnjpl commented 2 months ago

💡 Description

Currently, if a sweeper executes for longer than its schedule cadence, multiple instances of the sweeper will run concurrently.

This adds cost through redundant processing and slows all jobs by increasing database load; if the database is loaded heavily enough, it could also affect service.

Implement configuration that allows at most one running container instance per task definition (i.e. per node) at any point in time.
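For illustration only, here is a minimal sketch of one way to enforce "at most one instance per task definition" from whatever launches the task, rather than through cluster configuration. It is a check-then-launch guard, not the mechanism chosen in this issue. The function name `launch_if_idle` is hypothetical. The `ecs` argument is assumed to behave like a boto3 ECS client (`boto3.client("ecs")`), of which only `list_tasks` and `run_task` are used. The check and the launch are not atomic, so two launchers firing at the same moment could still overlap.

```python
from typing import Any, Optional


def launch_if_idle(ecs: Any, cluster: str, family: str,
                   **run_task_kwargs: Any) -> Optional[dict]:
    """Start one task for `family` unless one is already active.

    `ecs` is assumed to be a boto3-style ECS client. Returns the
    run_task response, or None when the launch was skipped.
    """
    # desiredStatus="RUNNING" also matches tasks that are still PENDING,
    # so a task that is starting up counts as active.
    active = ecs.list_tasks(cluster=cluster, family=family,
                            desiredStatus="RUNNING")
    if active.get("taskArns"):
        return None  # a previous run is still going; skip this launch
    # Extra keyword arguments (e.g. launchType, networkConfiguration)
    # are passed through unchanged.
    return ecs.run_task(cluster=cluster, taskDefinition=family,
                        count=1, **run_task_kwargs)
```

A scheduled trigger could call this once per node's task definition family instead of calling `run_task` directly, so a long-running sweeper causes the next launch to be skipped rather than stacked.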

@jordanpadams this isn't blocking anything, but the sooner it's done, the shorter we can make our sweepers' cadence, and the performance/cost impact is nontrivial.

jordanpadams commented 2 months ago

@alexdunnjpl when you say "implement configuration", is this an event scheduler configuration?

alexdunnjpl commented 2 months ago

@jordanpadams I'm fuzzy on the details, but I think it requires defining a cluster for each task definition and setting a container limit on each cluster. In short: "do some AWS Console stuff".

@sjoshi-jpl will have a better idea of the details I suspect

jordanpadams commented 2 months ago

Thanks @alexdunnjpl. As a task, this is 100% going to get lost in the 100s of tickets we have open right now. I will try to keep track of this and add to our overall release plan.

alexdunnjpl commented 2 months ago

The need for this should be somewhat mitigated (though not completely avoided) by https://github.com/NASA-PDS/registry-sweepers/pull/115, since now only provenance should result in any redundant work being done.

EDIT: Actually, this is incorrect. There's still a concern of multiple instances tripping over each other when an influx of data pushes container runtime past the cadence period.
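To put a rough number on how many instances can pile up, here is a small worked sketch. It assumes a fixed cadence and a constant runtime per run, which is a simplification, since concurrent runs slow each other down. The function name `max_concurrent` and the sample values are illustrative, not measurements.

```python
def max_concurrent(runtime_min: int, cadence_min: int) -> int:
    """Peak number of overlapping runs when a new run starts every
    `cadence_min` minutes and each run lasts `runtime_min` minutes.

    Runs are treated as half-open intervals [start, start + runtime),
    so the peak is ceil(runtime / cadence).
    """
    if runtime_min <= 0 or cadence_min <= 0:
        raise ValueError("runtime and cadence must be positive")
    # Integer ceiling division avoids floating-point rounding.
    return -(-runtime_min // cadence_min)


# A 25-minute run on a 10-minute cadence: runs started at t=0, 10 and 20
# are all active just after t=20.
print(max_concurrent(25, 10))  # 3
# A run that finishes within the cadence never overlaps itself.
print(max_concurrent(5, 10))   # 1
```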

alexdunnjpl commented 2 months ago

Possibly-related:

https://github.com/NASA-PDS/registry-sweepers/issues/31
https://github.com/NASA-PDS/registry-sweepers/issues/60