adobe / aquarium-fish

Your best secure distributed heterogeneous dynamic compute resource manager for CI

Fish: Deal with ELECTED Applications on restart #65

Open sparshev opened 1 month ago


The OOM from #64 exposed another issue: if something goes wrong during Application allocation, the Application gets stuck in the ELECTED state and is abandoned after a restart. This will most likely not happen in a cluster (multiple nodes watch the election process and will notice when the Application was not ALLOCATED in time), but it should not happen in a single-node configuration either.

Expected Behaviour

On restart, the node should clean up and mark the Application allocation as ERROR, or try to continue the allocation if that is possible.

Actual Behaviour

The Application is not picked up on node startup and stays in the ELECTED state.

Reproduce Scenario (including but not limited to)

This should be relatively easy to reproduce by killing the node in the middle of an allocation and then restarting it.

Steps to Reproduce

  1. Run the Fish node
  2. Try to allocate something
  3. Kill the node during allocation (as hard as possible)
  4. Start the node again and see that the Application is not picked up and stays in the ELECTED state forever

Platform and Version

Logs taken while reproducing problem