apache / openwhisk

Apache OpenWhisk is an open source serverless cloud platform
https://openwhisk.apache.org/
Apache License 2.0
6.56k stars 1.17k forks source link

Send a queue removed message to the queue manager #5391

Closed style95 closed 1 year ago

style95 commented 1 year ago

This is to handle the issue reported in https://github.com/apache/openwhisk/pull/5388 in another way.

Description

Related issue and scope

My changes affect the following components

Types of changes

Checklist:

codecov-commenter commented 1 year ago

Codecov Report

Merging #5391 (c1c8ca1) into master (60ca660) will decrease coverage by 0.27%. The diff coverage is 100.00%.

:exclamation: Current head c1c8ca1 differs from pull request most recent head 3e99442. Consider uploading reports for the commit 3e99442 to get more accurate results

@@            Coverage Diff             @@
##           master    #5391      +/-   ##
==========================================
- Coverage   76.91%   76.64%   -0.27%     
==========================================
  Files         240      240              
  Lines       14588    14595       +7     
  Branches      629      636       +7     
==========================================
- Hits        11221    11187      -34     
- Misses       3367     3408      +41     
Impacted Files Coverage Δ
...ntainerpool/v2/FunctionPullingContainerProxy.scala 78.65% <100.00%> (+0.17%) :arrow_up:
...e/openwhisk/core/scheduler/queue/MemoryQueue.scala 80.99% <100.00%> (-1.42%) :arrow_down:

... and 11 files with indirect coverage changes

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

bdoyle0182 commented 1 year ago

This LGTM we need this as well but I think it's important to have the additional safeguard from the state timeout never occurring in the removing state creating an activation cycle such that the queue is never shut down so will still go ahead with merging that as well if you approve it.

bdoyle0182 commented 1 year ago

The edge case also occurs from this event

    case Event(StateTimeout, _: NoActors) =>
      logging.info(this, s"[$invocationNamespace:$action:$stateName] The queue is timed out, stop the queue.")
      cleanUpDataAndGotoRemoved()

which this change wouldn't cover.

I think unit tests will need to be updated to make sure these changes don't have any side effects where sending queue removed to the parent occurs too early in the removal process

style95 commented 1 year ago

@bdoyle0182 My bad, I added another missing part.

So both cleanUpDataAndGotoRemoved() and cleanUpActorsAndGotoRemoved() should send a queue removed message to the queue manager.

bdoyle0182 commented 1 year ago

Since this will be called from every shutdown case, I just want to make sure that there aren't any new race conditions created. Would it be possible for the queue removed message to be sent twice if it's called from here now and it was sent from another shutdown case such that the queue is removed from the manager map and then a new queue is created and this fsm sends another queue removed message removing it from the parent map after the new queue is created? I'm not sure if that could happen or matters, but want to check

bdoyle0182 commented 1 year ago

Will approve this so long as we're confident sending queue removed to the parent from these locations cannot create a new race condition.

style95 commented 1 year ago

Once the queue manager receives a queue removed message, it will just remove the corresponding reference from queue pool and send an ack back to the sender. If there is any logic to re-add the reference to the queue pool then there might be a race condition, but it always removes the reference and no race condition happens. https://github.com/apache/openwhisk/blob/master/core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/queue/QueueManager.scala#L204

One concern could be multiple acks sent from the queue manager, but it will also be handled by the wildcard case if there is no match for a certain state. https://github.com/apache/openwhisk/blob/master/core/scheduler/src/main/scala/org/apache/openwhisk/core/scheduler/queue/MemoryQueue.scala#L670

bdoyle0182 commented 1 year ago

LGTM