cloudfoundry / diego-release

BOSH Release for Diego
Apache License 2.0
201 stars 212 forks source link

RetireActualLRP is not emitting events #820

Closed vlast3k closed 1 year ago

vlast3k commented 1 year ago

Summary

cf restart-app-instance <app> <index> ends up in calling bbs/controllers/actual_lrp_lifecycle_controller.go/RetireActualLRP - here. It the LRP is Claimed and running, it will not emit events.

As a consequence

The expected outcome of cf restart-app-instace is that:

Steps to Reproduce

Described here https://github.com/vlast3k/dontdie/tree/main

Diego repo

https://github.com/cloudfoundry/bbs/tree/main

Environment Details

diego-release - 2.80.0

Possible Causes or Fixes (optional)

The removeLRP method is only called if the LRP is Unclaimed or Crashed, or in case of errors - here

    removeLRP := func() error {
        err = h.db.RemoveActualLRP(ctx, logger, lrp.ProcessGuid, lrp.Index, &lrp.ActualLRPInstanceKey)
        if err == nil {
            newLRPs = eventCalculator.RecordChange(lrp, nil, lrps)
        }
        return err
    }

    for retryCount := 0; retryCount < models.RetireActualLRPRetryAttempts; retryCount++ {
        switch lrp.State {
        case models.ActualLRPStateUnclaimed, models.ActualLRPStateCrashed:
            err = removeLRP()
        case models.ActualLRPStateClaimed, models.ActualLRPStateRunning:
            cell, err = h.serviceClient.CellById(logger, lrp.CellId)

The change in this draft PR fixes the issue (w/o breaking existing tests) https://github.com/cloudfoundry/bbs/pull/72

mariash commented 1 year ago

@vlast3k I see the point that there might be a delay for the route to be deleted, since the actual LRP needs to be removed and then it will take converger to update BBS to remove actual LRP and emit event for router.

But I think it should be consistent with how we remove the desired LRP when we stop the app. In that case, we don't remove actual LRP, we only emit event. The actual LRP removal from BBS database should be triggered by rep when actual LRP is removed on the cell. Because BBS should rely on data from Rep for the actual state. BBS will try to converge to the desired state if deletion of actual LRP fails for example. Also rep will keep sending StartActualLRP requests if it has the LRP still running, so this removal from db will be unnecessary.

So instead of removeLRP call in the case of Claimed and Running, we should only emit an event to remove actual LRP for the router.

vlast3k commented 1 year ago

Hi @mariash thank you for the prompt feedback! i will change it and update the PR

mariash commented 1 year ago

Closing the issue since PR is merged