FlowFuse / driver-k8s

Kubernetes container driver for FlowFuse
Apache License 2.0

Race on eventually consistent operations of provision/deprovision of Project's k8s resources [e.g. service, ingress] #114

Closed elenaviter closed 1 year ago

elenaviter commented 1 year ago

Current Behavior

Symptoms

After a Project is created/started, it is shown as up and running but is unreachable via the Editor link. For ops this is very obvious because they can see the ingresses and detect the problem directly. It is not obvious from the FlowFuse interface alone, where the project is reported as running and the Editor link is available but never works.

Problems description

There are two problems with the way k8s resources are provisioned and deprovisioned for a Project:

  1. Dependent provisioning operations are done concurrently instead of sequentially.
  2. In the deprovision scenario (e.g. "suspend"), the project's state is changed optimistically, without any guarantee (or handling) of the actual state of the resources it is bound to. This allows clients to change the project's state further while the previous operations are still in progress.

Problems detailed description

  1. In "createProject", it's a race condition between the operations that are almost simultaneously pushed onto concurrent promises list which allows them to execute in any order and priority. Whilst creation of a service depends on creation of deployment, and creation of ingress depends on creation of a service. This can lead to situation when the ingress won't be created in first place because service creation is yet not started or in flight, and k8s API will return, in attempt to create the ingress for service, "Not found" error. E..g instead of dealing with a list of concurrent promises

    return Promise.all(promises).then(async () => { ... })

    it should be something like:

    await this._k8sAppApi.createNamespacedDeployment(...);
    // poll, verify it's there
    await this._k8sApi.createNamespacedService(...);
    // poll, verify it's there
    await this._k8sNetApi.createNamespacedIngress(...);
    // poll, verify it's there
    // Done with resources
    this._app.log.debug(`[k8s] Container ${project.id} started`)
    project.state = 'running'
    await project.save()
    this._projects[project.id].state = 'starting'
  2. In "suspend" project handler, the operations of deprovisioning service and ingress are not guaranteed to be accomplished as a result of the handler execution (however the project's state is moved to "suspended" which unlocks consequential state change actions on it, by the clients). This problem is introduced by this commit which aims to deprovision the networking resources associated with the project when the project is suspended. k8s operations that provision/deprovision/modify resource (e.g. 'delete service', 'delete ingress') are "promises" for the client of this API. They just send the request onto queue of requests for the relevant k8s service. Thus in order to guarantee the operation was completed, corresponding resources should be polled in order to ensure they are left in desired state. This polling is already done in "suspend" handler for deployments but not for service and ingress resources.

An additional approach when working concurrently with k8s resources is Kubernetes "resource versioning", so that the originator of a change knows whether the change it made was actually applied (a 'no-lock write attempt').
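
For illustration, a rough sketch of such an optimistic write with @kubernetes/client-node (not driver code; the service name, namespace and label are made up): the object is read, modified and written back with its resourceVersion intact, and a 409 Conflict from the API server means someone else changed it in between.

    const { body: svc } = await this._k8sApi.readNamespacedService(serviceName, namespace)
    svc.metadata.labels = { ...svc.metadata.labels, 'example.com/touched-by': 'driver-k8s' }
    try {
        // The replace keeps svc.metadata.resourceVersion, so the API server
        // rejects the write with 409 Conflict if the object changed meanwhile.
        await this._k8sApi.replaceNamespacedService(serviceName, namespace, svc)
    } catch (err) {
        if (err.statusCode === 409 || err.response?.statusCode === 409) {
            // Stale resourceVersion: re-read and retry, or surface the conflict.
        } else {
            throw err
        }
    }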

Expected Behavior

Steps To Reproduce

This is especially easy to reproduce when you change the stack for a project. As of recently, changing the stack automatically suspends and then starts the project.

The above steps lead to the Project being unreachable via its ingress 100% of the time after an attempt to change the stack.

To recover, one has to suspend the instance, wait a bit, and then start it again.

Environment

k8s setup

hardillb commented 1 year ago

Thanks for raising this. I'll have a look.

I've not seen this on my local deployment or on FlowFuse Cloud. I assume you are running on a K8s cluster with a lot more going on which is causing slight delays in provisioning artefacts?

elenaviter commented 1 year ago

> Thanks for raising this. I'll have a look.
>
> I've not seen this on my local deployment or on FlowFuse Cloud. I assume you are running on a K8s cluster with a lot more going on which is causing slight delays in provisioning artefacts?

True about the delays. Actually, I have 5 clusters now; they are very different in their nature and scale. The problem is 100% reproducible on each of them. It's a good case to cover.

hardillb commented 1 year ago

@elenaviter can you have a look at the linked PR? I think that should fix the problem, but as I can't reproduce the issue it's hard to test. If you can build locally and test, that would be even better.

Thanks.

elenaviter commented 1 year ago

Hey @hardillb

I hope this message finds you well. My apologies for the delayed response due to a recent illness. I'm now back and have had the opportunity to review the PR in question.

Observations on the PR

Upon review and subsequent adjustments to the PR, the initial issue - "after suspension, the project cannot be started normally" - has been resolved. However, a subsequent challenge has emerged, related to the discrepancy between refreshing the auth tokens for the Project and the actual application of these new tokens in the Project's deployment.

Identified Issue: Token Refresh and Deployment Discrepancy

The refreshed tokens are optimistically saved to the database during the assembly of metadata for project deployment within the createDeployment() routine:

const authTokens = await project.refreshAuthTokens()
...
await project.save()

However, the deployment, where these tokens are intended to be applied, does not necessarily succeed: the attempt to create the k8s deployment may fail if, for some reason, a deployment already exists.
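
For illustration, a hedged sketch (deploymentName and deploymentSpec are made-up variables, not driver code) of how that failure surfaces with @kubernetes/client-node and one possible way to handle it: creating a deployment that already exists is rejected with a 409 error, at which point the caller can replace the existing object so the new tokens actually reach it, or bail out:

    try {
        await this._k8sAppApi.createNamespacedDeployment(namespace, deploymentSpec)
    } catch (err) {
        if (err.statusCode === 409 || err.response?.statusCode === 409) {
            // Deployment already exists; replacing it is one way to make the
            // freshly issued tokens reach the pod instead of failing silently.
            await this._k8sAppApi.replaceNamespacedDeployment(deploymentName, namespace, deploymentSpec)
        } else {
            throw err
        }
    }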

The provisioning of the k8s deployment will reliably fail during an FF upgrade if running projects exist, due to the following sequence:

Consequently:

Existing deployments continue running with the old, now-unusable auth tokens. This prevents FF from communicating with the Editor using the new auth tokens and client id, and prevents the launcher (if needed) from communicating with FF, since it still has the old auth token from its deployment.
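
One way to avoid ending up with persisted-but-unapplied tokens, sketched here under the assumption that project.save() is what commits the refreshed tokens (this is not the fix in the upcoming PR, and buildDeploymentSpec is a made-up helper standing in for the existing spec assembly): only save after the k8s API has accepted the deployment.

    const authTokens = await project.refreshAuthTokens()
    const deploymentSpec = buildDeploymentSpec(project, authTokens) // hypothetical helper
    await this._k8sAppApi.createNamespacedDeployment(namespace, deploymentSpec)
    // poll, verify the deployment is there (same pattern as in the sketches above)
    await project.save() // persist only once the deployment actually carries the new tokens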

I will make a PR to your branch that solves both issues.