elenaviter closed this issue 1 year ago
Thanks for raising this. I'll have a look.
I've not seen this on my local deployment or on FlowFuse Cloud. I assume you are running on a K8s cluster with a lot more going on which is causing slight delays in provisioning artefacts?
True about delays. I actually have 5 clusters now; they are very different in nature and scale, and the problem is 100% reproducible on each of them. It's a good case to cover.
@elenaviter can you have a look at the linked PR, I think that should fix the problem, but as I can't reproduce the issue it's hard to test. If you can build locally and test that would be even better.
Thanks.
Hey @hardillb
I hope this message finds you well. My apologies for the delayed response due to a recent illness. I'm now back and have had the opportunity to review the PR in question.
Upon review and subsequent adjustments to the PR, the initial issue - "after suspension, the project cannot be started normally" - has been resolved. However, a subsequent challenge has emerged related to the discrepancy between refreshing auth tokens for the Project and the actual application of these new tokens in Project deployment.
The refreshed tokens are optimistically saved to the database during the assembly of metadata for project deployment within the `createDeployment()` routine:

```js
const authTokens = await project.refreshAuthTokens()
// ...
await project.save()
```
However, the deployment where these tokens are intended to be applied does not necessarily succeed. The attempt to create a k8s deployment may fail if, for some reason, the deployment already exists.

Provisioning the k8s deployment will reliably fail during an FF upgrade if running projects exist, due to the following sequence: there is a bug whereby deployments were looked up in the wrong namespace, yielding a "project does not exist" result. This in turn triggers
createProject
execution for all projects registered in FF (except suspended ones), invalidating the auth tokens for those projects without reapplying the refreshed tokens to their deployments.
Existing deployments keep running with the old, now-unusable auth tokens. As a result, FF cannot communicate with the Editor using the new auth token and client id, and the launcher (where needed) cannot communicate with FF, since it still carries the old auth token in its deployment.
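One way to avoid stranding pods with dead tokens is to defer persisting the refreshed tokens until the deployment has actually been applied. This is a minimal sketch, not the actual driver code; `deployWithFreshTokens`, `project`, and `applyDeployment` are hypothetical stand-ins for the real FlowFuse internals:

```javascript
// Sketch: refresh tokens, apply the k8s deployment that consumes them,
// and only persist the new tokens once the deployment succeeded.
async function deployWithFreshTokens (project, applyDeployment) {
    const authTokens = await project.refreshAuthTokens()
    try {
        // Apply the k8s deployment that embeds the new tokens first...
        await applyDeployment(authTokens)
        // ...and only then save them, so a failed deploy does not leave
        // running pods holding tokens the database no longer recognises.
        await project.save()
        return true
    } catch (err) {
        // Deployment failed: the caller can retry without having
        // invalidated the tokens the existing pods still use.
        return false
    }
}
```

The exact rollback semantics depend on how `refreshAuthTokens()` interacts with the database, so treat this only as an illustration of the ordering problem.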
I will make a PR to your branch that solves both issues.
Current Behavior
Symptoms
After the Project is created/started, it is indicated as up and running but is unreachable via the Editor link. For ops the problem is very obvious because they can see the ingresses and detect it immediately. But it is not obvious from the FlowFuse interface alone, where the project is shown as running and the Editor link is available but never works.
Description of the problems
There are 2 problems with the way k8s resources are provisioned and deprovisioned for the Project:
Detailed description of the problems
In "createProject" there is a race condition between operations that are pushed almost simultaneously onto a list of concurrent promises, which allows them to execute in any order. However, creation of the
service
depends on creation of the deployment
, and creation of the ingress
depends on creation of the service
. This can lead to a situation where the ingress is never created, because the service creation has not yet started or is still in flight, and the k8s API returns a "Not found" error when asked to create the ingress for that service. Instead of dealing with a list of concurrent promises, the operations should be awaited in dependency order.
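The ordering fix can be sketched as follows. `provisionProject` and the three `create*` wrappers are hypothetical names standing in for the driver's k8s client calls; the point is only that each dependency is awaited before its dependent:

```javascript
// Sketch: provision in dependency order instead of Promise.all(...),
// so each resource exists before the resource that targets it.
async function provisionProject (k8s, project) {
    // The service selects pods from this deployment, so create it first
    await k8s.createDeployment(project)
    // The ingress routes to this service, so it must exist next
    await k8s.createService(project)
    await k8s.createIngress(project)
}
```

The deployment/service/ingress creations are cheap API calls, so serialising them costs little while removing the "Not found" window entirely.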
In the "suspend" project handler, the deprovisioning of the service and ingress is not guaranteed to have completed by the time the handler finishes (yet the project's state is moved to "suspended", which unlocks subsequent state-change actions on it by clients). This problem was introduced by this commit, which aims to deprovision the networking resources associated with the project when it is suspended. k8s operations that provision/deprovision/modify a resource (e.g. 'delete service', 'delete ingress') are only "promises" for the API client: they merely enqueue a request with the relevant k8s service. To guarantee that an operation has completed, the corresponding resource must be polled until it reaches the desired state. This polling is already done in the "suspend" handler for deployments, but not for the service and ingress resources.
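The missing polling could mirror what "suspend" already does for deployments. This is a hedged sketch, not the driver's actual code: `deleteResource` and `readResource` are hypothetical callbacks wrapping the k8s client calls, and the `err.statusCode === 404` check assumes the error shape commonly thrown by `@kubernetes/client-node`:

```javascript
// Sketch: request deletion, then poll the read endpoint until the
// resource returns 404, i.e. it is actually gone.
async function deleteAndWait (deleteResource, readResource, { retries = 30, intervalMs = 1000 } = {}) {
    await deleteResource()
    for (let i = 0; i < retries; i++) {
        try {
            await readResource() // still exists, keep waiting
        } catch (err) {
            if (err.statusCode === 404) {
                return true // resource is gone, safe to proceed
            }
            throw err // unexpected API error, surface it
        }
        await new Promise(resolve => setTimeout(resolve, intervalMs))
    }
    return false // resource still present after the timeout
}
```

Only after both the service and the ingress have been confirmed deleted should the project's state be flipped to "suspended", so that a subsequent "start" cannot race against half-removed networking resources.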
Expected Behavior
Steps To Reproduce
This is especially simple to reproduce by trying to
change the stack for the project
. As of recently, changing the stack automatically suspends and then starts the project. The steps above reliably (100%) leave the Project unreachable via its ingress after the attempt to change the stack.
To recover, one has to suspend the instance, wait a bit, and then start it again.
Environment
k8s setup